    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Englert et al. proposed a functional connectome-based Hopfield artificial neural network (fcHNN) architecture to reveal attractor states and activity flows across various conditions, including resting state, task-evoked, and pathological conditions. The fcHNN can reconstruct characteristics of resting-state and task-evoked brain activities. Additionally, the fcHNN demonstrates differences in attractor states between individuals with autism and typically developing individuals.

      Strengths:

      (1) The study used seven datasets, which somewhat ensures robust replication and validation of generalization across various conditions.

      (2) The proposed fcHNN improves upon existing activity flow models by mimicking artificial neural networks, thereby enhancing the representational ability of the model. This advancement enables the model to more accurately reconstruct the dynamic characteristics of brain activity.

      (3) The fcHNN projection offers an interesting visualization, allowing researchers to observe attractor states and activity flow patterns directly.

      We are grateful to the reviewer for highlighting the robustness of our findings across multiple datasets and for appreciating the novelty and representational advantages of our fcHNN model (which has been renamed to fcANN in the revised manuscript).

      Weaknesses:

      (1) The fcHNN projection can offer low-dimensional dynamic visualizations, but its interpretability is limited, making it difficult to make strong claims based on these projections. The interpretability should be enhanced in the results and discussion.

      We thank the reviewer for these important points. We agree that the interpretability of the low-dimensional projection is limited. In the revised manuscript, we have reframed the fcANN projection primarily as a visualization tool (see e.g. line 359) and moved the corresponding part of Figure 2 to the Supplementary Material (Supplementary Figure 2). We have also implemented a substantial revision of the manuscript, which now directly links our analysis to the novel theoretical framework of self-orthogonalizing attractor networks (Spisak & Friston, 2025), opening several new avenues in terms of interpretation and shedding light on the computational principles underlying attractor dynamics in the brain (see the revised introduction and the new section “Theoretical background”, starting at line 128, as well as Mathematical Appendices 1-2 in the Supplementary Material for a comprehensive formal derivation). As part of these efforts, we now provide evidence that the brain’s functional organization approximates a special, computationally efficient class of attractor networks, the so-called Kanter-Sompolinsky projector network (Figure 2A-C, line 346, see also our answer to your next comment). This is exactly what the theoretical framework of free-energy-minimizing attractor networks predicts.

      (2) The presentation of results is not clear enough, including figures, wording, and statistical analysis, which contributes to the overall difficulty in understanding the manuscript. This lack of clarity in presenting key findings can obscure the insights that the study aims to convey, making it challenging for readers to fully grasp the implications and significance of the research.

      We have thoroughly revised the manuscript for clarity in wording and figures (see e.g. lines 257, 482, 529 in the Results and lines 1128, 1266, 1300, 1367 in the Methods). We carefully improved statistical reporting and ensured that we always report test statistics and effect sizes and clearly refer to the null modelling approach used (e.g. lines 461, 542, 550, 565, 573, 619, as well as Figures 2-4). As absolute effect sizes do not have a straightforward interpretation in many analyses, we provide Glass’ Δ as a standardized effect size measure, expressing the distance of the true observation from the null distribution as a ratio of the null standard deviation. To further improve clarity, we now clearly define our research questions and the corresponding analyses and null models in the revised manuscript, both in the main text and in two new tables (Tables 1 and 2). We denote research questions and null models with Q1-7 and NM1-5, respectively, and refer to them at multiple points when detailing the analyses and the results.

      Reviewer #2 (Public Review):

      Summary:

      Englert et al. use a novel modelling approach called functional connectome-based Hopfield Neural Networks (fcHNN) to describe spontaneous and task-evoked brain activity and the alterations in brain disorders. Given its novelty, the authors first validate the model parameters (the temperature and noise) with empirical resting-state function data and against null models. Through the optimisation of the temperature parameter, they first show that the optimal number of attractor states is four before fixing the optimal noise that best reflects the empirical data, through stochastic relaxation. Then, they demonstrate how these fcHNN-generated dynamics predict task-based functional activity relating to pain and self-regulation. To do so, they characterise the different brain states (here as different conditions of the experimental pain paradigm) in terms of the distribution of the data on the fcHNN projections and flow analysis. Lastly, a similar analysis was performed on a population with autism condition. Through Hopfield modeling, this work proposes a comprehensive framework that links various types of functional activity under a unified interpretation with high predictive validity.

      Strengths:

      The phenomenological nature of the Hopfield model and its validation across multiple datasets presents a comprehensive and intuitive framework for the analysis of functional activity. The results presented in this work further motivate the study of phenomenological models as an adequate mechanistic characterisation of large-scale brain activity.

      Following up on Cole et al. 2016, the authors put forward a hypothesis that many of the changes to the brain activity, here, in terms of task-evoked and clinical data, can be inferred from the resting-state brain data alone. This brings together neatly the idea of different facets of brain activity emerging from a common space of functional (ghost) attractors.

      The use of the null models motivates the benefit of non-linear dynamics in the context of phenomenological models when assessing the similarity to the real empirical data.

      We thank the reviewer for recognizing the comprehensive and intuitive nature of our framework and for acknowledging the strength of our hypothesis that diverse brain activity facets emerge from a common resting state attractor landscape.

      Weaknesses:

      While the use of the Hopfield model is neat and very well presented, it still begs the question of why to use the functional connectome (as derived by activity flow analysis from Cole et al. 2016). Deriving the functional connectome on the resting-state data that are then used for the analysis reads as circular.

      We agree that starting from functional couplings to study dynamics is in stark contrast with the common practice of estimating the interregional couplings based on structural connectome data. We now explicitly discuss how this affects the scope of the questions we can address with the approach, with explicit notes on the inability of this approach to study the structure-function coupling and its limitations in deriving mechanistic insights at the level of biophysical implementation.

      Line 894:

      “The proposed approach is not without limitations. First, the proposed approach does not incorporate information about anatomical connectivity and does not explicitly model biophysical details. Thus, in its present form, the model is not suitable to study the structure-function coupling and cannot yield mechanistic explanations underlying (altered) polysynaptic connections, at the level of biophysical details.”

      We are confident, however, that our approach is not circular. At the high level, our approach can be considered as a function-to-function generative model, with twofold aims.

      First, we link large-scale brain dynamics to theoretical artificial neural network models and show that the functional connectome displays characteristics that render it an exceptionally “well-behaving” attractor network (e.g. superior convergence properties, as contrasted against appropriate respective null models). In the revised manuscript, we have significantly improved upon this aspect by explicitly linking the fcANN model to the theoretical framework of self-orthogonalizing attractor networks (Spisak & Friston, 2025) (see the revised introduction and the new section “Theoretical background”, starting at line 128, as well as Mathematical Appendices 1-2 in the Supplementary Material for a comprehensive formal derivation). As part of these efforts, we now provide evidence that the brain’s functional organization approximates a special, computationally efficient class of attractor networks, the so-called Kanter-Sompolinsky projector network (Figure 2A-C, line 346, see also our answer to your next comment). This is exactly what the theoretical framework of free-energy-minimizing attractor networks predicts. This result is not circular, as the empirical model does not use the key mechanism (the Hebbian/anti-Hebbian learning rule) that induces self-orthogonalization in the theoretical framework. We clarify this in the revised manuscript, e.g. in line 736.

      Second, we benchmark the ability of the proposed function-to-function generative model to predict unseen data (new datasets) or data characteristics that are not directly encompassed in the connectivity matrix (e.g. non-Gaussian conditional dependencies, temporal autocorrelation, dynamical responses to perturbations of the system). These benchmarks are constructed against well-defined null models, which provide reasonable references. We have now significantly improved the discussion of these null models in the revised manuscript (Tables 1 and 2, line 257). We not only show that our model, when reconstructing resting-state dynamics, can generalize to unseen data over and beyond what is possible with the baseline descriptive measures (e.g. covariance measures and PCA), but also demonstrate the ability of the framework to reconstruct the effects of perturbations on these dynamics (such as task-evoked changes), based solely on the resting-state data from another sample.

      If the fcHNN derives the basins of four attractors that reflect the first two principal components of functional connectivity, it perhaps suffices to use the empirically derived components alone and project the task and clinical data on it without the need for the fcHNN framework.

      We are thankful to the reviewer for highlighting this important point, which encouraged us to develop a detailed understanding of the origins of the close alignment between attractors and principal components (eigenvectors of the coupling matrix) and the corresponding (approximate) orthogonality. Here, we would like to emphasize that the attractor-eigenvector correspondence is by no means a general feature of any arbitrary attractor network. In fact, networks with this property form a very special class of attractor neural networks (the so-called Kanter-Sompolinsky projector neural networks (Kanter & Sompolinsky, 1987)), with a high degree of computational efficiency, maximal memory capacity and perfect memory recall. It has been rigorously shown that in such networks, the eigenvectors of the coupling matrix (i.e. PCA on the timeseries data) and the attractors become equivalent (Kanter & Sompolinsky, 1987). This in turn made us ask: what are the learning and plasticity rules that drive attractor networks towards developing approximately orthogonal attractors? We found that this is a general tendency of networks obeying the free energy principle (Figure 2A-C, line 346, see also our answer to your next comment). The formal derivation of this framework is now presented in an accompanying theoretical piece (Spisak & Friston, 2025). In the revised manuscript, we provide a short, high-level overview of these results (in the Introduction from line 55 and in the new section “Theoretical background”, line 128, as well as Mathematical Appendices 1-2 in the Supplementary Material for a comprehensive formal derivation). According to this new theoretical model, attractor states can be understood as a set of priors (in the Bayesian sense) that together constitute an optimal orthogonal basis, equipping the update process (which is akin to Markov-chain Monte Carlo sampling) to find posteriors that generalize effectively within the spanned subspace.
      Thus, in sum, understanding brain function in terms of attractor dynamics, instead of PCA-like descriptive projections, provides important links towards a Bayesian interpretation of brain activity. At the same time, the eigenvector-attractor correspondence also explains why descriptive decomposition approaches like PCA or ICA are so effective at capturing the dynamics of the system in the first place.
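
The attractor-eigenvector equivalence described above can be illustrated with a minimal sketch. It assumes the standard pseudo-inverse (projector) construction of the coupling matrix, W = X X⁺, which is commonly associated with Kanter & Sompolinsky (1987); the patterns, the function name `projector_couplings` and the network size are hypothetical and are not taken from the manuscript.

```python
import numpy as np

def projector_couplings(patterns):
    """Projector (pseudo-inverse) couplings for patterns stored as columns."""
    # W = X X^+ is the orthogonal projector onto the span of the patterns.
    return patterns @ np.linalg.pinv(patterns)

# Three hypothetical, mutually correlated +/-1 patterns over 8 units.
X = np.array([
    [ 1,  1,  1],
    [ 1, -1,  1],
    [-1,  1,  1],
    [-1, -1, -1],
    [ 1,  1, -1],
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1, -1],
], dtype=float)

W = projector_couplings(X)

# Each stored pattern is mapped onto itself, so it is both an eigenvector
# of W (eigenvalue 1) and a fixed point of a sign-based update rule.
recalled = np.sign(W @ X)

# Symmetrize before the eigendecomposition to suppress round-off asymmetry.
eigenvalues = np.linalg.eigvalsh((W + W.T) / 2)
```

In this construction the eigenvalues of W are only 0 or 1, and the eigenvalue-1 subspace is exactly the span of the stored patterns, which is why a PCA-like decomposition of the couplings recovers the attractor subspace in this idealized case.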

      As presented here, the Hopfield model is excellent in its simplicity and power, and it seems suited to tackle the structure-function relationship with the power of going further to explain task-evoked and clinical data. The work could be strengthened if that was taken into consideration. As such the model would not suffer from circularity problems and it would be possible to claim its mechanistic properties. Furthermore, as mentioned above, in the current setup, the connectivity matrix is based on statistical properties of functional activity amongst regions, and as such it is difficult to talk about a certain mechanism. This contention has for example been addressed in the Cole et al. 2016 paper with the use of a biophysical model linking structure and function, thus strengthening the mechanistic claim of the work.

      We agree that investigating how the structural connectome constrains macro-scale dynamics is a crucial next step. Linking our results with the theoretical framework of self-orthogonalizing attractor networks provides a principled approach to this, as the “self-orthogonalizing” learning rule in the accompanying theoretical work provides the opportunity to fit attractor networks with structural constraints to functional data, shedding light on the plastic processes which maintain the observed approximate orthogonality even in the presence of these structural constraints. We have revised the manuscript to clarify that our phenomenological approach is inherently limited in its ability to answer mechanistic questions at the level of biophysical details (line 894) and discuss this promising direction as follows:

      Line 803:

      “A promising application of this is to consider structural brain connectivity (as measured by diffusion MRI) as a sparsity constraint for the coupling weights and then train the fcANN model to match the observed resting-state brain dynamics. If the resulting structural-functional ANN model is able to closely match the observed functional brain substate dynamics, it can be used as a novel approach to quantify and understand the structural functional coupling in the brain”.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) The statistical analyses are poorly described throughout the manuscript. The authors should provide more details on the statistical methods used for each comparison, as well as the corresponding statistics and degrees of freedom, rather than solely reporting p-values.

      We thank the reviewer for pointing this out. We have revised the manuscript to include the specific test statistics, precise p-values and raw effect sizes for all reported analyses to ensure full transparency and replicability, see e.g. lines 461, 542, 550, 565, 573, 619, as well as Figures 2-4. Additionally, as absolute effect sizes - in many analyses - do not have a straightforward interpretation, we provided Glass’ Δ, as a standardized effect size measure, expressing the distance of the true observation from the null distribution as a ratio of the null standard deviation.
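
The standardized effect size described above can be written as a short sketch: the observed statistic minus the mean of the null distribution, divided by the null standard deviation. The function name `glass_delta`, the use of the sample standard deviation (`ddof=1`) and the numbers are assumptions made for illustration only.

```python
import numpy as np

def glass_delta(observed, null_values):
    """Distance of the observed statistic from the null mean, in null SD units."""
    null_values = np.asarray(null_values, dtype=float)
    # ddof=1 (sample standard deviation) is an assumption of this sketch.
    return (observed - null_values.mean()) / null_values.std(ddof=1)

# Hypothetical null distribution and observed value (not data from the study).
null = np.array([0.10, 0.12, 0.08, 0.11, 0.09])
delta = glass_delta(0.20, null)
```

Because only the null distribution contributes to the denominator, the measure remains comparable across analyses whose raw statistics live on different scales.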

      We have also improved the description of the statistical methods used in the manuscript (lines 1270, 1306, 1339, 1367, 1404) and added two overview tables (Tables 1 and 2) that summarize the methodological approaches and the corresponding null models.

      Furthermore, we have fully revised the analysis corresponding to noise optimization. We retained only null model 2 (covariance-matched Gaussian) in the main text and in Figure 3, and moved null model 1 (spatial phase randomization) to the Supplementary Material (Supplementary Figure 6), as it is less appropriate for this analysis (trivially significant in all cases). Furthermore, as the test statistic, we now use the Wasserstein distance between the 122-dimensional empirical and simulated data (instead of focusing on the 2-dimensional projection). This analysis now directly quantifies the capacity of the fcANN model to capture non-Gaussian conditionals in the data.
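
A rough sketch of the logic of this comparison is given below. It draws a covariance-matched Gaussian surrogate and measures its distance to the data. Note that it substitutes a sliced (random-projection) Wasserstein-1 distance for the full multivariate Wasserstein distance, purely to stay within NumPy; the function names, the number of projections and the toy data are all assumptions and do not reproduce the authors' implementation.

```python
import numpy as np

def gaussian_surrogate(data, rng):
    """Gaussian sample matching the mean and covariance of `data` (rows = samples)."""
    mean = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=data.shape[0])

def sliced_wasserstein(x, y, rng, n_projections=200):
    """Average 1-D Wasserstein-1 distance over random unit directions.

    This is an approximation swapped in for the full multivariate distance.
    It requires x and y to have the same shape.
    """
    if x.shape != y.shape:
        raise ValueError("x and y must have the same shape")
    directions = rng.normal(size=(n_projections, x.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # For equal sample sizes, 1-D W1 is the mean absolute difference
    # between the sorted projections.
    px = np.sort(x @ directions.T, axis=0)
    py = np.sort(y @ directions.T, axis=0)
    return float(np.mean(np.abs(px - py)))

rng = np.random.default_rng(0)
# Hypothetical non-Gaussian stand-in for empirical data (500 samples, 5 regions).
empirical = rng.exponential(size=(500, 5))
surrogate = gaussian_surrogate(empirical, rng)

d_self = sliced_wasserstein(empirical, empirical, rng)
d_null = sliced_wasserstein(empirical, surrogate, rng)
```

In such a setup, a model that captures non-Gaussian structure would be expected to yield a smaller distance to the empirical data than the covariance-matched Gaussian does.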

      (2) The convergence procedure is not clearly explained in the manuscript. Is this an optimization procedure to minimize energy? If so, the authors should provide more details about the optimizer used.

      We apologize for the lack of clarity. The convergence is not an optimization procedure per se, in the sense that it does not involve any external optimizer. It is simply the repeated (deterministic) application of the same update rule known from Hopfield networks or Boltzmann machines. However, as detailed in the accompanying theoretical paper, this update rule (or inference rule) inherently solves an optimization problem: it performs gradient descent on the free energy landscape of the network. As such, it is guaranteed to converge to a local free energy minimum in the deterministic case. We have clarified this process in the Results and Methods sections as follows:

      Line 161:

      “Inference arises from minimizing free energy with respect to the states \sigma. For a single unit, this yields a local update rule homologous to the relaxation dynamics in Hopfield networks”.

      Line 181:

      “In the basis framework (Spisak & Friston, 2025), inference is a gradient descent on the variational free energy landscape with respect to the states σ and can be interpreted as a form of approximate Bayesian inference, where the expected value of the state σ<sub>i</sub> is interpreted as the posterior mean given the attractor states currently encoded in the network (serving as a macro-scale prior) and the previous state, including external inputs (serving as likelihood in the Bayesian sense)”.

      Line 1252:

      “As the inference rule was derived as a gradient descent on free energy, iterations monotonically decrease the free energy function and therefore converge to a local free‑energy minimum without any external optimizer. Thus, convergence does not require any optimization procedure with an external optimizer. Instead, it arises as the fixed point of repeated local inference updates, which implement gradient descent on free energy in the deterministic symmetric case.”
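
The idea that convergence is just the repeated application of one local update rule, with no external optimizer, can be sketched as follows. The synchronous tanh update, the single-pattern Hebbian couplings and the value β = 4 are assumptions chosen so that the toy example converges quickly; they are not the manuscript's settings. The quadratic Hopfield energy is included only as a rough diagnostic and is not the free energy functional discussed in the response.

```python
import numpy as np

def relax(W, state, beta=4.0, tol=1e-8, max_iter=1000):
    """Apply the same deterministic update until the state stops changing."""
    state = np.asarray(state, dtype=float)
    for step in range(1, max_iter + 1):
        new_state = np.tanh(beta * (W @ state))
        if np.max(np.abs(new_state - state)) < tol:
            return new_state, step, True
        state = new_state
    return state, max_iter, False

def energy(W, state):
    """Classical quadratic Hopfield energy (diagnostic only)."""
    return -0.5 * state @ W @ state

# Hypothetical 8-unit network storing a single pattern (Hebbian couplings).
pattern = np.array([1, -1, 1, -1, 1, 1, -1, -1], dtype=float)
W = np.outer(pattern, pattern) / pattern.size
np.fill_diagonal(W, 0.0)

# Start from a weak, partly corrupted version of the pattern.
start = 0.1 * pattern
start[0] *= -1
final, n_steps, converged = relax(W, start)
```

No loss function, learning rate or optimizer object appears anywhere in the sketch: the fixed point is reached purely by iterating the update.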

      (3) In Figure 2G, the beta values range from 0.035 to 0.06, but they are reported as 0.4 in the main text and the Supplementary Figure. Please clarify this discrepancy.

      We are grateful to the reviewer for spotting this typo. The correct value for β is 0.04, as reported in the Methods section. We have corrected this inconsistency in the revised manuscript as well as in Supplementary Figure 5.

      (4) Line 174: What type of null model was used to evaluate the impact of the beta values? The authors did not provide details on this anywhere in the manuscript.

      We apologize for this omission. The null model is based on permuting the connectome weights while retaining the matrix symmetry, which destroys the specific topological structure but preserves the overall weight distribution. We have now clarified this in multiple places in the revised manuscript (line 432, Tables 1-2, Figure 2), and added new overview tables (Tables 1 and 2) to summarize the methodological approaches and the corresponding null models.
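
One way such a symmetry-preserving permutation could be implemented is sketched below: the off-diagonal weights of the upper triangle are shuffled and then mirrored to the lower triangle. The function name `permute_symmetric`, the decision to leave the diagonal untouched and the toy matrix are assumptions for illustration, not details confirmed by the manuscript.

```python
import numpy as np

def permute_symmetric(W, rng):
    """Shuffle off-diagonal weights of a symmetric matrix, keeping it symmetric."""
    n = W.shape[0]
    upper = np.triu_indices(n, k=1)
    null = np.zeros_like(W)
    # Permute the upper-triangle weights, then mirror them below the diagonal.
    null[upper] = rng.permutation(W[upper])
    null = null + null.T
    # Keeping the original diagonal is an assumption of this sketch.
    null[np.diag_indices(n)] = np.diag(W)
    return null

rng = np.random.default_rng(42)
A = rng.normal(size=(6, 6))
W = (A + A.T) / 2  # hypothetical symmetric "connectome"
W_null = permute_symmetric(W, rng)
```

The surrogate has exactly the same set of weights as the original matrix, so any difference in model behaviour can be attributed to the arrangement of the weights rather than to their distribution.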

      (5) Figure 3B: It appears that the authors only demonstrate the reproducibility of the “internal” attractor across different datasets. What about other states?

      Thank you for noticing this. We now visualize all attractor states in Figure 3B (note that these essentially consist of two symmetric pairs).

      (6) Figure 3: What does “empirical” represent in Figure 3? Is it PCA? If the “empirical” method, which is a much simpler method, can achieve results similar to those of the fcHNN in terms of state occupancy, distribution, and activity flow, what are the benefits of the proposed method? Furthermore, the authors claim that the explanatory power of the fcHNN is higher than that of the empirical model and shows significant differences. However, from my perspective, this difference is not substantial (37.0% vs. 39.9%). What does this signify, particularly in comparison to PCA?

      This is a crucial point that is now a central theme of our revised manuscript. The reviewer is correct that the “empirical” method is PCA. PCA, by identifying variance-heavy orthogonal directions, aims to explain the highest amount of variance possible in the data (with the assumption of Gaussian conditionals). While empirical attractors are closely aligned to the PCs (i.e. eigenvectors of the inverse covariance matrix, as shown in the new analysis Q1), the alignment is only approximate. We take advantage of this small “gap” to quantify whether attractor states are a better fit to the unseen data than the PCs. Given the otherwise strong PC-attractor correspondence, this is expected to be only a small improvement. However, it is an important piece of evidence for the validity of our framework, as it shows that attractors are not just a complementary, perhaps “noisier” variety of the PCs, but a “substrate” that generalizes better to unseen data than the PCs themselves. We have revised the manuscript to clarify this point (line 528).

      Reviewer #2 (Recommendations For The Authors):

      For clarity, it might be useful to define certain key terms and use them consistently. Connectome often refers to structural (anatomical) connectivity unless specifically defined otherwise; this should be considered, for example, in the title of Figure 1B. Brain state often refers to different conditions, i.e. autism, neurotypical, sleep, etc. (for a review, see Kringelbach et al. 2020, Cell Reports). When referring to attractors of brain activity, they might be called substates.

      We thank the reviewer for these helpful suggestions. We have carefully revised the manuscript to ensure our terminology is precise and consistent. We now explicitly refer to the “functional connectome” (including the title) and avoid using the too general term “brain state” and use “substates” instead.

      In Figure 2 some terms are not defined. Noise is sigma in the text but epsilon in the figure. Only in the Methods does the link become clear. Perhaps define epsilon in the caption for clarity. The same applies to μ in the Methods. It is only described above in the Methods; I suggest repeating the epsilon definition for clarity.

      We appreciate this feedback and apologize for the inconsistency. We have revised all figures and the Methods section to ensure that all mathematical symbols (including ε, σ, and μ) are clearly and consistently defined upon their first appearance and in all figure captions. For instance, noise level is now consistently referred to as ε. We improved the consistency and clarity for other terms, too, including:

      functional connectome-based Hopfield network (fcHNN) => functional connectivity-based attractor network (fcANN);

      temperature => inverse temperature;

      We also improved grammar and language throughout the manuscript.

      References

      Kanter, I., & Sompolinsky, H. (1987). Associative recall of memory without errors. Physical Review A, 35(1), 380–392. doi:10.1103/physreva.35.380

      Spisak, T., & Friston, K. (2025). Self-orthogonalizing attractor neural networks emerging from the free energy principle. arXiv preprint arXiv:2505.22749.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We again thank the reviewers for their comments and recommendations. In response to the reviewers’ suggestions, we have performed several additional experiments, added additional discussion, and updated our conclusions to reflect the additional work. Specifically, we have performed additional analyses in female WT and Marco-deficient animals, demonstrating that the Marco-associated phenotypes observed in male mice (reduced adrenal weight, increased lung Ace mRNA and protein expression, unchanged expression of adrenal corticosteroid biosynthetic enzymes) are not present in female mice. We also report new data on the physiological consequences of the increased aldosterone levels observed in male mice, namely plasma sodium and potassium titres, and blood pressure alterations in WT vs Marco-deficient male mice. In an attempt to address the reviewers’ comments relating to our proposed mechanism for the regulation of lung Ace expression, we additionally performed a co-culture experiment using an alveolar macrophage cell line and an endothelial cell line. In light of the additional evidence presented herein, we have updated our conclusions from this study and changed the title of our work to acknowledge that the mechanism underlying the reported phenotype remains incompletely understood. Specific responses to reviewers can be seen below.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The investigators sought to determine whether Marco regulates the levels of aldosterone by limiting uptake of its parent molecule cholesterol in the adrenal gland. Instead, they identify an unexpected role for Marco on alveolar macrophages in lowering the levels of angiotensin-converting enzyme in the lung. This suggests an unexpected role of alveolar macrophages and lung ACE in the production of aldosterone.

      Strengths:

      The investigators suggest an unexpected role for ACE in the lung in the regulation of systemic aldosterone levels.

      The investigators suggest important sex-related differences in the regulation of aldosterone by alveolar macrophages and ACE in the lung.

      Studies to exclude a role for Marco in the adrenal gland are strong, suggesting an extra-adrenal source for the excess aldosterone observed in male Marco knockout mice.

      Weaknesses:

      While the investigators have identified important sex differences in the regulation of extrapulmonary ACE in the regulation of aldosterone levels, the mechanisms underlying these differences are not explored.

      The physiologic impact of the increased aldosterone levels observed in Marco -/- male mice on blood pressure or response to injury is not clear.

      The intracellular signaling mechanism linking lung macrophage levels with the expression of ACE in the lung is not supported by direct evidence.

      Reviewer #2 (Public Review):

      Summary:

      Tissue-resident macrophages are more and more thought to exert key homeostatic functions and contribute to physiological responses. In the report of O'Brien and Colleagues, the idea that the macrophage-expressed scavenger receptor MARCO could regulate adrenal corticosteroid output at steady-state was explored. The authors found that male MARCO-deficient mice exhibited higher plasma aldosterone levels and higher lung ACE expression as compared to wild-type mice, while the availability of cholesterol and the machinery required to produce aldosterone in the adrenal gland were not affected by MARCO deficiency. The authors take these data to conclude that MARCO in alveolar macrophages can negatively regulate ACE expression and aldosterone production at steady-state and that MARCO-deficient mice suffer from secondary hyperaldosteronism.

      Strengths:

      If properly demonstrated and validated, the fact that tissue-resident macrophages can exert physiological functions and influence endocrine systems would be highly significant and could be amenable to novel therapies.

      Weaknesses:

      The data provided by the authors currently do not support the major claim of the authors that alveolar macrophages, via MARCO, are involved in the regulation of a hormonal output in vivo at steady-state. At this point, there are two interesting but descriptive observations in male, but not female, MARCO-deficient animals, and overall, the study lacks key controls and validation experiments, as detailed below.

      Major weaknesses:

      (1) According to the reviewer's own experience, the comparison between C57BL/6J wild-type mice and knock-out mice for which precise information about the genetic background and the history of breedings and crossings is lacking, can lead to misinterpretations of the results obtained. Hence, MARCO-deficient mice should be compared with true littermate controls.

      (2) The use of mice globally deficient for MARCO combined with the fact that alveolar macrophages produce high levels of MARCO is not sufficient to prove that the phenotype observed is linked to alveolar macrophage-expressed MARCO (see below for suggestions of experiments).

      (3) If the hypothesis of the authors is correct, then additional read-outs could be performed to reinforce their claims: levels of Angiotensin I would be lower in MARCO-deficient mice, levels of Angiotensin II would be higher in MARCO-deficient mice, arterial blood pressure would be higher in MARCO-deficient mice, natremia would be higher in MARCO-deficient mice, while kaliemia would be lower in MARCO-deficient mice. In addition, co-culture experiments between MARCO-sufficient or deficient alveolar macrophages and lung endothelial cells, combined with the assessment of ACE expression, would allow the authors to evaluate whether the AM-expressed MARCO can directly regulate ACE expression.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Corticosterone levels in male Marco -/- mice are not significantly different, but there is (by eye) substantially more variability in the knockout compared to the wild type. A power analysis should be performed to determine the number of mice needed to detect a similar % difference in corticosterone to the difference observed in aldosterone between male Marco knockout and wild-type mice. If necessary the experiments should be repeated with an adequately powered cohort.

      Using a power calculator (www.gigacalculator.com), it was determined that our sample size of 13 was one less than sufficient to detect a similar % difference in corticosterone as was detected in aldosterone. We regret that we were unable to perform additional measurements as the reviewer suggested in the available timeframe.
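
For readers who want to see the shape of such a calculation, the following is a minimal sketch of a per-group sample size estimate for a two-sided, two-sample comparison using the normal approximation. The response does not state which formula or settings the online calculator used, so the formula choice, the function name `n_per_group`, the defaults (α = 0.05, power = 0.80) and the effect sizes are all assumptions, not values from the study.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Per-group n for a two-sided two-sample test (normal approximation).

    `effect_size` is the standardized mean difference (Cohen's d).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

# Hypothetical standardized effect sizes, for illustration only.
n_large_effect = n_per_group(1.0)
n_medium_effect = n_per_group(0.5)
```

The required sample size scales with the inverse square of the effect size, which is why greater variability in the knockout group (a smaller standardized effect) can quickly push the required cohort beyond what is available.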

      (2) All of the data throughout the MS (particularly data in the lung) should be presented in male and female mice. For example, the induction of ACE in the lungs of Marco-/- female mice should be absent. Similar concerns relate to the dexamethasone suppression studies. It would also be useful if the single cell data could be examined by sex; this should be possible even post hoc using Xist etc.

      Given the limitations outlined in our previous response to reviewers it was not possible to repeat every experiment from the original manuscript. We were able to measure the expression of lung Ace mRNA, ACE protein, adrenal weights, adrenal expression of steroid biosynthetic enzymes, presence of myeloid cells, and levels of serum electrolytes in female animals. These are presented in figures 1G, 3B, 4A, 4E, 4F, 4I, and 4J. We have elected to not present single cell seq data according to sex as it did not indicate substantial differences between males and females in Marco or Ace expression and so does not substantively change our approach.

      (3) IF is notoriously unreliable in the lung, which has high levels of autofluorescence. This is the only method used to show ACE levels are increased in the absence of Marco. Orthogonal methods (e.g. immunoblots of flow-sorted cells, or ideally CITE-seq that includes both male and female mice) should be used.

      We used negative controls to guide our settings during acquisition of immunofluorescent images. Additionally, we also used qPCR to show an increase in Ace mRNA expression in the lung in addition to the protein level. This data was presented in the original manuscript and is further bolstered by our additional presentation of expression data for Ace mRNA and protein in female animals in this revised manuscript.

      (4) Given the central importance of ACE staining to the conclusions, validation of the antibody should be included in the supplement.

      We don’t have ACE-deficient mice so cannot do KO validation of the antibody. We did perform secondary stain controls which confirmed the signal observed is primary antibody-derived. Moreover, we specifically chose an anti-ACE antibody (Invitrogen catalogue # MA5-32741) that has undergone advanced verification with the manufacturer. We additionally tested the antibody in the brain and liver and observed no significant levels of staining.

      Author response image 1.

      (5) The link between alveolar macrophage Marco and ACE is poorly explored.

      We carried out co-culture experiments with alveolar macrophages and endothelial cells and measured ACE/Ace expression as a consequence. These results are presented in figure 5D and in the discussion.

      (6) Mechanisms explaining the substantial sex difference in the primary outcome are not explored.

      This is outside the scope of this project, though we would consider exploring such experiments in future studies.

      (7) Are there physiologic consequences either in homeostasis or under stress to the increased aldosterone (or lung ACE levels) observed in Marco-/- male mice?

      We measured blood electrolytes and blood pressure in Marco-deficient and Marco-sufficient mice. The results from these experiments are presented in figures 4G-4M.

      Reviewer #2 (Recommendations For The Authors):

      Below is a suggestion of important control or validation experiments to be performed in order to support the authors' claims.

      (1) It is imperative to validate that the phenotype observed in MARCO-deficient mice is indeed caused by the deficiency in MARCO. To this end, littermate mice issued from the crossing between heterozygous MARCO +/- mice should be compared to each other. C57BL/6J mice can first be crossed with MARCO-deficient mice in F0, and F1 heterozygous MARCO +/- mice should be crossed together to produce F2 MARCO +/+, MARCO +/- and MARCO -/- littermate mice that can be used for experiments.

      We thank the reviewer for their comments. We recognise the concern of the reviewer but due to limited experimenter availability we are unable to undertake such a breeding programme to address this particular concern.

      (2) The use of mice in which AM, but not other cells, lack MARCO expression would demonstrate that the effect is indeed linked to AM. To this end, AM-deficient Csf2rb-deficient mice could be adoptively transferred with MARCO-deficient AM. In addition, the phenotype of MARCO-deficient mice should be restored by the adoptive transfer of wild-type, MARCO-expressing AM. Alternatively, bone marrow chimeras in which only the hematopoietic compartment is deficient in MARCO would be another option, albeit less specific for AM.

      We recognise the concern of the reviewer. We carried out co-culture experiments with alveolar macrophages and endothelial cells and measured ACE/Ace expression as a consequence. These results are presented in figure 5D, and their implications are explored in the discussion.

      (3) If the hypothesis of the authors is correct, then additional read-outs could be performed to reinforce their claims: levels of Angiotensin I would be lower in MARCO-deficient mice, levels of Angiotensin II would be higher in MARCO-deficient mice, arterial blood pressure would be higher in MARCO-deficient mice, natremia would be higher in MARCO-deficient mice, while kalemia would be lower in MARCO-deficient mice. Similar read-outs could also be performed in the models proposed in point 2).

      We measured blood electrolytes and blood pressure in Marco-deficient and Marco-sufficient mice. The results from these experiments are presented in figures 4G-4M.

      (4) Co-culture experiments between MARCO-sufficient or deficient alveolar macrophages and lung endothelial cells, combined with the assessment of ACE expression, would allow the authors to evaluate whether the AM-expressed MARCO can directly regulate ACE expression.

      To address this concern we carried out a co-culture experiment as described above.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This useful study presents Altair-LSFM, a solid and well-documented implementation of a light-sheet fluorescence microscope (LSFM) designed for accessibility and cost reduction. While the approach offers strengths such as the use of custom-machined baseplates and detailed assembly instructions, its overall impact is limited by the lack of live-cell imaging capabilities and the absence of a clear, quantitative comparison to existing LSFM platforms. As such, although technically competent, the broader utility and uptake of this system by the community may be limited.

      We thank the editors and reviewers for their thoughtful evaluation of our work and for recognizing the technical strengths of the Altair-LSFM platform, including the custom-machined baseplates and detailed documentation provided to promote accessibility and reproducibility. Below, we provide point-by-point responses to each referee comment. In the process, we have significantly revised the manuscript to include live-cell imaging data and a quantitative evaluation of imaging speed. We now more explicitly describe the different variants of lattice light-sheet microscopy, highlighting differences in their illumination flexibility and image acquisition modes, and clarify how Altair-LSFM compares to each. We further discuss challenges associated with the 5 mm coverslip and propose practical strategies to overcome them. Additionally, we outline cost-reduction opportunities, explain the rationale behind key equipment selections, and provide guidance for implementing environmental control. Altogether, we believe these additions have strengthened the manuscript and clarified both the capabilities and limitations of Altair-LSFM.

      Public Reviews:

      Reviewer #1 (Public review): 

      Summary: 

      The article presents the details of the high-resolution light-sheet microscopy system developed by the group. In addition to presenting the technical details of the system, its resolution has been characterized and its functionality demonstrated by visualizing subcellular structures in a biological sample.

      Strengths: 

      (1) The article includes extensive supplementary material that complements the information in the main article.

      (2) However, in some sections, the information provided is somewhat superficial.

      We thank the reviewer for their thoughtful assessment and for recognizing the strengths of our manuscript, including the extensive supplementary material. Our goal was to make the supplemental content as comprehensive and useful as possible. In addition to the materials provided with the manuscript, our intention is for the online documentation (available at thedeanlab.github.io/altair) to serve as a living resource that evolves in response to user feedback. We would therefore greatly appreciate the reviewer’s guidance on which sections were perceived as superficial so that we can expand them to better support readers and builders of the system.

      Weaknesses:

      (1) Although a comparison is made with other light-sheet microscopy systems, the presented system does not represent a significant advance over existing systems. It uses high numerical aperture objectives and Gaussian beams, achieving resolution close to theoretical after deconvolution. The main advantage of the presented system is its ease of construction, thanks to the design of a perforated base plate.

      We appreciate the reviewer’s assessment and the opportunity to clarify our intent. Our primary goal was not to introduce new optical functionality beyond that of existing high-performance light-sheet systems, but rather to substantially reduce the barrier to entry for non-specialist laboratories. Many open-source implementations, such as OpenSPIM, OpenSPIN, and Benchtop mesoSPIM, similarly focused on accessibility and reproducibility rather than introducing new optical modalities, yet have had a measurable impact on the field by enabling broader community participation. Altair-LSFM follows this tradition, providing sub-cellular resolution performance comparable to advanced systems like LLSM, while emphasizing reproducibility, ease of construction through a precision-machined baseplate, and comprehensive documentation to facilitate dissemination and adoption.

      (2) Using similar objectives (Nikon 25x and Thorlabs 20x), the results obtained are similar to those of the LLSM system (using a Gaussian beam without laser modulation). However, the article does not mention the difficulties of mounting the sample in the implemented configuration.

      We appreciate the reviewer’s comment and agree that there are practical challenges associated with handling 5 mm diameter coverslips in this configuration. In the revised manuscript, we now explicitly describe these challenges and provide practical solutions. Specifically, we highlight the use of a custom-machined coverslip holder designed to simplify mounting and handling, and we direct readers to an alternative configuration using the Zeiss W Plan-Apochromat 20×/1.0 objective, which eliminates the need for small coverslips altogether.

      (3) The authors present a low-cost, open-source system. Although they provide open source code for the software (navigate), the use of proprietary electronics (ASI, NI, etc.) makes the system relatively expensive. Its low cost is not justified.

      We appreciate the reviewer’s perspective and understand the concern regarding the use of proprietary control hardware such as the ASI Tiger Controller and NI data acquisition cards. Our decision to use these components was intentional: relying on a unified, professionally supported and maintained platform minimizes complexity associated with sourcing, configuring, and integrating hardware from multiple vendors, thereby reducing non-financial barriers to entry for non-specialist users.

      Importantly, these components are not the primary cost driver of Altair-LSFM (they represent roughly 18% of the total system cost). Nonetheless, for individuals where the price is prohibitive, we also outline several viable cost-reduction options in the revised manuscript (e.g., substituting manual stages, omitting the filter wheel, or using industrial CMOS cameras), while discussing the trade-offs these substitutions introduce in performance and usability. These considerations are now summarized in Supplementary Note 1, which provides a transparent rationale for our design and cost decisions.

      Finally, we note that even with these professional-grade components, Altair-LSFM remains substantially less expensive than commercial systems offering comparable optical performance, such as LLSM implementations from Zeiss or 3i.

      (4) The fibroblast images provided are of exceptional quality. However, these are fixed samples. The system lacks the necessary elements for monitoring cells in vivo, such as temperature or pH control.

      We thank the reviewer for their positive comment regarding the quality of our data. As noted, the current manuscript focuses on validating the optical performance and resolution of the system using fixed specimens to ensure reproducibility and stability.

      We fully agree on the importance of environmental control for live-cell imaging. In the revised manuscript, we now describe in detail how temperature regulation can be achieved using a custom-designed heated sample chamber, accompanied by detailed assembly instructions on our GitHub repository and summarized in Supplementary Note 2. For pH stabilization in systems lacking a 5% CO₂ atmosphere, we recommend supplementing the imaging medium with 10–25 mM HEPES buffer. Additionally, we include new live-cell imaging data demonstrating that Altair-LSFM supports in vitro time-lapse imaging of dynamic cellular processes under controlled temperature conditions.

      Reviewer #2 (Public review): 

      Summary: 

      The authors present Altair-LSFM (Light Sheet Fluorescence Microscope), a high-resolution, open-source microscope, that is relatively easy to align and construct and achieves sub-cellular resolution. The authors developed this microscope to fill a perceived need that current open-source systems are primarily designed for large specimens and lack sub-cellular resolution or are difficult to construct and align, and are not stable. While commercial alternatives exist that offer sub-cellular resolution, they are expensive. The authors' manuscript centers around comparisons to the highly successful lattice light-sheet microscope, including the choice of detection and excitation objectives. The authors thus claim that there remains a critical need for high-resolution, economical, and easy-to-implement LSFM systems. 

      We thank the reviewer for their thoughtful summary. We agree that existing open-source systems primarily emphasize imaging of large specimens, whereas commercial systems that achieve sub-cellular resolution remain costly and complex. Our aim with Altair-LSFM was to bridge this gap—providing LLSM-level performance in a substantially more accessible and reproducible format. By combining high-NA optics with a precision-machined baseplate and open-source documentation, Altair offers a practical, high-resolution solution that can be readily adopted by non-specialist laboratories.

      Strengths: 

      The authors succeed in their goals of implementing a relatively low-cost (~ USD 150K) open-source microscope that is easy to align. The ease of alignment rests on using custom-designed baseplates with dowel pins for precise positioning of optics based on computer analysis of opto-mechanical tolerances, as well as the optical path design. They simplify the excitation optics over Lattice light-sheet microscopes by using a Gaussian beam for illumination while maintaining lateral and axial resolutions of 235 and 350 nm across a 260-um field of view after deconvolution. In doing so they rest on foundational principles of optical microscopy that what matters for lateral resolution is the numerical aperture of the detection objective and proper sampling of the image field on to the detection, and the axial resolution depends on the thickness of the light-sheet when it is thinner than the depth of field of the detection objective. This concept has unfortunately not been completely clear to users of high-resolution light-sheet microscopes and is thus a valuable demonstration. The microscope is controlled by an open-source software, Navigate, developed by the authors, and it is thus foreseeable that different versions of this system could be implemented depending on experimental needs while maintaining easy alignment and low cost. They demonstrate system performance successfully by characterizing their sheet, point-spread function, and visualization of sub-cellular structures in mammalian cells, including microtubules, actin filaments, nuclei, and the Golgi apparatus.

      We thank the reviewer for their thoughtful and generous assessment of our work. We are pleased that the manuscript’s emphasis on fundamental optical principles, design rationale, and practical implementation was clearly conveyed. We agree that Altair’s modular and accessible architecture provides a strong foundation for future variants tailored to specific experimental needs. To facilitate this, we have made all Zemax simulations, CAD files, and build documentation openly available on our GitHub repository, enabling users to adapt and extend the system for diverse imaging applications.

      Weaknesses:

      There is a fixation on comparison to the first-generation lattice light-sheet microscope, which has evolved significantly since then:

      (1) The authors claim that commercial lattice light-sheet microscopes (LLSM) are "complex, expensive, and alignment intensive", I believe this sentence applies to the open-source version of LLSM, which was made available for wide dissemination. Since then, a commercial solution has been provided by 3i, which is now being used in multiple cores and labs but does require routine alignments. However, Zeiss has also released a commercial turn-key system, which, while expensive, is stable, and the complexity does not interfere with the experience of the user. Though in general, statements on ease of use and stability might be considered anecdotal and may not belong in a scientific article, unreferenced or without data.

      We thank the reviewer for this thoughtful and constructive comment. We have revised the manuscript to more clearly distinguish between the original open-source implementation of LLSM and subsequent commercial versions by 3i and ZEISS. The revised Introduction and Discussion now explicitly note that while open-source and early implementations of LLSM can require expert alignment and maintenance, commercial systems—particularly the ZEISS Lattice Lightsheet 7—are designed for automated operation and stable, turn-key use, albeit at higher cost and with limited modifiability. We have also moderated earlier language regarding usability and stability to avoid anecdotal phrasing.

      We also now provide a more objective proxy for system complexity: the number of optical elements that require precise alignment during assembly and maintenance thereafter. The original open-source LLSM setup includes approximately 29 optical components that must each be carefully positioned laterally, angularly, and coaxially along the optical path. In contrast, the first-generation Altair-LSFM system contains only nine such elements. By this metric, Altair-LSFM is considerably simpler to assemble and align, supporting our overarching goal of making high-resolution light-sheet imaging more accessible to non-specialist laboratories.

      (2) One of the major limitations of the first generation LLSM was the use of a 5 mm coverslip, which was a hindrance for many users. However, the Zeiss system elegantly solves this problem, and so does Oblique Plane Microscopy (OPM), while the Altair-LSFM retains this feature, which may dissuade widespread adoption. This limitation and how it may be overcome in future iterations is not discussed.

      We thank the reviewer for this helpful comment. We agree that the use of 5 mm diameter coverslips, while enabling high-NA imaging in the current Altair-LSFM configuration, may pose a practical limitation for some users. We now discuss this more explicitly in the revised manuscript. Specifically, we note that replacing the detection objective provides a straightforward solution to this constraint. For example, as demonstrated by Moore et al. (Lab Chip, 2021), pairing the Zeiss W Plan-Apochromat 20×/1.0 detection objective with the Thorlabs TL20X-MPL illumination objective allows imaging beyond the physical surfaces of both objectives, eliminating the need for small-format coverslips. In the revised text, we propose this modification as an accessible path toward greater compatibility with conventional sample mounting formats. We also note in the Discussion that Oblique Plane Microscopy (OPM) inherently avoids such nonstandard mounting requirements and, owing to its single-objective architecture, is fully compatible with standard environmental chambers.

      (3) Further, on the point of sample flexibility, all generations of the LLSM, and by the nature of its design, the OPM, can accommodate live-cell imaging with temperature, gas, and humidity control. It is unclear how this would be implemented with the current sample chamber. This limitation would severely limit use cases for cell biologists, for which this microscope is designed. There is no discussion on this limitation or how it may be overcome in future iterations.

      We thank the reviewer for this important observation and agree that environmental control is critical for live-cell imaging applications. It is worth noting that the original open-source LLSM design, as well as the commercial version developed by 3i, provided temperature regulation but did not include integrated control of CO2 or humidity. Despite this limitation, these systems have been widely adopted and have generated significant biological insights. We also acknowledge that both OPM and the ZEISS implementation of LLSM offer clear advantages in this respect, providing compatibility with standard commercial environmental chambers that support full regulation of temperature, CO₂, and humidity.

      In the revised manuscript, we expand our discussion of environmental control in Supplementary Note 2, where we describe the Altair-LSFM chamber design in more detail and discuss its current implementation of temperature regulation and HEPES-based pH stabilization. Additionally, the Discussion now explicitly notes that OPM avoids the challenges associated with non-standard sample mounting and is inherently compatible with conventional environmental enclosures.

      (4) The authors' comparison to LLSM is constrained to the "square" lattice, which, as they point out, is the most used optical lattice (though this also might be considered anecdotal). The LLSM original design, however, goes far beyond the square lattice, including hexagonal lattices, the ability to do structured illumination, and greater flexibility in general in terms of light-sheet tuning for different experimental needs, as well as not being limited to just sample scanning. Thus, the Altair-LSFM cannot compare to the original LLSM in terms of versatility, even if comparisons to the resolution provided by the square lattice are fair.

      We agree that the original LLSM design offers substantially greater flexibility than what is reflected in our initial comparison, including the ability to generate multiple lattice geometries (e.g., square and hexagonal), operate in structured illumination mode, and acquire volumes using both sample-scanning and light-sheet-scanning strategies. To address this, we now include Supplementary Note 3, which provides a detailed overview of the illumination modes and imaging flexibility afforded by the original LLSM implementation, and how these capabilities compare to both the commercial ZEISS Lattice Lightsheet 7 and our Altair-LSFM system. In addition, we have revised the discussion to explicitly acknowledge that the original LLSM could operate in alternative scan strategies beyond sample scanning, providing greater context for readers and ensuring a more balanced comparison.

      (5) There is no demonstration of the system's live-imaging capabilities or temporal resolution, which is the main advantage of existing light-sheet systems.

      In the revised manuscript, we now include a demonstration of live-cell imaging to directly validate Altair-LSFM’s suitability for dynamic biological applications. We also explicitly discuss the temporal resolution of the system in the main text (see Optoelectronic Design of Altair-LSFM), where we detail both software- and hardware-related limitations. Specifically, we evaluate the maximum imaging speed achievable with Altair-LSFM in conjunction with our open-source control software, navigate.

      For simplicity and reduced optoelectronic complexity, the current implementation powers the piezo through the ASI Tiger Controller, which modestly reduces its bandwidth. Nonetheless, for a 100 µm stroke typical of light-sheet imaging, we achieved sufficient performance to support volumetric imaging at most biologically relevant timescales. These results, along with additional discussion of the design trade-offs and performance considerations, are now included in the revised manuscript and expanded upon in the supplementary material.

      While the microscope is well designed and completely open source, it will require experience with optics, electronics, and microscopy to implement and align properly. Experience with custom machining or soliciting a machine shop is also necessary. Thus, in my opinion, it is unlikely to be implemented by a lab that has zero prior experience with custom optics or can hire someone who does. Altair-LSFM may not be as easily adaptable or implementable as the authors describe or perceive in any lab that is interested, even if they can afford it. The authors indicate they will offer "workshops," but this does not necessarily remove the barrier to entry or lower it, perhaps as significantly as the authors describe.

      We appreciate the reviewer’s perspective and agree that building any high-performance custom microscope—Altair-LSFM included—requires a basic understanding of (or willingness to learn) optics, electronics, and instrumentation. Such a barrier exists for all open-source microscopes, and our goal is not to eliminate this requirement entirely but to substantially reduce the technical and logistical challenges that typically accompany the construction of custom light-sheet systems.

      Importantly, no machining experience or in-house fabrication capabilities are required. Users can simply submit the provided CAD design files and specifications directly to commercial vendors for fabrication. We have made this process as straightforward as possible by supplying detailed build instructions, recommended materials, and vendor-ready files through our GitHub repository. Our dissemination strategy draws inspiration from other successful open-source projects such as mesoSPIM, which has seen widespread adoption—over 30 implementations worldwide—through a similar model of exhaustive documentation, open-source software, and community support via user meetings and workshops.

      We also recognize that documentation alone cannot fully replace hands-on experience. To further lower barriers to adoption, we are actively working with commercial vendors to streamline procurement and assembly, and Altair-LSFM is supported by a Biomedical Technology Development and Dissemination (BTDD) grant that provides resources for hosting workshops, offering real-time community support, and developing supplementary training materials.

      In the revised manuscript, we now expand the Discussion to explicitly acknowledge these implementation considerations and to outline our ongoing efforts to support a broad and diverse user base, ensuring that laboratories with varying levels of technical expertise can successfully adopt and maintain the Altair-LSFM platform.

      There is a claim that this design is easily adaptable. However, the requirement of custom-machined baseplates and in silico optimization of the optical path basically means that each new instrument is a new design, even if the Navigate software can be used. It is unclear how Altair-LSFM demonstrates a modular design that reduces times from conception to optimization compared to previous implementations.

      We thank the reviewer for this insightful comment and agree that our original language regarding adaptability may have overstated the degree to which Altair-LSFM can be modified without prior experience. It was not our intention to imply that the system can be easily redesigned by users with limited technical background. Meaningful adaptations of the optical or mechanical design do require expertise in optical layout, optomechanical design, and alignment.

      That said, for laboratories with such expertise, we aim to facilitate modifications by providing comprehensive resources—including detailed Zemax simulations, complete CAD models, and alignment documentation. These materials are intended to reduce the development burden for expert users seeking to tailor the system to specific experimental requirements, without necessitating a complete re-optimization of the optical path from first principles.

      In the revised manuscript, we clarify this point and temper our language regarding adaptability to better reflect the realistic scope of customization. Specifically, we now state in the Discussion: “For expert users who wish to tailor the instrument, we also provide all Zemax illumination-path simulations and CAD files, along with step-by-step optimization protocols, enabling modification and re-optimization of the optical system as needed.” This revision ensures that readers clearly understand that Altair-LSFM is designed for reproducibility and straightforward assembly in its default configuration, while still offering the flexibility for modification by experienced users.

      Reviewer #3 (Public review):

      Summary: 

      This manuscript introduces a high-resolution, open-source light-sheet fluorescence microscope optimized for sub-cellular imaging. The system is designed for ease of assembly and use, incorporating a custom-machined baseplate and in silico optimized optical paths to ensure robust alignment and performance. The authors demonstrate lateral and axial resolutions of ~235 nm and ~350 nm after deconvolution, enabling imaging of sub-diffraction structures in mammalian cells. The important feature of the microscope is the clever and elegant adaptation of simple Gaussian beams, smart beam shaping, galvo pivoting and high NA objectives to ensure a uniform thin light-sheet of around 400 nm in thickness, over a 266 micron wide field of view, pushing the axial resolution of the system beyond the regular diffraction limited-based tradeoffs of light-sheet fluorescence microscopy. Compelling validation using fluorescent beads and multicolor cellular imaging highlights the system's performance and accessibility. Moreover, a very extensive and comprehensive manual of operation is provided in the form of supplementary materials. This provides a DIY blueprint for researchers who want to implement such a system.

      We thank the reviewer for their thoughtful and positive assessment of our work. We appreciate their recognition of Altair-LSFM’s design and performance, including its ability to achieve high-resolution imaging throughout a 266-micron field of view. While Altair-LSFM approaches the practical limits of diffraction-limited performance, it does not exceed the fundamental diffraction limit; rather, it achieves near-theoretical resolution through careful optical optimization, beam shaping, and alignment. We are grateful for the reviewer’s acknowledgment of the accessibility and comprehensive documentation that make this system broadly implementable.
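
      The diffraction-limited trade-off between sheet thickness and usable extent along the beam can be illustrated with the textbook paraxial Gaussian-beam relation z_R = π w0² n / λ0. The sketch below is illustrative only: the waists, wavelength, and refractive index are assumed values rather than Altair-LSFM parameters, the helper name `rayleigh_length_um` is hypothetical, and the paraxial formula becomes inaccurate for very tightly focused beams.

```python
from math import pi


def rayleigh_length_um(waist_um, wavelength_um, refractive_index=1.33):
    """Paraxial Rayleigh length z_R = pi * w0^2 * n / lambda_0, in micrometers.

    waist_um is the 1/e^2 beam radius at focus; wavelength_um is the vacuum wavelength.
    """
    return pi * waist_um ** 2 * refractive_index / wavelength_um


WAVELENGTH_UM = 0.488  # assumed excitation wavelength (vacuum), for illustration

# Halving the waist shortens the in-focus propagation length (2 * z_R) fourfold.
for waist in (2.0, 1.0):
    z_r = rayleigh_length_um(waist, WAVELENGTH_UM)
    print(f"waist {waist} um -> in-focus length of about {2 * z_r:.1f} um")
```

      The quadratic dependence on the waist is why a thinner Gaussian sheet stays in focus over a shorter distance along its propagation direction.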

      Strengths:

      (1) Strong and accessible technical innovation: With an elegant combination of beam shaping and optical modelling, the authors provide a high-resolution light-sheet system that overcomes the classical light-sheet tradeoff limit of a thin light-sheet and a small field of view. In addition, the integration of in silico modelling with a custom-machined baseplate is very practical and allows for ease of alignment procedures. Combining these features with the solid and super-extensive guide provided in the supplementary information, this provides a protocol for replicating the microscope in any other lab.

      (2) Impeccable optical performance and ease of mounting of samples: The system takes advantage of the same sample-holding method seen already in other implementations, but reduces the optical complexity.

      At the same time, the authors claim to achieve similar lateral and axial resolution to lattice light-sheet microscopy (although without a direct comparison; see the "Weaknesses" section below). The optical characterization of the system is comprehensive and well-detailed. Additionally, the authors validate the system by imaging sub-cellular structures in mammalian cells.

      (3) Transparency and comprehensiveness of documentation and resources: A very detailed protocol provides detailed documentation about the setup, the optical modeling, and the total cost.

      We thank the reviewer for their thoughtful and encouraging comments. We are pleased that the technical innovation, optical performance, and accessibility of Altair-LSFM were recognized. Our goal from the outset was to develop a diffraction-limited, high-resolution light-sheet system that balances optical performance with reproducibility and ease of implementation. We are also pleased that the use of precision-machined baseplates was recognized as a practical and effective strategy for achieving performance while maintaining ease of assembly.

      Weaknesses: 

      (1) Limited quantitative comparisons: Although some qualitative comparison with previously published systems (diSPIM, lattice light-sheet) is provided throughout the manuscript, some side-by-side comparison would be of great benefit for the manuscript, even in the form of a theoretical simulation. While having a direct imaging comparison would be ideal, it's understandable that this goes beyond the interest of the paper; however, a table referencing image quality parameters (taken from the literature), such as signal-to-noise ratio, light-sheet thickness, and resolutions, would really enhance the features of the setup presented. Moreover, based also on the necessity for optical simplification, an additional comment on the importance/difference of dual objective/single objective light-sheet systems could really benefit the discussion.

      In the revised manuscript, we have significantly expanded our discussion of different light-sheet systems to provide clearer quantitative and conceptual context for Altair-LSFM. These comparisons are based on values reported in the literature, as we do not have access to many of these instruments (e.g., DaXi, diSPIM, or commercial and open-source variants of LLSM), and a direct experimental comparison is beyond the scope of this work.

      We note that while quantitative parameters such as signal-to-noise ratio are important, they are highly sample-dependent and strongly influenced by imaging conditions, including fluorophore brightness, camera characteristics, and filter bandpass selection. For this reason, we limited our comparison to more general image-quality metrics—such as light-sheet thickness, resolution, and field of view—that can be reliably compared across systems.

      Finally, per the reviewer’s recommendation, we have added additional discussion clarifying the differences between dual-objective and single-objective light-sheet architectures, outlining their respective strengths, limitations, and suitability for different experimental contexts.

      (2) Limitation to a fixed sample: In the manuscript, there is no mention of incubation temperature, CO₂ regulation, humidity control, or possible integration of commercial environmental control systems. This is a major limitation for an imaging technique that owes its popularity to fast, volumetric, live-cell imaging of biological samples.

      We fully agree that environmental control is critical for live-cell imaging applications. In the revised manuscript, we now describe the design and implementation of a temperature-regulated sample chamber in Supplementary Note 2, which maintains stable imaging conditions through the use of integrated heating elements and thermocouples. This approach enables precise temperature control while minimizing thermal gradients and optical drift. For pH stabilization, we recommend the use of 10–25 mM HEPES in place of CO₂ regulation, consistent with established practice for most light-sheet systems, including the initial variant of LLSM. Although full humidity and CO₂ control are not readily implemented in dual-objective configurations, we note that single-objective designs such as OPM are inherently compatible with commercial environmental chambers and avoid these constraints. Together, these additions clarify how environmental control can be achieved within Altair-LSFM and situate its capabilities within the broader LSFM design space.

      (3) System cost and data storage cost: While the system presented has the advantage of being open-source, it remains relatively expensive (considering the 150k without laser source and optical table, for example). The manuscript could benefit from a more direct comparison of the performance/cost ratio of existing systems, considering academic settings with budgets that most of the time would not allow for expensive architectures. Moreover, it would also be beneficial to discuss the adaptability of the system, in case a 30k objective could not be feasible. Will this system work with different optics (with the obvious limitations coming with the lower NA objective)? This could be an interesting point of discussion. Adaptability of the system in case of lower budgets or more cost-effective choices, depending on the needs.

      We agree that cost considerations are critical for adoption in academic environments. We would also like to clarify that the quoted $150k includes the optical table and laser source. In the revised manuscript, Supplementary Note 1 now includes an expanded discussion of cost–performance trade-offs and potential paths for cost reduction.

      Last, not much is said about the need for data storage. Light-sheet microscopy's bottleneck is the creation of increasingly large datasets, and it could be beneficial to discuss more about the storage needs and the quantity of data generated.

      In the revised manuscript, we now include Supplementary Note 4, which provides a high-level discussion of data storage needs, approximate costs, and practical strategies for managing large datasets generated by light-sheet microscopy. This section offers general guidance, including file-format recommendations and cost considerations, but we note that actual costs will vary by institution and contractual agreements.
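
      As a rough illustration of the scale of storage involved, the raw size of an uncompressed acquisition can be estimated from the frame dimensions, bit depth, and number of planes, channels, and time points. The sketch below is not taken from Supplementary Note 4; the function name and all example values (a 2048 × 2048 frame at 16 bits, 100 planes, 2 channels, 50 time points) are assumptions chosen only for illustration.

```python
def raw_dataset_bytes(width_px, height_px, bytes_per_pixel,
                      n_planes, n_channels=1, n_timepoints=1):
    """Uncompressed size in bytes of a multi-dimensional image stack."""
    frame_bytes = width_px * height_px * bytes_per_pixel
    return frame_bytes * n_planes * n_channels * n_timepoints


# Hypothetical acquisition (not the authors' settings): 2048 x 2048 pixels,
# 16-bit (2 bytes per pixel), 100 planes per volume, 2 channels, 50 time points.
total = raw_dataset_bytes(2048, 2048, 2,
                          n_planes=100, n_channels=2, n_timepoints=50)
print(f"{total / 1e9:.1f} GB uncompressed")
```

      Compression and chunked formats change the on-disk footprint, so an estimate of this kind is only an upper bound on raw pixel data.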

      Conclusion:

      Altair-LSFM represents a well-engineered and accessible light-sheet system that addresses a longstanding need for high-resolution, reproducible, and affordable sub-cellular light-sheet imaging. While some aspects (comparative benchmarking and validation, limitation to fixed samples) would benefit from further development, the manuscript makes a compelling case for Altair-LSFM as a valuable contribution to the open microscopy scientific community.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (1) A picture, or full CAD design of the complete instrument, should be included as a main figure.

      A complete CAD rendering of the microscope is now provided in Supplementary Figure 4.

      (2) There is no quantitative comparison of the effects of the tilting resonant galvo; only a cartoon, a figure should be included.

      The cartoon was intended purely as an educational illustration to conceptually explain the role of the tilting resonant galvo in shaping and homogenizing the light sheet. To clarify this intent, we have revised both the figure legend and corresponding text in the main manuscript. For readers seeking quantitative comparisons, we now reference the original study that provides a detailed analysis of this optical approach, as well as a review on the subject.

      (3) Description of L4 is missing in the Figure 1 caption.

      Thank you for catching this omission. We have corrected it.

      (4) The beam profiles in Figures 1c and 3a, please crop and make the image bigger so the profile can be appreciated. The PSFs in Figure 3c-e should similarly be enlarged and presented using a dynamic range/LUT such that any aberrations can be appreciated.

      In Figure 1c, our goal was to qualitatively illustrate the uniformity of the light-sheet across the full field of view, while Figure 1d provided the corresponding quantitative cross-section. To improve clarity, we have added an additional figure panel offering a higher-magnification, localized view of the light-sheet profile. For Figure 3c–e, we have enlarged the PSF images and adjusted the display range to better convey the underlying signal and allow subtle aberrations to be appreciated.

      (5) It is unclear why LLSM is being used as the gold standard, since in its current commercial form, available from Zeiss, it is a turn-key system designed for core facilities. The original LLSM is also a versatile instrument that provides much more than the square lattice for illumination, including structured illumination, hexagonal lattices, live-cell imaging, wide-field illumination, different scan modes, etc. These additional features are not even mentioned when compared to the Altair-LSFM. If a comparison is to be provided, it should be fair and balanced. Furthermore, as outlined in the public review, anecdotal statements on "most used", "difficult to align", or "unstable" should not be provided without data.

      In the revised manuscript, we have carefully removed anecdotal statements and, where appropriate, replaced them with quantitative or verifiable information. For instance, we now explicitly report that the square lattice was used in 16 of the 20 figure subpanels in the original LLSM publication, and we include a proxy for optical complexity based on the number of optical elements requiring alignment in each system.

      We also now clearly distinguish between the original LLSM design—which supports multiple illumination and scanning modes—and its subsequent commercial variants, including the ZEISS Lattice Lightsheet 7, which prioritizes stability and ease of use over configurational flexibility (see Supplementary Note 3).

      (6) The authors should recognize that implementing custom optics, no matter how well designed, is a big barrier to cross for most cell biology labs.

      We fully understand and now acknowledge in the main text that implementing custom optics can present a significant barrier, particularly for laboratories without prior experience in optical system assembly. However, similar challenges were encountered during the adoption of other open-source microscopy platforms, such as mesoSPIM and OpenSPIM, both of which have nonetheless achieved widespread implementation. Their success has largely been driven by exhaustive documentation, strong community support, and standardized design principles—approaches we have also prioritized in Altair-LSFM. We have therefore made all CAD files, alignment guides, and detailed build documentation publicly available and continue to develop instructional materials and community resources to further reduce the barrier to adoption.

      (7) Statements on "hands-on workshops", though laudable, may not be appropriate to include in a scientific publication without some documentation on the influence they have had on implementing the microscope.

      We understand the concern. Our intention in mentioning hands-on workshops was to convey that the dissemination effort is supported by an NIH Biomedical Technology Development and Dissemination grant, which includes dedicated channels for outreach and community engagement. Nonetheless, we agree that such statements are not appropriate without formal documentation of their impact, and we have therefore removed this text from the revised manuscript.

      (8) It is claimed that the microscope is "reliable" in the discussion, but with no proof, long-term stability should be assessed and included.

      Although our experience with Altair-LSFM has been that it remains well aligned over time, especially in comparison to other light-sheet systems we have worked on over the last 11 years, we acknowledge that this assessment is anecdotal. As such, we have omitted this claim from the revised manuscript.

      (9) Due to the reliance on anecdotal statements and comparisons without proof to other systems, this paper at times reads like a brochure rather than a scientific publication. The authors should consider editing their manuscript accordingly to focus on the technical and quantifiable aspects of their work.

      We agree with the reviewer’s assessment and have revised the manuscript to remove anecdotal comparisons and subjective language. Where possible, we now provide quantitative metrics or verifiable data to support our statements.

      Reviewer #3 (Recommendations for the authors):

      Other minor points that could improve the manuscript (although some of these points are explained in the huge supplementary manual): 

      (1) The authors explain thoroughly their design, and they chose a sample-scanning method. I think that a brief discussion of the advantages and disadvantages of such a method over, for example, a laser-scanning system (with fixed sample) in the main text will be highly beneficial for the users.

      In the revised manuscript, we now include a brief discussion in the main text outlining the advantages and limitations of a sample-scanning approach relative to a light-sheet–scanning system. Specifically, we note that for thin, adherent specimens, sample scanning minimizes the optical path length through the sample, allowing the use of more tightly focused illumination beams that improve axial resolution. We also include a new supplementary figure illustrating how this configuration reduces the propagation length of the illumination light sheet, thereby enhancing axial resolution.

      (2) The authors justify selecting a 0.6 NA illumination objective over alternatives (e.g., Special Optics), but the manuscript would benefit from a more quantitative trade-off analysis (beam waist, working distance, sample compatibility) with other possibilities. Within the objective context, a comparison of the performances of this system with the new and upcoming single-objective light-sheet methods (and the ones based also on optical refocusing, e.g., DAXI) would be very interesting for the goodness of the manuscript.

      In the revised manuscript, we now provide a quantitative trade-off analysis of the illumination objectives in Supplementary Note 1, including comparisons of beam waist, working distance, and sample compatibility. This section also presents calculated point spread functions for both the 0.6 NA and 0.67 NA objectives, outlining the performance trade-offs that informed our design choice. In addition, Supplementary Note 3 now includes a broader comparison of Altair-LSFM with other light-sheet modalities, including diSPIM, ASLM, and OPM, to further contextualize the system’s capabilities within the evolving light-sheet microscopy landscape.

      (3) The modularity of the system is implied in the context of the manuscript, but not fully explained. The authors should specify more clearly, for example, if cameras could be easily changed, objectives could be easily swapped, light-sheet thickness could be tuned by changing the cylindrical lens, how users might adapt the system for different samples (e.g., embryos, cleared tissue, live imaging), etc., and discuss eventual constraints or compatibility issues for these implementations.

      Altair-LSFM was explicitly designed and optimized for imaging live adherent cells, where sample scanning and short light-sheet propagation lengths provide optimal axial resolution (Supplementary Note 3). While the same platform could be used for superficial imaging in embryos, systems implementing multiview illumination and detection schemes are better suited for such specimens. Similarly, cleared tissue imaging typically requires specialized solvent-compatible objectives and approaches such as ASLM that maximize the field of view. We have now added text to the Design Principles section that explicitly states this.

      Altair-LSFM offers varying levels of modularity depending on the user’s level of expertise. For entry-level users, the illumination numerical aperture—and therefore the light-sheet thickness and propagation length—can be readily adjusted by tuning the rectangular aperture conjugate to the back pupil of the illumination objective, as described in the Design Principles section. For mid-level users, alternative configurations of Altair-LSFM, including different detection objectives, stages, filter wheels, or cameras, can be readily implemented (Supplementary Note 1). Importantly, navigate natively supports a broad range of hardware devices, and new components can be easily integrated through its modular interface. For expert users, all Zemax simulations, CAD models, and step-by-step optimization protocols are openly provided, enabling complete re-optimization of the optical design to meet specific experimental requirements.

      (4) Resolution measurements before and after deconvolution are central to the performance claim, but the deconvolution method (PetaKit5D) is only briefly mentioned in the main text, it's not referenced, and has to be clarified in more detail, coherently with the precision of the supplementary information. More specifically, PetaKit5D should be referenced in the main text, the details of the deconvolution parameters discussed in the Methods section, and the computational requirements should also be mentioned. 

      In the revised manuscript, we now provide a dedicated description of the deconvolution process in the Methods section, including the specific parameters and algorithms used. We have also explicitly referenced PetaKit5D in the main text to ensure proper attribution and clarity. Additionally, we note the computational requirements associated with this analysis in the same section for completeness.

      (5) Image post-processing is not fully explained in the main text. Since the system is sample-scanning based, no word in the main text is spent on deskewing, which is an integral part of the post-processing to obtain a "straight" 3D stack. Since other systems implement such a post-processing algorithm (for example, single-objective architectures), it would be beneficial to have some discussion about this, and also a brief comparison to other systems in the main text in the methods section.

      In the revised manuscript, we now explicitly describe both deskewing (shearing) and deconvolution procedures in the Alignment and Characterization section of the main text and direct readers to the Methods section. We also briefly explain why the data must be sheared to correct for the angled sample-scanning geometry for LLSM and Altair-LSFM, as well as both sample-scanning and laser-scanning-variants of OPMs.

      (6) A brief discussion on comparative costs with other systems (LLSM, diSPIM, etc.) could be helpful for non-imaging expert researchers who could try to implement such an optical architecture in their lab.

      Unfortunately, the exact costs of commercial systems such as LLSM or diSPIM are typically not publicly available, as they depend on institutional agreements and vendor-specific quotations. Nonetheless, we now provide approximate cost estimates in Supplementary Note 1 to help readers and prospective users gauge the expected scale of investment relative to other advanced light-sheet microscopy systems.

      (7) The "navigate" control software is provided, but a brief discussion on its advantages compared to an already open-access system, such as Micro-Manager, could be useful for the users.

      In the revised manuscript, we now include Supplementary Note 5 that discusses the advantages and disadvantages of different open-source microscope control platforms, including navigate and Micro-Manager. In brief, navigate was designed to provide turnkey support for multiple light-sheet architectures, with pre-configured acquisition routines optimized for Altair-LSFM, integrated data management with support for multiple file formats (TIFF, HDF5, N5, and Zarr), and full interoperability with OME-compliant workflows. By contrast, while Micro-Manager offers a broader library of hardware drivers, it typically requires manual configuration and custom scripting for advanced light-sheet imaging workflows.

      (8) The cost and parts are well documented, but the time and expertise required are not crystal clear. Adding a simple time estimate (perhaps in the Supplement Section) of assembly/alignment/installation/validation and first imaging will be very beneficial for users. Also, what level of expertise is assumed (prior optics experience, for example) to be needed to install a system like this? This can help non-optics-expert users to better understand what kind of adventure they are putting themselves through.

      We thank the reviewer for this helpful suggestion. To address this, we have added Supplementary Table S5, which provides approximate time estimates for assembly, alignment, validation, and first imaging based on the user’s prior experience with optical systems. The table distinguishes between novice (no prior experience), moderate (some experience using but not assembling optical systems), and expert (experienced in building and aligning optical systems) users. This addition is intended to give prospective builders a realistic sense of the time commitment and level of expertise required to assemble and validate Altair-LSFM.

      Minor things in the main text:

      (1) Line 109: The cost is considered "excluding the laser source". But then in the table of costs, you mention L4cc as a "multicolor laser source", for 25 K. Can you explain this better? Are the costs correct with or without the laser source? 

      We acknowledge that the statement in line 109 was incorrect—the quoted ~$150k system cost does include the laser source (L4cc, listed at $25k in the cost table). We have corrected this in the revised manuscript.

      (2) Line 113: You say "lateral resolution", but then you state a 3D resolution (230 nm x 230 nm x 370 nm). This needs to be fixed.

      Thank you, we have corrected this.

      (3) Line 138: Is the light-sheet uniformity proven also with a fluorescent dye? This could be beneficial for the main text, showing the performance of the instrument in a fluorescent environment.

      The light-sheet profiles shown in the manuscript were acquired using fluorescein to visualize the beam. We have revised the main text and figure legends to clearly state this.

      (4) Line 149: This is one of the most important features of the system, defying the usual tradeoff between light-sheet thickness and field of view, with a regular Gaussian beam. I would clarify more specifically how you achieve this because this really is the most powerful takeaway of the paper.

      We thank the reviewer for this key observation. The ability of Altair-LSFM to maintain a thin light sheet across a large field of view arises from diffraction effects inherent to high NA illumination. Specifically, diffraction elongates the PSF along the beam’s propagation direction, effectively extending the region over which the light sheet remains sufficiently thin for high-resolution imaging. This phenomenon, which has been the subject of active discussion within the light-sheet microscopy community, allows Altair-LSFM to partially overcome the conventional trade-off between light-sheet thickness and propagation length. We now clarify this point in the main text and provide a more detailed discussion in Supplementary Note 3, which is explicitly referenced in the discussion of the revised manuscript.
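
      For reference, the conventional trade-off mentioned above can be sketched with the textbook paraxial Gaussian-beam relation, in which the Rayleigh range is z_R = π·w0²·n/λ. This is the idealized low-NA picture only; it does not model the high-NA diffraction effects described in this response. The function name and the example waists, wavelength, and refractive index are assumptions chosen for illustration.

```python
import math


def rayleigh_range_um(waist_um, wavelength_um, refractive_index=1.0):
    """Paraxial Gaussian-beam Rayleigh range, z_R = pi * w0^2 * n / lambda.

    waist_um is the 1/e^2 beam-waist radius and wavelength_um the vacuum
    wavelength, both in micrometers; the result is in micrometers.
    """
    return math.pi * waist_um ** 2 * refractive_index / wavelength_um


# Hypothetical values: 0.488 um vacuum wavelength, water immersion (n = 1.33).
for waist_um in (0.25, 0.5, 1.0):
    z_r = rayleigh_range_um(waist_um, 0.488, 1.33)
    # The usable propagation length is often taken as the confocal parameter, 2 * z_R.
    print(f"waist {waist_um:.2f} um -> confocal parameter {2 * z_r:.2f} um")
```

      In this idealized model, halving the waist shortens the propagation length fourfold, which is the trade-off the reviewer refers to.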

      (5) Line 171: You talk about repeatable assembly...have you tried many different baseplates? Otherwise, this is a complicated statement, since this is a proof-of-concept paper. 

      We thank the reviewer for this comment. We have not yet validated the design across multiple independently assembled baseplates and therefore agree that our previous statement regarding repeatable assembly was premature. To avoid overstating the current level of validation, we have removed this statement from the revised manuscript.

      (6) Line 187: same as above. You mention "long-term stability". For how long did you try this? This should be specified in numbers (days, weeks, months, years?) Otherwise, it is a complicated statement to make, since this is a proof-of-concept paper.

      We also agree that referencing long-term stability without quantitative backing is inappropriate, and have removed this statement from the revised manuscript.

      (7) Line 198: "rapid z-stack acquisition". How rapid? Also, what is the limitation of the galvo-scanning in terms of the imaging speed of the system? This should be noted in the methods section.

      In the revised manuscript, we now clarify these points in the Optoelectronic Design section. Specifically, we explicitly note that the resonant galvo used for shadow reduction operates at 4 kHz, ensuring that it is not rate-limiting for any imaging mode. In the same section, we also evaluate the maximum acquisition speeds achievable using navigate and report the theoretical bandwidth of the sample-scanning piezo, which together define the practical limits of volumetric acquisition speed for Altair-LSFM.

      (8) Line 234: PetaKit5D is discussed in the additional documentation, but should be referenced here, as well.

      We now reference and cite PetaKit5D.

      (9) Line 256: "values are on par with LLSM", but no values are provided. Some details should also be provided in the main text.

      In the revised manuscript, we now provide the lateral and axial resolution values originally reported for LLSM in the main text to facilitate direct comparison with Altair-LSFM. Additionally, Supplementary Note 3 now includes an expanded discussion on the nuances of resolution measurement and reporting in light-sheet microscopy.

      Figures:

      (1) Figure 1 could be implemented with Figure 3. They're both discussing the validation of the system (theoretically and with simulations), and they could be together in different panels of the same figure. The experimental light-sheet seems to be shown in a transmission mode. Showing a pattern in a fluorescent dye could also be beneficial for the paper.

      In Figure 1, our goal was to guide readers through the design process—illustrating how the detection objective’s NA sets the system’s resolution, which defines the required pixel size for Nyquist sampling and, in turn, the field of view. We then use Figure 1b–c to show how the illumination beam was designed and simulated to achieve that field of view. In contrast, Figure 3 presents the experimental validation of the illumination system. To avoid confusion, we now clarify in the text that the light sheet shown in Figure 3 was visualized in a fluorescein solution and imaged in transmission mode. While we agree that Figures 1 and 3 both serve to validate the system, we prefer to keep them as separate figures to maintain focus within each panel. We believe this organization better supports the narrative structure and allows readers to digest the theoretical and experimental validations independently.
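
      The design chain described above (detection NA to resolution, resolution to Nyquist-limited pixel size, pixel size to field of view) can be sketched as follows. This is a minimal sketch assuming the Abbe estimate λ/(2·NA) for lateral resolution and two pixels per resolution element; the function names and the example wavelength, NA, and sensor width are hypothetical and are not the parameters reported in the manuscript.

```python
def lateral_resolution_nm(wavelength_nm, detection_na):
    """Abbe estimate of lateral resolution, lambda / (2 * NA), in nanometers."""
    return wavelength_nm / (2.0 * detection_na)


def nyquist_pixel_size_nm(resolution_nm):
    """Largest sample-space pixel size giving two pixels per resolution element."""
    return resolution_nm / 2.0


def field_of_view_um(pixel_size_nm, n_pixels):
    """Field of view along one sensor axis, in micrometers."""
    return pixel_size_nm * n_pixels / 1000.0


# Hypothetical inputs: 500 nm emission, detection NA of 1.0, 2048-pixel-wide sensor.
resolution = lateral_resolution_nm(500.0, 1.0)
pixel = nyquist_pixel_size_nm(resolution)
fov = field_of_view_um(pixel, 2048)
print(f"resolution {resolution:.0f} nm, pixel {pixel:.0f} nm, field of view {fov:.0f} um")
```

      The pixel size here is expressed in sample space; the magnification needed to reach it follows from dividing the camera's physical pixel pitch by this value.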

      (2) Figure 3: Panels d and e show the same thing. Why would you expect that xz and yz profiles should be different? Is this due to the orientation of the objectives towards the sample?

      In Figure 3, we present the PSF from all three orthogonal views, as this provides the most transparent assessment of PSF quality—certain aberration modes can be obscured when only select perspectives are shown. In principle, the XZ and YZ projections should be equivalent in a well-aligned system. However, as seen in the XZ projection, a small degree of coma is present that is not evident in the YZ view. We now explicitly note this observation in the revised figure caption to clarify the difference between these panels.

      (3) Figure 4's single boxes lack a scale bar, and some of the Supplementary Figures (e.g., Figure 5) lack detailed axis labels or scale bars. Also, in the detailed documentation, some figures are referred to as "Figure 5. Figure 7" or, for example, "Figure 6. Figure 8", and this makes the cross-references very complicated to follow.

      In the revised manuscript, we have corrected these issues. All figures and supplementary figures now include appropriate scale bars, axis labels, and consistent formatting. We have also carefully reviewed and standardized all cross-references throughout the main text and supplementary documentation to ensure that figure numbering is accurate and easy to follow.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Zhou and colleagues developed a computational model of replay that heavily builds on cognitive models of memory in context (e.g., the context-maintenance and retrieval model), which have been successfully used to explain memory phenomena in the past. Their model produces results that mirror previous empirical findings in rodents and offers a new computational framework for thinking about replay.

      Strengths:

      The model is compelling and seems to explain a number of findings from the rodent literature. It is commendable that the authors implement commonly used algorithms from wakefulness to model sleep/rest, thereby linking wake and sleep phenomena in a parsimonious way. Additionally, the manuscript's comprehensive perspective on replay, bridging humans and non-human animals, enhanced its theoretical contribution.

      Weaknesses:

      This reviewer is not a computational neuroscientist by training, so some comments may stem from misunderstandings. I hope the authors would see those instances as opportunities to clarify their findings for broader audiences.

      (1) The model predicts that temporally close items will be co-reactivated, yet evidence from humans suggests that temporal context doesn't guide sleep benefits (instead, semantic connections seem to be of more importance; Liu and Ranganath 2021, Schechtman et al 2023). Could these findings be reconciled with the model or is this a limitation of the current framework?

      We appreciate the encouragement to discuss this connection. Our framework can accommodate semantic associations as determinants of sleep-dependent consolidation, which can in principle outweigh temporal associations. Indeed, prior models in this lineage have extensively simulated how semantic associations support encoding and retrieval alongside temporal associations. It would therefore be straightforward to extend our model to simulate how semantic associations guide sleep benefits, and to compare their contribution against that conferred by temporal associations across different experimental paradigms. In the revised manuscript, we have added a discussion of how our framework may simulate the role of semantic associations in sleep-dependent consolidation.

      “Several recent studies have argued for dominance of semantic associations over temporal associations in the process of human sleep-dependent consolidation (Schechtman et al., 2023; Liu and Ranganath 2021; Sherman et al., 2025), with one study observing no role at all for temporal associations (Schechtman et al., 2023). At first glance, these findings appear in tension with our model, where temporal associations drive offline consolidation. Indeed, prior models have accounted for these findings by suppressing temporal context during sleep (Liu and Ranganath 2024; Sherman et al., 2025). However, earlier models in the CMR lineage have successfully captured the joint contributions of semantic and temporal associations to encoding and retrieval (Polyn et al., 2009), and these processes could extend naturally to offline replay. In a paradigm where semantic associations are especially salient during awake learning, the model could weight these associations more and account for greater co-reactivation and sleep-dependent memory benefits for semantically related than temporally related items. Consistent with this idea, Schechtman et al. (2023) speculated that their null temporal effects likely reflected the task’s emphasis on semantic associations. When temporal associations are more salient and task-relevant, sleep-related benefits for temporally contiguous items are more likely to emerge (e.g., Drosopoulos et al., 2007; King et al., 2017).”

      The reviewer’s comment points to fruitful directions for future work that could employ our framework to dissect the relative contributions of semantic and temporal associations to memory consolidation.

      (2) During replay, the model is set so that the next reactivated item is sampled without replacement (i.e., the model cannot get "stuck" on a single item). I'm not sure what the biological backing behind this is and why the brain can't reactivate the same item consistently.

      Furthermore, I'm afraid that such a rule may artificially generate sequential reactivation of items regardless of wake training. Could the authors explain this better or show that this isn't the case?

      We appreciate the opportunity to clarify this aspect of the model. We first note that this mechanism has long been a fundamental component of this class of models (Howard & Kahana 2002). Many classic memory models (Brown et al., 2000; Burgess & Hitch, 1991; Lewandowsky & Murdock 1989) incorporate response suppression, in which activated items are temporarily inhibited. The simplest implementation, which we use here, removes activated items from the pool of candidate items. Alternative implementations achieve this through transient inhibition, often conceptualized as neuronal fatigue (Burgess & Hitch, 1991; Grossberg 1978). Our model adopts a similar perspective, interpreting this mechanism as mimicking a brief refractory period that renders reactivated neurons unlikely to fire again within a short physiological event such as a sharp-wave ripple. Importantly, this approach does not generate spurious sequences. Instead, the model’s ability to preserve the structure of wake experience during replay depends entirely on the learned associations between items (without these associations, item order would be random). Similar assumptions are also common in models of replay. For example, reinforcement learning models of replay incorporate mechanisms such as inhibition to prevent repeated reactivations (e.g., Diekmann & Cheng, 2023) or prioritize reactivation based on ranking to limit items to a single replay (e.g., Mattar & Daw, 2018). We now discuss these points in the section titled “A context model of memory replay”:

      “This mechanism of sampling without replacement, akin to response suppression in established context memory models (Howard & Kahana 2002), could be implemented by neuronal fatigue or refractory dynamics (Burgess & Hitch, 1991; Grossberg 1978). Non-repetition during reactivation is also a common assumption in replay models that regulate reactivation through inhibition or prioritization (Diekmann & Cheng 2023; Mattar & Daw 2018; Singh et al., 2022).”
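
      As an illustration only, the sampling-without-replacement rule described above can be sketched in a few lines. This is not the authors' implementation: the association matrix `assoc`, the `temperature` parameter, and the choice to run each event until every item has been visited are assumptions made for the example (the actual model also has its own stopping rule, which is omitted here).

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def replay_event(assoc, start, rng, temperature=1.0):
    """Generate one replay sequence in which no item is reactivated twice.

    assoc[i][j] is a hypothetical learned association from item i to item j.
    """
    sequence = [start]
    remaining = [j for j in range(len(assoc)) if j != start]
    current = start
    while remaining:
        # Only items not yet reactivated compete for the next slot.
        probs = softmax([assoc[current][j] for j in remaining], temperature)
        current = rng.choices(remaining, weights=probs, k=1)[0]
        remaining.remove(current)  # response suppression: drop from the pool
        sequence.append(current)
    return sequence

rng = random.Random(0)
# Strong forward chain 0 -> 1 -> 2 -> 3 -> 4 (stands in for wake learning).
chain = [[10.0 if j == i + 1 else 0.0 for j in range(5)] for i in range(5)]
# No learned structure at all.
flat = [[0.0] * 5 for _ in range(5)]
print(replay_event(chain, 0, rng))
print(replay_event(flat, 0, rng))
```

      With the `chain` matrix the replayed order almost always follows the trained sequence, whereas with the `flat` matrix the order is a random permutation. In this toy setting, removing visited items does not by itself create sequential structure, which is the point made in the response.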

      (3) If I understand correctly, there are two ways in which novelty (i.e., less exposure) is accounted for in the model. The first and more talked about is the suppression mechanism (lines 639-646). The second is a change in learning rates (lines 593-595). It's unclear to me why both procedures are needed, how they differ, and whether these are two different mechanisms that the model implements. Also, since the authors controlled the extent to which each item was experienced during wakefulness, it's not entirely clear to me which of the simulations manipulated novelty on an individual item level, as described in lines 593-595 (if any).

      We agree that these mechanisms and their relationships would benefit from clarification. As noted, novelty influences learning through two distinct mechanisms. First, the suppression mechanism is essential for capturing the inverse relationship between the amount of wake experience and the frequency of replay, as observed in several studies. This mechanism ensures that items with high wake activity are less likely to dominate replay. Second, the decrease in learning rates with repetition is crucial for preserving the stochasticity of replay. Without this mechanism, the model would increase weights linearly, leading to an exponential increase in the probability of successive wake items being reactivated back-to-back due to the use of a softmax choice rule. This would result in deterministic replay patterns, which are inconsistent with experimental observations.

      We have revised the Methods section to explicitly distinguish these two mechanisms:

      “This experience-dependent suppression mechanism is distinct from the reduction of learning rates through repetition; it does not modulate the update of memory associations but exclusively governs which items are most likely to initiate replay.”

      We have also clarified our rationale for including a learning rate reduction mechanism:

      “The reduction in learning rates with repetition is important for maintaining a degree of stochasticity in the model’s replay during task repetition, since linearly increasing weights would, through the softmax choice rule, exponentially amplify differences in item reactivation probabilities, sharply reducing variability in replay.”
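
      The argument about softmax amplification can be checked with a small worked example. The sketch below is illustrative and rests on assumptions that are not in the manuscript: a single trained successor competing against four zero-weight alternatives, a unit weight increment per repetition, and a geometric `decay` schedule standing in for the (unspecified) learning-rate reduction.

```python
import math

def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def successor_weight(n_repeats, step=1.0, decay=1.0):
    """Association to the trained successor after n_repeats exposures.

    decay=1.0 gives fixed-size increments (linear growth); decay<1.0 shrinks
    each increment geometrically (a hypothetical learning-rate schedule).
    """
    return sum(step * decay ** k for k in range(n_repeats))

def successor_probability(n_repeats, n_competitors=4, step=1.0, decay=1.0):
    """Softmax probability of the trained successor vs. zero-weight competitors."""
    scores = [successor_weight(n_repeats, step, decay)] + [0.0] * n_competitors
    return softmax(scores)[0]

for n in (1, 3, 10):
    print(n,
          round(successor_probability(n), 4),             # constant learning rate
          round(successor_probability(n, decay=0.5), 4))  # decreasing learning rate
```

      Under these assumptions, the odds of choosing the successor equal exp(w)/4, so a linearly growing weight w drives the choice toward determinism within a handful of repetitions, while shrinking increments keep the weight bounded and the choice stochastic.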

      Finally, we now specify exactly where the learning-rate reduction is applied, namely in simulations where sequences are repeated across multiple sessions:

      “In this simulation, the learning rates progressively decrease across sessions, as described above.”

      As to the first mechanism - experience-based suppression - I find it challenging to think of a biological mechanism that would achieve this and is selectively activated immediately before sleep (somehow anticipating its onset). In fact, the prominent synaptic homeostasis hypothesis suggests that such suppression, at least on a synaptic level, is exactly what sleep itself does (i.e., prune or weaken synapses that were enhanced due to learning during the day). This begs the question of whether certain sleep stages (or ultradian cycles) may be involved in pruning, whereas others leverage its results for reactivation (e.g., a sequential hypothesis; Rasch & Born, 2013). That could be a compelling synthesis of this literature. Regardless of whether the authors agree, I believe that this point is a major caveat to the current model. It is addressed in the discussion, but perhaps it would be beneficial to explicitly state to what extent the results rely on the assumption of a pre-sleep suppression mechanism.

      We appreciate the reviewer raising this important point. Unlike the mechanism proposed by the synaptic homeostasis hypothesis, the suppression mechanism in our model does not suppress items based on synapse strength, nor does it modify synaptic weights. Instead, it determines the level of suppression for each item based on activity during awake experience. The brain could implement such a mechanism by tagging each item according to its activity level during wakefulness. During subsequent consolidation, the initial reactivation of an item during replay would reflect this tag, influencing how easily it can be reactivated.

      A related hypothesis has been proposed in recent work, suggesting that replay avoids recently active trajectories due to spike frequency adaptation in neurons (Mallory et al., 2024). Similarly, the suppression mechanism in our model is critical for explaining the observed negative relationship between the amount of recent wake experience and the degree of replay.

      We discuss the biological plausibility of this mechanism and its relationship with existing models in the Introduction. In the section titled “The influence of experience”, we have added the following:

      “Our model implements an activity‑dependent suppression mechanism that, at the onset of each offline replay event, assigns each item a selection probability inversely proportional to its activation during preceding wakefulness. The brain could implement this by tagging each memory trace in proportion to its recent activation; during consolidation, that tag would then regulate starting replay probability, making highly active items less likely to be reactivated. A recent paper found that replay avoids recently traversed trajectories through awake spike‑frequency adaptation (Mallory et al., 2025), which could implement this kind of mechanism. In our simulations, this suppression is essential for capturing the inverse relationship between replay frequency and prior experience. Note that, unlike the synaptic homeostasis hypothesis (Tononi & Cirelli 2006), which proposes that the brain globally downscales synaptic weights during sleep, this mechanism leaves synaptic weights unchanged and instead biases the selection process during replay.”
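
      Read literally, "a selection probability inversely proportional to its activation" suggests the following minimal sketch. The function name and the example activity values are hypothetical, and the functional form used in the actual model may differ; the only purpose is to show that the rule biases where replay starts without touching any association weights.

```python
def initial_replay_probabilities(wake_activity):
    """Starting-item probabilities inversely proportional to wake activity.

    wake_activity: one positive number per item (hypothetical activity tags).
    Association weights are not an input and are never modified here.
    """
    if any(a <= 0 for a in wake_activity):
        raise ValueError("wake activity must be positive for every item")
    inverse = [1.0 / a for a in wake_activity]
    total = sum(inverse)
    return [v / total for v in inverse]

# Item 0 was experienced least, item 2 most.
probs = initial_replay_probabilities([1.0, 2.0, 4.0])
print([round(p, 3) for p in probs])
```

      In this example the least-experienced item is four times as likely to initiate replay as the most-experienced one, reproducing the inverse relationship between prior experience and replay in miniature.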

      (4) As the manuscript mentions, the only difference between sleep and wake in the model is the initial conditions (a0). This is an obvious simplification, especially given the last author's recent models discussing the very different roles of REM vs NREM. Could the authors suggest how different sleep stages may relate to the model or how it could be developed to interact with other successful models such as the ones the last author has developed (e.g., C-HORSE)? 

      We appreciate the encouragement to comment on the roles of different sleep stages in the manuscript, especially since, as noted, the lab is very interested in this and has explored it in other work. We chose to focus on NREM in this work because the vast majority of electrophysiological studies of sleep replay have identified these events during NREM. In addition, our lab’s theory of the role of REM (Singh et al., 2022, PNAS) is that it is a time for the neocortex to replay remote memories, in complement to the more recent memories replayed during NREM. The experiments we simulate all involve recent memories. Indeed, our view is that part of the reason that there is so little data on REM replay may be that experimenters are almost always looking for traces of recent memories (for good practical and technical reasons).

      Regarding the simplicity of the distinction between simulated wake and sleep replay, we view it as an asset of the model that it can account for many of the different characteristics of awake and NREM replay with very simple assumptions about differences in the initial conditions. There are of course many other differences between the states that could be relevant to the impact of replay, but the current target empirical data did not necessitate us taking those into account. This allows us to argue that differences in initial conditions should play a substantial role in an account of the differences between wake and sleep replay.

      We have added discussion of these ideas and how they might be incorporated into future versions of the model in the Discussion section:

      “Our current simulations have focused on NREM, since the vast majority of electrophysiological studies of sleep replay have identified replay events in this stage. We have proposed in other work that replay during REM sleep may provide a complementary role to NREM sleep, allowing neocortical areas to reinstate remote, already-consolidated memories that need to be integrated with the memories that were recently encoded in the hippocampus and replayed during NREM (Singh et al., 2022). An extension of our model could undertake this kind of continual learning setup, where the student but not teacher network retains remote memories, and the driver of replay alternates between hippocampus (NREM) and cortex (REM) over the course of a night of simulated sleep. Other differences between stages of sleep and between sleep and wake states are likely to become important for a full account of how replay impacts memory. Our current model parsimoniously explains a range of differences between awake and sleep replay by assuming simple differences in initial conditions, but we expect many more characteristics of these states (e.g., neural activity levels, oscillatory profiles, neurotransmitter levels, etc.) will be useful to incorporate in the future.”

      Finally, I wonder how the model would explain findings (including the authors') showing a preference for reactivation of weaker memories. The literature seems to suggest that it isn't just a matter of novelty or exposure, but encoding strength. Can the model explain this? Or would it require additional assumptions or some mechanism for selective endogenous reactivation during sleep and rest?

      We appreciate the encouragement to discuss this, as we do think the model could explain findings showing a preference for reactivation of weaker memories, as in Schapiro et al. (2018). In our framework, memory strength is reflected in the magnitude of each memory’s associated synaptic weights, so that stronger memories yield higher retrieved‑context activity during wake encoding than weaker ones. Because the model’s suppression mechanism reduces an item’s replay probability in proportion to its retrieved‑context activity, items with larger weights (strong memories) are more heavily suppressed at the onset of replay, while those with smaller weights (weaker memories) receive less suppression. When items have matched reward exposure, this dynamic would bias offline replay toward weaker memories.

      In the section titled “The influence of experience”, we updated a sentence to discuss this idea more explicitly: 

      “Such a suppression mechanism may be adaptive, allowing replay to benefit not only the most recently or strongly encoded items but also to provide opportunities for the consolidation of weaker or older memories, consistent with empirical evidence (e.g., Schapiro et al. 2018; Yu et al., 2024).”

      (5) Lines 186-200 - Perhaps I'm misunderstanding, but wouldn't it be trivial that an external cue at the end-item of Figure 7a would result in backward replay, simply because there is no potential for forward replay for sequences starting at the last item (there simply aren't any subsequent items)? The opposite is true, of course, for the first-item replay, which can't go backward. More generally, my understanding of the literature on forward vs backward replay is that neither is linked to the rodent's location. Both commonly happen at a resting station that is further away from the track. It seems as though the model's result may not hold if replay occurs away from the track (i.e. if a0 would be equal for both pre- and post-run).

      In studies where animals run back and forth on a linear track, replay events are decoded separately for left and right runs, identifying both forward and reverse sequences for each direction, for example using direction-specific place cell sequence templates. Accordingly, in our simulation of, e.g., Ambrose et al. (2016), we use two independent sequences, one for left runs and one for right runs (an approach that has been taken in prior replay modeling work). Crucially, our model assumes a context reset between running episodes, preventing the final item of one traversal from acquiring contextual associations with the first item of the next. As a result, learning in the two sequences remains independent, and when an external cue is presented at the track’s end, replay predominantly unfolds in the backward direction, only occasionally producing forward segments when the cue briefly reactivates an earlier sequence item before proceeding forward.

      We added a note to the section titled “The context-dependency of memory replay” to clarify this:

      “In our model, these patterns are identical to those in our simulation of Ambrose et al. (2016), which uses two independent sequences to mimic the two run directions. This is because the drifting context resets before each run sequence is encoded, with the pause between runs acting as an event boundary that prevents the final item of one traversal from associating with the first item of the next, thereby keeping learning in each direction independent.”
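
      The effect of a context reset between runs can be illustrated with a toy drifting-context encoder. The sketch is an assumption-laden simplification, not the model's code: items are one-hot vectors, the context update is a normalized blend with a hypothetical drift rate `beta`, and a "reset" is implemented as zeroing the context.

```python
import math

def one_hot(index, size):
    return [1.0 if k == index else 0.0 for k in range(size)]

def drift(context, item, beta=0.5):
    """Blend the current item into the context, then rescale to unit length."""
    mixed = [(1 - beta) * c + beta * f for c, f in zip(context, item)]
    norm = math.sqrt(sum(v * v for v in mixed))
    return [v / norm for v in mixed]

def encode_runs(runs, size, reset_between_runs, beta=0.5):
    """Return the context in which each item was encoded, keyed by item index."""
    contexts = {}
    context = [0.0] * size
    for run in runs:
        if reset_between_runs:
            context = [0.0] * size  # event boundary: start from a blank context
        for item in run:
            context = drift(context, one_hot(item, size), beta)
            contexts[item] = context
    return contexts

def overlap(a, b):
    return sum(x * y for x, y in zip(a, b))

runs = [[0, 1, 2], [3, 4, 5]]  # e.g., a left run followed by a right run
with_reset = encode_runs(runs, 6, reset_between_runs=True)
no_reset = encode_runs(runs, 6, reset_between_runs=False)
print(overlap(with_reset[2], with_reset[3]))  # last of run 1 vs first of run 2
print(overlap(no_reset[2], no_reset[3]))
```

      With the reset, the encoding contexts of the last item of one run and the first item of the next share no features, so nothing links the two sequences; without it, the carried-over context would bind them together.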

      To our knowledge, no study has observed a similar asymmetry when animals are fully removed from the track, although both types of replay can be observed when animals are away from the track. For example, Gupta et al. (2010) demonstrated that when animals replay trajectories far from their current location, the ratio of forward vs. backward replay appears more balanced. We now highlight this result in the manuscript and explain how it aligns with the predictions of our model:

      “For example, in tasks where the goal is positioned in the middle of an arm rather than at its end, CMR-replay predicts a more balanced ratio of forward and reverse replay, whereas the EVB model still predicts a dominance of reverse replay due to backward gain propagation from the reward. This contrast aligns with empirical findings showing that when the goal is located in the middle of an arm, replay events are more evenly split between forward and reverse directions (Gupta et al., 2010), whereas placing the goal at the end of a track produces a stronger bias toward reverse replay (Diba & Buzsaki 2007).” 

      Although no studies, to our knowledge, have observed a context-dependent asymmetry between forward and backward replay when the animal is away from the track, our model does posit conditions under which it could. Specifically, it predicts that deliberation on a specific memory, such as during planning, could generate an internal context input that biases replay: actively recalling the first item of a sequence may favor forward replay, while thinking about the last item may promote backward replay, even when the individual is physically distant from the track.

      We now discuss this prediction in the section titled “The context-dependency of memory replay”:

      “Our model also predicts that deliberation on a specific memory, such as during planning, could serve to elicit an internal context cue that biases replay: actively recalling the first item of a sequence may favor forward replay, while thinking about the last item may promote backward replay, even when the individual is physically distant from the track. While not explored here, this mechanism presents a potential avenue for future modeling and empirical work.”

      (6) The manuscript describes a study by Bendor & Wilson (2012) and tightly mimics their results. However, notably, that study did not find triggered replay immediately following sound presentation, but rather a general bias toward reactivation of the cued sequence over longer stretches of time. In other words, it seems that the model's results don't fully mirror the empirical results. One idea that came to mind is that perhaps it is the R/L context - not the first R/L item - that is cued in this study. This is in line with other TMR studies showing what may be seen as contextual reactivation. If the authors think that such a simulation may better mirror the empirical results, I encourage them to try. If not, however, this limitation should be discussed.

      Although our model predicts that replay is triggered immediately by the sound cue, it also predicts a sustained bias toward the cued sequence. Replay in our model unfolds across the rest phase as multiple successive events, so the bias observed in our sleep simulations indeed reflects a prolonged preference for the cued sequence.

      We now discuss this issue, acknowledging the discrepancy:

      “Bendor and Wilson (2012) found that sound cues during sleep did not trigger immediate replay, but instead biased reactivation toward the cued sequence over an extended period of time. While the model does exhibit some replay triggered immediately by the cue, it also captures the sustained bias toward the cued sequence over an extended period.”

      In addition, within this framework, context is modeled as a weighted average of the features associated with items. As a result, cueing the model with the first R/L item produces outcomes qualitatively similar to cueing it with a more extended R/L cue that incorporates features of additional items. This is because both approaches ultimately use context features unique to the two sides.

      (7) There is some discussion about replay's benefit to memory. One point of interest could be whether this benefit changes between wake and sleep. Relatedly, it would be interesting to see whether the proportion of forward replay, backward replay, or both correlated with memory benefits. I encourage the authors to extend the section on the function of replay and explore these questions.

      We thank the reviewer for this suggestion. Regarding differences in the contribution of wake and sleep to memory, our current simulations predict that compared to rest in the task environment, sleep is less biased toward initiating replay at specific items, leading to a more uniform benefit across all memories. Regarding the contributions of forward and backward replay, our model predicts that both strengthen bidirectional associations between items and contexts, benefiting memory in qualitatively similar ways. Furthermore, we suggest that the offline learning captured by our teacher-student simulations reflects consolidation processes that are specific to sleep.

      We have expanded the section titled “The influence of experience” to discuss these predictions of the model:

      “The results outlined above arise from the model's assumption that replay strengthens bidirectional associations between items and contexts to benefit memory. This assumption leads to several predictions about differences across replay types. First, the model predicts that sleep yields different memory benefits compared to rest in the task environment: Sleep is less biased toward initiating replay at specific items, resulting in a more uniform benefit across all memories. Second, the model predicts that forward and backward replay contribute to memory in qualitatively similar ways but tend to benefit different memories. This divergence arises because forward and backward replay exhibit distinct item preferences, with backward replay being more likely to include rewarded items, thereby preferentially benefiting those memories.”

      We also updated the “The function of replay” section to include our teacher-student speculation:

      “We speculate that the offline learning observed in these simulations corresponds to consolidation processes that operate specifically during sleep, when hippocampal-neocortical dynamics are especially tightly coupled (Klinzing et al., 2019).”

      (8) Replay has been mostly studied in rodents, with few exceptions, whereas CMR and similar models have mostly been used in humans. Although replay is considered a good model of episodic memory, it is still limited due to limited findings of sequential replay in humans and its reliance on very structured and inherently autocorrelated items (i.e., place fields). I'm wondering if the authors could speak to the implications of those limitations on the generalizability of their model. Relatedly, I wonder if the model could or does lead to generalization to some extent in a way that would align with the complementary learning systems framework.

      We appreciate these insightful comments. Traditionally, replay studies have focused on spatial tasks with autocorrelated item representations (e.g., place fields). However, an increasing number of human studies have demonstrated sequential replay using stimuli with distinct, unrelated representations. Our model is designed to accommodate both scenarios. In our current simulations, we employ orthogonal item representations while leveraging a shared, temporally autocorrelated context to link successive items. We anticipate that incorporating autocorrelated item representations would further enhance sequence memory by increasing the similarity between successive contexts. Overall, we believe that the model generalizes across a broad range of experimental settings, regardless of the degree of autocorrelation between items. Moreover, the underlying framework has been successfully applied to explain sequential memory in both spatial domains, explaining place cell firing properties (e.g., Howard et al., 2004), and in non-spatial domains, such as free recall experiments where items are arbitrarily related. 

      In the section titled “A context model of memory replay”, we added this comment to address this point:

      “Its contiguity bias stems from its use of shared, temporally autocorrelated context to link successive items, despite the orthogonal nature of individual item representations. This bias would be even stronger if items had overlapping representations, as observed in place fields.”
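
      Both claims in this passage can be illustrated with a toy drifting context. The sketch below makes several assumptions that are not taken from the manuscript: a normalized-blend context update with drift rate `beta`, unit-length item vectors, and a hypothetical `spill` parameter that leaks each item's activity into the next unit to imitate overlapping place fields.

```python
import math

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def item_vector(index, size, spill=0.0):
    """One-hot item; spill > 0 leaks activity into the next unit (overlap)."""
    v = [0.0] * size
    v[index] = 1.0
    if spill > 0.0:
        v[index + 1] = spill
    return unit(v)

def encoding_contexts(n_items, beta=0.5, spill=0.0):
    """Context present when each successive item is encoded."""
    size = n_items + 1  # spare unit so the last item can spill over
    context = [0.0] * size
    history = []
    for i in range(n_items):
        item = item_vector(i, size, spill)
        context = unit([(1 - beta) * c + beta * f for c, f in zip(context, item)])
        history.append(context)
    return history

def similarity_by_lag(history, anchor=0):
    """Similarity between the anchor item's context and each later context."""
    return [dot(history[anchor], history[anchor + lag])
            for lag in range(1, len(history) - anchor)]

orthogonal = similarity_by_lag(encoding_contexts(5))
overlapping = similarity_by_lag(encoding_contexts(5, spill=1.0))
print([round(s, 3) for s in orthogonal])
print([round(s, 3) for s in overlapping])
```

      Even with orthogonal items, context similarity falls off smoothly with lag, which is the source of the contiguity bias; adding overlap between neighboring items raises the similarity of adjacent contexts further.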

      Since CMR-replay learns distributed context representations where overlap across context vectors captures associative structure, and replay helps strengthen that overlap, this could indeed be viewed as consonant with complementary learning systems integration processes. 

      Reviewer #2 (Public Review):

      This manuscript proposes a model of replay that focuses on the relation between an item and its context, without considering the value of the item. The model simulates awake learning, awake replay, and sleep replay, and demonstrates parallels between memory phenomena driven by encoding strength, replay of sequence learning, and activation of nearest neighbor to infer causality. There is some discussion of the importance of suppression/inhibition to reduce activation of only dominant memories to be replayed, potentially boosting memories that are weakly encoded. Very nice replications of several key replay findings including the effect of reward and remote replay, demonstrating the equally salient cue of context for offline memory consolidation.

      I have no suggestions for the main body of the study, including methods and simulations, as the work is comprehensive, transparent, and well-described. However, I would like to understand how the CMR-replay model fits with the current understanding of the importance of excitation vs inhibition, remembering vs forgetting, activation vs deactivation, strengthening vs elimination of synapses, and even NREM vs REM as Schapiro has modeled. There seems to be a strong association with the efforts of the model to instantiate a memory as well as how that reinstantiation changes across time. But that is not all there is to consolidation. The specific roles of different brain states and how they might change replay are also an important consideration.

      We are gratified that the reviewer appreciated the work, and we agree that the paper would benefit from comment on the connections to these other features of consolidation.

      Excitation vs. inhibition: CMR-replay does not model variations in the excitation-inhibition balance across brain states (as in other models, e.g., Chenkov et al., 2017), since it does not include inhibitory connections. However, we posit that the experience-dependent suppression mechanism in the model might, in the brain, involve inhibitory processes. Supporting this idea, studies have observed increased inhibition with task repetition (Berners-Lee et al., 2022). We hypothesize that such mechanisms may underlie the observed inverse relationship between task experience and replay frequency in many studies. We discuss this in the section titled “A context model of memory replay”:

      “The proposal that a suppression mechanism plays a role in replay aligns with models that regulate place cell reactivation via inhibition (Malerba et al., 2016) and with empirical observations of increased hippocampal inhibitory interneuron activity with experience (Berners-Lee et al., 2022). Our model assumes the presence of such inhibitory mechanisms but does not explicitly model them.”

      Remembering/forgetting, activation/deactivation, and strengthening/elimination of synapses: The model does not simulate synaptic weight reduction or pruning, so it does not forget memories through the weakening of associated weights. However, forgetting can occur when a memory is replayed less frequently than others, leading to reduced activation of that memory compared to its competitors during context-driven retrieval. In the Discussion section, we acknowledge that a biologically implausible aspect of our model is that it implements only synaptic strengthening: 

      “Aspects of the model, such as its lack of regulation of the cumulative positive weight changes that can accrue through repeated replay, are biologically implausible (as biological learning results in both increases and decreases in synaptic weights) and limit the ability to engage with certain forms of low level neural data (e.g., changes in spine density over sleep periods; de Vivo et al., 2017; Maret et al., 2011). It will be useful for future work to explore model variants with more elements of biological plausibility.”

      Different brain states and NREM vs REM: Reviewer 1 also raised this important issue (see above). We have added the following thoughts on differences between these states and the relationship to our prior work to the Discussion section:

      “Our current simulations have focused on NREM, since the vast majority of electrophysiological studies of sleep replay have identified replay events in this stage. We have proposed in other work that replay during REM sleep may provide a complementary role to NREM sleep, allowing neocortical areas to reinstate remote, already-consolidated memories that need to be integrated with the memories that were recently encoded in the hippocampus and replayed during NREM (Singh et al., 2022). An extension of our model could undertake this kind of continual learning setup, where the student but not teacher network retains remote memories, and the driver of replay alternates between hippocampus (NREM) and cortex (REM) over the course of a night of simulated sleep. Other differences between stages of sleep and between sleep and wake states are likely to become important for a full account of how replay impacts memory. Our current model parsimoniously explains a range of differences between awake and sleep replay by assuming simple differences in initial conditions, but we expect many more characteristics of these states (e.g., neural activity levels, oscillatory profiles, neurotransmitter levels, etc.) will be useful to incorporate in the future.”

      We hope these points clarify the model’s scope and its potential for future extensions.

      Do the authors suggest that these replay systems are more universal to offline processes beyond episodic memory? What about procedural memories and working memory?

      We thank the reviewer for raising this important question. We have clarified in the manuscript:

      “We focus on the model as a formulation of hippocampal replay, capturing how the hippocampus may replay past experiences through simple and interpretable mechanisms.”

      With respect to other forms of memory, we now note that:

      “This motor memory simulation using a model of hippocampal replay is consistent with evidence that hippocampal replay can contribute to consolidating memories that are not hippocampally dependent at encoding (Schapiro et al., 2019; Sawangjit et al., 2018). It is possible that replay in other, more domain-specific areas could also contribute (Eichenlaub et al., 2020).”

      Though this is not a biophysical model per se, can the authors speak to the neuromodulatory milieus that give rise to the different types of replay?

      Our work aligns with the perspective proposed by Hasselmo (1999), which suggests that waking and sleep states differ in the degree to which hippocampal activity is driven by external inputs. Specifically, high acetylcholine levels during waking bias activity to flow into the hippocampus, while low acetylcholine levels during sleep allow hippocampal activity to influence other brain regions. Consistent with this view, our model posits that wake replay is more biased toward items associated with the current resting location due to the presence of external input during waking states. In the Discussion section, we have added a comment on this point:

      “Our view aligns with the theory proposed by Hasselmo (1999), which suggests that the degree of hippocampal activity driven by external inputs differs between waking and sleep states: High acetylcholine levels during wakefulness bias activity into the hippocampus, while low acetylcholine levels during slow-wave sleep allow hippocampal activity to influence other brain regions.”

      Reviewer #3 (Public Review):

      In this manuscript, Zhou et al. present a computational model of memory replay. Their model (CMR-replay) draws from temporal context models of human memory (e.g., TCM, CMR) and claims replay may be another instance of a context-guided memory process. During awake learning, CMR-replay (like its predecessors) encodes items alongside a drifting mental context that maintains a recency-weighted history of recently encoded contexts/items. In this way, the presently encoded item becomes associated with other recently learned items via their shared context representation - giving rise to typical effects in recall such as primacy, recency, and contiguity. Unlike its predecessors, CMR-replay has built-in replay periods. These replay periods are designed to approximate sleep or wakeful quiescence, in which an item is spontaneously reactivated, causing a subsequent cascade of item-context reactivations that further update the model's item-context associations.

      Using this model of replay, Zhou et al. were able to reproduce a variety of empirical findings in the replay literature: e.g., greater forward replay at the beginning of a track and more backward replay at the end; more replay for rewarded events; the occurrence of remote replay; reduced replay for repeated items, etc. Furthermore, the model diverges considerably (in implementation and predictions) from other prominent models of replay that, instead, emphasize replay as a way of predicting value from a reinforcement learning framing (i.e., EVB, expected value backup).

      Overall, I found the manuscript clear and easy to follow, despite not being a computational modeller myself. (Which is pretty commendable, I'd say). The model also was effective at capturing several important empirical results from the replay literature while relying on a concise set of mechanisms - which will have implications for subsequent theory-building in the field.

      With respect to weaknesses, additional details for some of the methods and results would help the readers better evaluate the data presented here (e.g., explicitly defining how the various 'proportion of replay' DVs were calculated).

      For example, for many of the simulations, the y-axis scale differs from the empirical data despite using comparable units, like the proportion of replay events (e.g., Figures 1B and C). Presumably, this was done to emphasize the similarity between the empirical and model data. But, as a reader, I often found myself doing the mental manipulation myself anyway to better evaluate how the model compared to the empirical data. Please consider using comparable y-axis ranges across empirical and simulated data wherever possible.

      We appreciate this point. As in many replay modeling studies, our primary goal is to provide a qualitative fit that demonstrates the general direction of differences between our model and empirical data, without engaging in detailed parameter fitting for a precise quantitative fit. Still, we agree that where possible, it is useful to better match the axes. We have updated figures 2B and 2C so that the y-axis scales are more directly comparable between the empirical and simulated data. 

      In a similar vein to the above point, while the DVs in the simulations/empirical data made intuitive sense, I wasn't always sure precisely how they were calculated. Consider the "proportion of replay" in Figure 1A. In the Methods (perhaps under Task Simulations), it should specify exactly how this proportion was calculated (e.g., proportions of all replay events, both forwards and backwards, combining across all simulations from Pre- and Post-run rest periods). In many of the examples, the proportions seem to possibly sum to 1 (e.g., Figure 1A), but in other cases, this doesn't seem to be true (e.g., Figure 3A). More clarity here is critical to help readers evaluate these data. Furthermore, sometimes the labels themselves are not the most informative. For example, in Figure 1A, the y-axis is "Proportion of replay" and in 1C it is the "Proportion of events". I presumed those were the same thing - the proportion of replay events - but it would be best if the axis labels were consistent across figures in this manuscript when they reflect the same DV.

      We appreciate these useful suggestions. We have revised the Methods section to explain in detail how DVs are calculated for each simulation. The revisions clarify the differences between related measures, such as those shown in Figures 1A and 1C, so that readers can more easily see how the DVs are defined and interpreted in each case. 

      Reviewer #4/Reviewing Editor (Public Review):

      Summary:

      With their 'CMR-replay' model, Zhou et al. demonstrate that the use of spontaneous neural cascades in a context-maintenance and retrieval (CMR) model significantly expands the range of captured memory phenomena.

      Strengths:

      The proposed model compellingly outperforms its CMR predecessor and, thus, makes important strides towards understanding the empirical memory literature, as well as highlighting a cognitive function of replay.

      Weaknesses:

      Competing accounts of replay are acknowledged but there are no formal comparisons and only CMR-replay predictions are visualized. Indeed, other than the CMR model, only one alternative account is given serious consideration: A variant of the 'Dyna-replay' architecture, originally developed in the machine learning literature (Sutton, 1990; Moore & Atkeson, 1993) and modified by Mattar et al (2018) such that previously experienced event-sequences get replayed based on their relevance to future gain. Mattar et al acknowledged that a realistic Dyna-replay mechanism would require a learned representation of transitions between perceptual and motor events, i.e., a 'cognitive map'. While Zhou et al. note that the CMR-replay model might provide such a complementary mechanism, they emphasize that their account captures replay characteristics that Dyna-replay does not (though it is unclear to what extent the reverse is also true).

      We thank the reviewer for these thoughtful comments and appreciate the opportunity to clarify our approach. Our goal in this work is to contrast two dominant perspectives in replay research: replay as a mechanism for learning reward predictions and replay as a process for memory consolidation. These models were chosen as representatives of their classes of models because they use simple and interpretable mechanisms that can simulate a wide range of replay phenomena, making them ideal for contrasting these two perspectives.

      Although we implemented CMR-replay as a straightforward example of the memory-focused view, we believe the proposed mechanisms could be extended to other architectures, such as recurrent neural networks, to produce similar results. We now discuss this possibility in the revised manuscript (see below). However, given our primary goal of providing a qualitative contrast of these two broad perspectives, we decided not to undertake simulations with additional individual models for this paper.

      Regarding the Mattar & Daw model, it is true that a mechanistic implementation would require a mechanism that avoids precomputing priorities before replay. However, the "need" component of their model already incorporates learned expectations of transitions between actions and events. Thus, the model's limitations are not due to the absence of a cognitive map.

      In contrast, while CMR-replay also accumulates memory associations that reflect experienced transitions among events, it generates several qualitatively distinct predictions compared to the Mattar & Daw model. As we note in the manuscript, these distinctions make CMR-replay a contrasting rather than complementary perspective.

      Another important consideration, however, is how CMR-replay compares to alternative mechanistic accounts of cognitive maps. For example, Recurrent Neural Networks are adept at detecting spatial and temporal dependencies in sequential input; these networks are being increasingly used to capture psychological and neuroscientific data (e.g., Zhang et al, 2020; Spoerer et al, 2020), including hippocampal replay specifically (Haga & Fukai, 2018). Another relevant framework is provided by Associative Learning Theory, in which bidirectional associations between static and transient stimulus elements are commonly used to explain contextual and cue-based phenomena, including associative retrieval of absent events (McLaren et al, 1989; Harris, 2006; Kokkola et al, 2019). Without proper integration with these modeling approaches, it is difficult to gauge the innovation and significance of CMR-replay, particularly since the model is applied post hoc to the relatively narrow domain of rodent maze navigation.

      First, we would like to clarify that our principal aim in this work is to characterize the nature of replay, rather than to model cognitive maps per se. Accordingly, CMR-replay is not designed to simulate head-direction signals, perform path integration, or explain the spatial firing properties of neurons during navigation. Instead, it focuses squarely on sequential replay phenomena, simulating classic rodent maze reactivation studies and human sequence-learning tasks. These simulations span a broad array of replay experimental paradigms to ensure extensive coverage of the replay findings reported across the literature. As such, the contribution of this work is in explaining the mechanisms and functional roles of replay, and demonstrating that a model that employs simple and interpretable memory mechanisms not only explains replay phenomena traditionally interpreted through a value-based lens but also accounts for findings not addressed by other memory-focused models.

      As the reviewer notes, CMR-replay shares features with other memory-focused models. However, to our knowledge, none of these related approaches have yet captured the full suite of empirical replay phenomena, suggesting the combination of mechanisms employed in CMR-replay is essential for explaining these phenomena. In the Discussion section, we now discuss the similarities between CMR-replay and related memory models and the possibility of integrating these approaches:

      “Our theory builds on a lineage of memory-focused models, demonstrating the power of this perspective in explaining phenomena that have often been attributed to the optimization of value-based predictions. In this work, we focus on CMR-replay, which exemplifies the memory-centric approach through a set of simple and interpretable mechanisms that we believe are broadly applicable across memory domains. Elements of CMR-replay share similarities with other models that adopt a memory-focused perspective. The model learns distributed context representations whose overlaps encode associations among items, echoing associative learning theories in which overlapping patterns capture stimulus similarity and learned associations (McLaren & Mackintosh 2002). Context evolves through bidirectional interactions between items and their contextual representations, mirroring the dynamics found in recurrent neural networks (Haga & Fukai 2018; Levenstein et al., 2024). However, these related approaches have not been shown to account for the present set of replay findings and lack mechanisms, such as reward-modulated encoding and experience-dependent suppression, that our simulations suggest are essential for capturing these phenomena. While not explored here, we believe these mechanisms could be integrated into architectures like recurrent neural networks (Levenstein et al., 2024) to support a broader range of replay dynamics.”

      Recommendations For The Authors

      Reviewer #1 (Recommendations For The Authors):

      (1) Lines 94-96: These lines may be better positioned earlier in the paragraph.

      We now introduce these lines earlier in the paragraph.

      (2) Line 103 - It's unclear to me what is meant by the statement that "the current context contains contexts associated with previous items". I understand why a slowly drifting context will coincide and therefore link with multiple items that progress rapidly in time, so multiple items will be linked to the same context and each item will be linked to multiple contexts. Is that the idea conveyed here or am I missing something? I'm similarly confused by line 129, which mentions that a context is updated by incorporating other items' contexts. How could a context contain other contexts?

      In the model, each item has an associated context that can be retrieved via Mfc. This is true even before learning, since Mfc is initialized as an identity matrix. During learning and replay, we have a drifting context c that is updated each time an item is presented. At each timestep, the model first retrieves the current item’s associated context cf by Mfc, and incorporates it into c. Equation #2 in the Methods section illustrates this procedure in detail. Because of this procedure, the drifting context c is a weighted sum of past items’ associated contexts. 

      We recognize that these descriptions can be confusing. We have updated the Results section to better distinguish the drifting context from items’ associated context. For example, we note that:

      “We represent the drifting context during learning and replay with c and an item's associated context with cf.”

      We have also updated our description of the context drift procedure to distinguish these two quantities: 

      “During awake encoding of a sequence of items, for each item f, the model retrieves its associated context cf via Mfc. The drifting context c incorporates the item's associated context cf and downweights its representation of previous items' associated contexts (Figure 1c). Thus, the context layer maintains a recency weighted sum of past and present items' associated contexts.”
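
      The distinction between the drifting context c and an item's associated context cf can be illustrated with a minimal sketch. It assumes the standard retrieved-context update c ← ρc + βcf, with ρ chosen so that c keeps unit length; Equation 2 in the Methods may differ in detail. The names `drift` and `encode_sequence`, the value β = 0.8, and the extra "start" unit are hypothetical choices for illustration, and Mfc is held at its identity initialization rather than being learned. This is not the simulation code.

```python
import numpy as np

def drift(c, c_f, beta):
    """One drift step: c <- rho * c + beta * c_f, with rho keeping |c| = 1."""
    c_f = c_f / np.linalg.norm(c_f)
    dot = float(c @ c_f)
    # rho from the standard retrieved-context normalization (an assumption here)
    rho = np.sqrt(1.0 + beta**2 * (dot**2 - 1.0)) - beta * dot
    return rho * c + beta * c_f

def encode_sequence(n_items=4, beta=0.8):
    """Present items 1..n_items once and return the final drifting context c."""
    dim = n_items + 1            # hypothetical extra unit 0 holds a start context
    M_fc = np.eye(dim)           # identity before learning; not updated in this sketch
    c = np.zeros(dim)
    c[0] = 1.0                   # unit-length start context
    for item in range(1, dim):
        f = np.zeros(dim)
        f[item] = 1.0            # one-hot item representation
        c_f = M_fc @ f           # the item's associated context
        c = drift(c, c_f, beta)  # fold it into the drifting context
    return c

c = encode_sequence()
print(np.round(c, 3))
```

      Under these assumed values, each new item's associated context enters c with weight β = 0.8 and every earlier entry shrinks by ρ = 0.6 per step, so c ends up as a recency-weighted sum of the associated contexts of all items presented so far.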

      (3) Figure 1b and 1d - please clarify which axis in the association matrices represents the item and the context.

      We have added labels to show what the axes represent in Figure 1.

      (4) The terms "experience" and "item" are used interchangeably and it may be best to stick to one term.

      We now use the term “item” wherever we describe the model results. 

      (5) The manuscript describes Figure 6 ahead of earlier figures - the authors may want to reorder their figures to improve readability.

      We appreciate this suggestion. We decided to keep the current figure organization since it allows us to group results into different themes and avoid redundancy. 

      (6) Lines 662-664 are repeated with a different ending, this is likely an error.

      We have fixed this error.

      Reviewer #3 (Recommendations For The Authors):

      Below, I have outlined some additional points that came to mind in reviewing the manuscript - in no particular order.

      (1) Figure 1: I found the ordering of panels a bit confusing in this figure, as the reading direction changes a couple of times in going from A to F. Would perhaps putting panel C in the bottom left corner and then D at the top right, with E and F below (also on the right) work?

      We agree that this improves the figure. We have restructured the ordering of panels in this figure. 

      (2) Simulation 1: When reading the intro/results for the first simulation (Figure 2a; Diba & Buzsáki, 2007; "When animals traverse a linear track...", page 6, line 186), it wasn't clear to me why pre-run rest would have any forward replay, particularly if pre-run implied that the animal had no experience with the track yet. But in the Methods this becomes clearer, as the model encodes the track eight times prior to the rest periods. Making this explicit in the text would make it easier to follow. Also, was there any reason why specifically eight sessions of awake learning, in particular, were used?

      We now make more explicit that the animals have experience with the track before pre-run rest recording:

      “Animals first acquire experience with a linear track by traversing it to collect a reward. Then, during the pre-run rest recording, forward replay predominates.”

      We included eight sessions of awake learning to match the number of sessions in Shin et al. (2019), since this simulation attempts to explain data from that study. After each repetition, the model engages in rest. We have revised the Methods section to indicate the motivation for this choice:

      “In the simulation that examines context-dependent forward and backward replay through experience (Figs. 2a and 5a), CMR-replay encodes an input sequence shown in Fig. 7a, which simulates a linear track run with no ambiguity in the direction of inputs, over eight awake learning sessions (as in Shin et al. 2019)”

      (3) Frequency of remote replay events: In the simulation based on Gupta et al, how frequently overall does remote replay occur? In the main text, the authors mention the mean frequency with which shortcut replay occurs (i.e., the mean proportion of replay events that contain a shortcut sequence = 0.0046), which was helpful. But, it also made me wonder about the likelihood of remote replay events. I would imagine that remote replay events are infrequent as well - given that it is considerably more likely to replay sequences from the local track, given the recency-weighted mental context. Reporting the above mean proportion for remote and local replay events would be helpful context for the reader.

      In Figure 4c, we report the proportion of remote replay in the two experimental conditions of Gupta et al. that we simulate. 

      (4) Point of clarification re: backwards replay: Is backwards replay less likely to occur than forward replay overall because of the forward asymmetry associated with these models? For example, for a backwards replay event to occur, the context would need to drift backwards at least five times in a row, in spite of a higher probability of moving one step forward at each of those steps. Am I getting that right?

      The reviewer’s interpretation is correct: CMR-replay is more likely to produce forward than backward replay in sleep because of its forward asymmetry. We note that this forward asymmetry leads to a high likelihood of forward replay in the section titled “The context-dependency of memory replay”:

      “As with prior retrieved context models (Howard & Kahana 2002; Polyn et al., 2009), CMR-replay encodes stronger forward than backward associations. This asymmetry exists because, during the first encoding of a sequence, an item's associated context contributes only to its ensuing items' encoding contexts. Therefore, after encoding, bringing back an item's associated context is more likely to reactivate its ensuing than preceding items, leading to forward asymmetric replay (Fig. 6d left).”

      (5) On terminating a replay period: "At any t, the replay period ends with a probability of 0.1 or if a task-irrelevant item is reactivated." (Figure 1 caption; see also pg 18, line 635). How was the 0.1 decided upon? Also, could you please add some detail as to what a 'task-irrelevant item' would be? From what I understood, the model only learns sequences that represent the points in a track - wouldn't all the points in the track be task-relevant?

      This value was arbitrarily chosen as a small value that allows probabilistic stopping. It was not motivated by prior modeling or a systematic search. We have added: “At each timestep, the replay period ends either with a stop probability of 0.1 or if a task-irrelevant item becomes reactivated. (The choice of the value 0.1 was arbitrary; future work could explore the implications of varying this parameter).” 

      In addition, we now explain in the paper that task irrelevant items “do not appear as inputs during awake encoding, but compete with task-relevant items for reactivation during replay, simulating the idea that other experiences likely compete with current experiences during periods of retrieval and reactivation.”
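
      To give a sense of what a stop probability of 0.1 implies, the following is a minimal sketch that considers probabilistic stopping only. It assumes the stop check is applied independently at every timestep, including the first, and it ignores termination by task-irrelevant items, so it gives an upper bound on typical replay length rather than the model's actual distribution. The function names are hypothetical and this is not the simulation code.

```python
import random

def expected_length(p_stop):
    """Mean number of timesteps until a per-timestep stop with probability p_stop."""
    return 1.0 / p_stop

def prob_at_least(k, p_stop):
    """Probability that a replay period lasts at least k timesteps."""
    return (1.0 - p_stop) ** (k - 1)

def simulate_lengths(p_stop, n_periods, seed=0):
    """Draw replay-period lengths by repeated per-timestep stop checks."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(n_periods):
        t = 1
        while rng.random() >= p_stop:   # continue with probability 1 - p_stop
            t += 1
        lengths.append(t)
    return lengths

lengths = simulate_lengths(0.1, 20000)
mean_length = sum(lengths) / len(lengths)
print(expected_length(0.1), round(mean_length, 2))
```

      Under these assumptions, the length of a replay period is geometrically distributed with a mean of 1/0.1 = 10 timesteps, and roughly two thirds of periods (0.9^4 ≈ 0.66) would last at least five timesteps. Competition from task-irrelevant items can only shorten these lengths.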

      (6) Minor typos:

      Turn all instances of "nonlocal" into "non-local", or vice versa

      "For rest at the end of a run, cexternal is the context associated with the final item in the sequence. For rest at the end of a run, cexternal is the context associated with the start item." (pg 20, line 663) - I believe this is a typo and that the second sentence should begin with "For rest at the START of a run".

      We have updated the manuscript to correct these typos. 

      (7) Code availability: I may have missed it, but it doesn't seem like the code is currently available for these simulations. Including the commented code in a public repository (Github, OSF) would be very useful in this case.

      We now include a Github link to our simulation code: https://github.com/schapirolab/CMR-replay.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews: 

      Reviewer #1 (Public review): 

      Summary: 

      The manuscript by Raices et al. provides some novel insights into the role and interactions between SPO-11 accessory proteins in C. elegans. The authors propose a model of meiotic DSB regulation, critical to our understanding of DSB formation and ultimately crossover regulation and accurate chromosome segregation. The work also emphasizes the commonalities and species-specific aspects of DSB regulation.

      Strengths: 

      This study capitalizes on the strengths of the C. elegans system to uncover genetic interactions between SPO-11 accessory proteins. In combination with physical interactions, the authors synthesize their findings into a model, which will serve as the basis for future work to determine mechanisms of DSB regulation.

      Weaknesses: 

      The methodology, although standard, still lacks some rigor, especially with the IPs. 

      Reviewer #2 (Public review): 

      Summary: 

      Meiotic recombination initiates with the formation of DNA double-strand breaks (DSBs), catalyzed by the conserved topoisomerase-like enzyme Spo11. Spo11 requires accessory factors that are poorly conserved across eukaryotes. Previous genetic studies have identified several proteins required for DSB formation in C. elegans to varying degrees; however, how these proteins interact with each other to recruit the DSB-forming machinery to chromosome axes remains unclear.

      In this study, Raices et al. characterized the biochemical and genetic interactions among proteins that are known to promote DSB formation during C. elegans meiosis. The authors examined pairwise interactions using yeast two-hybrid (Y2H) and co-immunoprecipitation and revealed an interaction between a chromatin-associated protein HIM-17 and a transcription factor XND-1. They further confirmed the previously known interaction between DSB-1 and SPO-11 and showed that DSB-1 also interacts with a nematode-specific protein, HIM-5, which is essential for DSB formation on the X chromosome. They also assessed genetic interactions among these proteins, categorizing them into four epistasis groups by comparing phenotypes in double vs. single mutants. Combining these results, the authors proposed a model of how these proteins interact with chromatin loops and are recruited to chromosome axes, offering insights into the process in C. elegans compared to other organisms.

      Weaknesses: 

      This work relies heavily on Y2H, which is notorious for having high rates of false positives and false negatives. Although the interactions between HIM-17 and XND-1 and between DSB-1 and HIM-5 were validated by co-IP, the significance of these interactions was not tested in vivo. Cataloging Y2H and genetic interactions does not yield much more insight. The model proposed in Figure 4 is also highly speculative. 

      Reviewer #3 (Public review): 

      The goal of this work is to understand the regulation of double-strand break formation during meiosis in C. elegans. The authors have analyzed physical and genetic interactions among a subset of factors that have been previously implicated in DSB formation or the number or timing of DSBs: CEP-1, DSB-1, DSB-2, DSB-3, HIM-5, HIM-17, MRE-11, REC-1, PARG-1, and XND-1.

      The 10 proteins that are analyzed here include a diverse set of factors with different functions, based on prior analyses in many published studies. The term "Spo11 accessory factors" has been used in the meiosis literature to describe proteins that directly promote Spo11 cleavage activity, rather than factors that are important for the expression of meiotic proteins or that influence the genome-wide distribution or timing of DSBs. Based on this definition, the known SPO-11 accessory factors in C. elegans include DSB-1, DSB-2, DSB-3, and the MRN complex (at least MRE-11 and RAD-50). These are all homologs of proteins that have been studied biochemically and structurally in other organisms. DSB-1 & DSB-2 are homologs of Rec114, while DSB-3 is a homolog of Mei4. Biochemical and structural studies have shown that Rec114 and Mei4 directly modulate Spo11 activity by recruiting Spo11 to chromatin and promoting its dimerization, which is essential for cleavage. The other factors analyzed in this study affect the timing, distribution, or number of RAD-51 foci, but they likely do so indirectly. As elaborated below, XND-1 and HIM-17 are transcription factors that modulate the expression of other meiotic genes, and their role in DSB formation is parsimoniously explained by this regulatory activity. The roles of HIM-5 and REC-1 remain unclear; the reported localization of HIM-5 to autosomes is consistent with a role in transcription (the autosomes are transcriptionally active in the germline, while the X chromosome is largely silent), but its loss-of-function phenotypes are much more limited than those of HIM-17 and XND-1, so it may play a more direct role in DSB formation. The roles of CEP-1 (a p53 homolog) and PARG-1 are also ambiguous, but their homologs in other organisms contribute to DNA repair rather than DSB formation.

      We appreciate the reviewer’s clarification. However, the definition of Spo11 accessory factors varies across the literature. Only Keeney and colleagues define these as proteins that physically associate with and activate Spo11 to catalyze DSB formation (Keeney, Lange & Mohibullah, 2014; Lam & Keeney, 2015). In contrast, other authors have used the term more broadly to refer to proteins that promote or regulate Spo11-dependent DSB formation, without necessarily implying a direct interaction with Spo11 (e.g., Panizza et al., 2011; Robert et al., 2016; Stanzione et al., 2016; Li et al., 2021; Lange et al., 2016). Thus, our usage of the term follows this broader functional definition.

      An additional significant limitation of the study, as stated in my initial review, is that much of the analysis here relies on cytological visualization of RAD-51 foci as a proxy for DSBs. RAD-51 associates transiently with DSB sites as they undergo repair and is thus limited in its ability to reveal details about the timing or abundance of DSBs since its loading and removal involve additional steps that may be influenced by the factors being analyzed. 

      We agree with the reviewer that counting RAD-51 foci provides only an indirect measure of SPO-11–dependent DSBs, as RAD-51 marks sites of repair rather than the breaks themselves. However, we would like to clarify that our current study does not rely on RAD-51 foci quantification for any of the analyses or conclusions presented. None of the figures or datasets in this manuscript are based on RAD-51 cytology. Instead, our conclusions are drawn from genetic interactions, biochemical assays, and protein–protein interaction analyses.

      The paper focuses extensively on HIM-5, which was previously shown through genetic and cytological analysis to be important for breaks on the X chromosome. The revised manuscript still claims that "HIM-5 mediates interactions with the different accessory factors sub-groups, providing insights into how components on the DNA loops may interact with the chromosome axis." The weak interactions between HIM-5 and DSB-1/2 detected in the Y2H assay do not convincingly support such a role. The idea that HIM-5 directly promotes break formation is also inconsistent with genetic data showing that him-5 mutants lack breaks on the X chromosomes, while HIM-5 has been shown to be enriched on autosomes. Additionally, as noted in my comment to the authors, the localization data for HIM-5 shown in this paper are discordant with prior studies; this discrepancy should be addressed experimentally.

      We appreciate the reviewer’s concerns regarding the interpretation of HIM-5 function.  The weak Y2H interactions between HIM-5 and DSB-1 are not interpreted as direct biochemical evidence of a strong physical interaction, but rather as a potential point of regulatory connection between these pathways. Importantly, these Y2H data are further supported by co-immunoprecipitation experiments, genetic interactions, and the observed mislocalization of HIM-5 in the absence of DSB-1. Together, these complementary results strengthen our conclusion that HIM-5 functionally associates with DSB-promoting complexes.

      Regarding HIM-5 localization, the pattern we observe using both anti-GFP staining of the eaIs4 transgene (Phim-5::him-5::GFP) and anti-HA staining of the HIM-5::HA strain is consistent with that reported by McClendon et al. (2016), who validated the same eaIs4 transgene. Although the pattern difers slightly from Meneely et al. (2012), that used a HIM5 antibody that is no longer functional and that has been discontinued by the commercial source. In this prior study, a weak signal was detected in the mitotic region and late pachytene, but stronger signal was seen in early to mid-pachytene. Our imaging— optimized for low background and stable signal—similarly shows robust HIM-5 localization in early and mid-pachytene, supporting the reliability of our GFP and HA-tagged analyses.

      The recent analysis of DSB formation in C. elegans males (Engebrecht et al; PloS Genetics; PMID: 41124211) shows that in the absence of him-5 there is a significant reduction of CO designation (measured as COSA-1 foci) on autosomes. This study strongly supports a direct and general role for HIM-5 in crossover formation, on both autosomes and the hermaphrodite X.

      This paper describes REC-1 and HIM-5 as paralogs, based on prior analysis in a paper that included some of the same authors (Chung et al., 2015; DOI 10.1101/gad.266056.115). In my initial review I mentioned that this earlier conclusion was likely incorrect and should not be propagated uncritically here. Since the authors have rebutted this comment rather than amending it, I feel it is important to explain my concerns about the conclusions of the previous study. Chung et al. found a small region of potential homology between the C. elegans rec-1 and him-5 genes and also reported that him-5; rec-1 double mutants have more severe defects than either single mutant, indicative of a stronger reduction in DSBs. Based on these observations and an additional argument based on microsynteny, they concluded that these two genes arose through recent duplication and divergence. However, as they noted, genes resembling rec-1 are absent from all other Caenorhabditis species, even those most closely related to C. elegans. The hypothesis that two genes are paralogs that arose through duplication and divergence is thus based on their presence in a single species, in the absence of extensive homology or evidence for conserved molecular function. Further, the hypothesis that gene duplication and divergence has given rise to two paralogs that share no evident structural similarity or common interaction partners in the few million years since C. elegans diverged from its closest known relatives is implausible. In contrast, DSB-1 and DSB-2 are both homologs of Rec114 that clearly arose through duplication and divergence within the Caenorhabditis lineage, but much earlier than the proposed split between REC-1 and HIM-5. Two genes that can be unambiguously identified as dsb-1 and dsb-2 are present in genomes throughout the Elegans supergroup and absent in the Angaria supergroup, placing the duplication event at around 18-30 MYA, yet DSB-1 and DSB-2 share much greater similarity in their amino acid sequence, predicted structure, and function than HIM-5 and REC-1. Further, Raices place HIM-5 and REC-1 in different functional complexes (Figure 3B).

      We respectfully disagree with the reviewer’s characterization of the relationship between HIM-5 and REC-1. Our use of the term “paralog” follows the conclusions of Chung et al. (2015), a peer-reviewed study that provided both sequence and microsynteny evidence supporting this relationship. While we acknowledge that the degree of sequence conservation is limited, the evolutionary scenario proposed by Chung et al. remains the only published framework addressing this question. Further, the degree of homology between either HIM-5 or REC-1 and the ancestral locus is similar to that observed for DSB-1 and DSB-2 with REC-114 (Hinman et al., 2021). We therefore retain the use of the term “paralog” in reference to these genes. Importantly, our conclusions regarding their distinct molecular and functional roles are independent of this classification.

      The authors acknowledge that HIM-17 is a transcription factor that regulates many meiotic genes. Like HIM-17, XND-1 is cytologically enriched along the autosomes in germline nuclei, suggestive of a role in transcription. The Reinke lab performed ChIP-seq in a strain expressing an XND-1::GFP fusion protein and showed that it binds to promoter regions, many of which overlap with the HIM-17-regulated promoters characterized by the Ahringer lab (doi: 10.1126/sciadv.abo4082). Work from the Yanowitz lab has shown that XND-1 influences the transcription of many other genes involved in meiosis (doi: 10.1534/g3.116.035725) and work from the Colaiacovo lab has shown that XND-1 regulates the expression of CRA-1 (doi: 10.1371/journal.pgen.1005029). Additionally, loss of HIM-17 or XND-1 causes pleiotropic phenotypes, consistent with a broad role in gene regulation. Collectively, these data indicate that XND-1 and HIM-17 are transcription factors that are important for the proper expression of many germline-expressed genes. Thus, as stated above, the roles of HIM-17 and XND-1 in DSB formation, as well as their effects on histone modification, are parsimoniously explained by their regulation of the expression of factors that contribute more directly to DSB formation and chromatin modification. I feel strongly that transcription factors should not be described as "SPO-11 accessory factors." 

      The ChIP analysis of XND-1 binding sites (using the XND-1::GFP transgene we provided to the Reinke lab) was performed, and Table S3 in the Ahringer paper suggests it is found at germline promoters, although the analysis is not actually provided. We completely agree that at least a subset of XND-1 functions is explained by its regulation of transcriptional targets (as we previously showed for HIM-5). However, like the MES proteins, a subset of which are also autosomal and impact X chromosome gene expression, XND-1 could also be directly regulating chromatin architecture, which could have profound effects on DSB formation. As stated in our prior comments, precedent for the involvement of a chromatin factor in DSB formation is provided by yeast Spp1.

      Recommendations for the authors: 

      Editor comments: 

      As you can see, the reviewers have additional comments, and the authors can include revisions to address those points prior to publicizing 'a version of record' (e.g. hatching rate assay mentioned by reviewer #1). This type of study, trying to catalog interactions of many factors, inevitably has loose ends, but in my opinion, it does not reduce the value of the study, as long as statements are not misleading. I suggest that the authors address issues by making changes to the main text. After the next round of adjustments by authors, I feel that it will be ready for a version of record, based on the spirit of the current eLife publication model. 

      Reviewer #1 (Recommendations for the authors): 

      I still have concerns about the HIM-17 IP and immunoblot probing with XND-1 antibodies. While the newly provided whole extract immunoblot clearly shows an XND-1-specific band that goes away in the mutant extracts, there are additional bands that are recognized, and the pattern looks different from that of the input in Figure 1B. Additionally, there is still a band of the corresponding size in the IPs from extracts not containing the tagged allele of HIM-17, calling into question whether XND-1 is specifically pulled down.

      The authors did not include the hatching rate as pointed out in the original reviews. In the rebuttal: 

      "Great question. I guess we need to do this while back out for review. If anyone has suggestions of what to say here. Clearly we overlooked this point but do have the strain." 

      We thank the reviewer for this suggestion. We had intended to include a hatching analysis; however, during the course of this work we discovered that our him-17 stock had acquired one or more additional linked mutations that altered its phenotype and led to inconsistent results. This strain was used to rederive the him-17; eaIs4 double mutant after our original did not survive freeze/thaw. Given the abnormal behavior observed in this line, we concluded that proceeding with the hatching assays could yield unreliable data. We are currently reestablishing a verified him-17 strain, but in the interest of accuracy and reproducibility, we have restricted our analysis in this manuscript to validated datasets derived from confirmed strains.

      Reviewer #2 (Recommendations for the authors): 

      The authors have addressed most of the previous concerns and substantially improved the manuscript. The new data demonstrate that HIM-5 localization depends on DSB-1, and together with the Y2H and co-IP results, strengthen the link between HIM-5 and the DSB-forming machinery in C. elegans. The remaining points are outlined below:

      Specific comments: 

      The font size of the text and labels in the figures is very small and hardly legible. Please enlarge them and make them clearly visible (Fig 1A, 1B, 2A, 2B, 2C, 2D, 2E, 3A, 3B, 3C, 3D, 3F).

      Done

      Although the authors have addressed the specificity of the XND-1 antibody, it remains unclear whether the boxed band is specific to the him-17::3xHA IP, since the same band appears in the control IP, albeit with lower intensity (Fig 1B). Is the ~100 kDa band in the him-17::3xHA IP a modified form of XND-1? While antibody specificity was previously demonstrated by IF using xnd-1 mutants, it would be ideal to confirm this on a western blot as well.

      A Western blot performed using whole cell extracts and probed with the anti-XND-1 antibody has been provided in the revised version of the manuscript (Fig. S1A). This confirms that the antibody specifically recognizes XND-1 protein. We believe that the ~100 kDa band mentioned by the reviewer is likely to be a non-specific cross-reacting band detected by the antibody, since a band of the same MW was also detected in xnd-1 null mutants (Fig. S1A).

      Regarding the IP negative controls, we are firmly convinced that the boxed band is specific, and the fact that a (very) low-intensity band is also found in the negative control should not undermine the validity of the specific HIM-17-XND-1 interaction. There are many similar examples in the literature, as it is widely acknowledged among biochemists that some proteins may “stick” to the beads due to their intrinsic biochemical properties despite the use of highly stringent IP buffers. However, the high level of enrichment detected in the IP (as also underlined by the reviewer) corroborates that XND-1 specifically immunoprecipitates with HIM-17, despite low, non-specific binding to the HA beads. If the interaction between XND-1 and HIM-17 were non-specific, we would expect the band in the IP and the band in the negative control to be of very similar intensity, which is clearly not the case.

      Although co-IP is generally not considered a strictly quantitative assay, we want to emphasize that a comparable amount of nuclear extract was used in both samples, as evidenced by the inputs. If anything, slightly less nuclear extract was used in the him-17::3xHA; him-5::GFP::3xFLAG sample than in the him-5::GFP::3xFLAG negative control, corroborating the above-mentioned points.

      Lastly, it is crucial to mention that mass spectrometry analyses performed on HIM-17::3xHA pulldowns show XND-1 as a highly enriched interacting protein (Blazickova et al.; 2025 Nature Comms.), which strongly supports our co-IP results.

      The subheading "HIM-5 is the essential factor for meiotic breaks in the X chromosome" does not accurately represent the work described in the Results or in Figure 1. I disagree with the authors' response to the earlier criticism. The issue is not merely semantic. The data do not demonstrate that HIM-5 is required for DSB formation on the X chromosome - this conclusion can only be inferred. What Figure 1 shows is that XND-1 and HIM-17 interact, and that pie-1p-driven HIM-5 expression can partially rescue meiotic defects of him-17 mutants. This supports the conclusion that him-5 is a target of HIM-17/XND-1 in promoting CO formation on the X chromosome. However, the data provide no direct evidence for the claim stated in the subheading. I strongly encourage authors to revise the subheading to more accurately represent the findings presented in the paper. 

      After considering the reviewer’s comments, we have revised the subheading to more accurately describe our findings.

      In Fig 1C, please fix the typo in the last row - "pie1p::him5-::GFP" to "pie-1p::him-5::GFP".

      Done

      In Fig 2C, "p" is missing from the label on the right for Phim-5::him-5::GFP.

      Done

      In Fig 3I, bring the labels (DSB-1/2/3) at the lower right to the front.

      Done

      In Concluding Remarks, please fix the typo "frequently".

      Done

      Reviewer #3 (Recommendations for the authors): 

      The experiments that analyze HIM-5 in dsb-1 mutants should be repeated using antibodies against the endogenous HIM-5 protein, and localization of the HIM-5::HA and HIM-5::GFP proteins should be compared directly to antibody staining. This work uses an epitope-tagged protein and a GFP-tagged protein to analyze the localization of HIM-5, while prior work (Meneely et al., 2012) used an antibody against the endogenous protein. In Figures 2 and S4 of this paper, neither HIM-5::HA nor HIM-5::GFP appears to localize strongly to chromatin, and autosomal enrichment of HIM-5, as previously reported for the endogenous protein based on antibody staining, is not evident. Moreover, HIM-5::GFP and HIM-5::HA look different from each other, and neither resembles the low-resolution images shown in Figure 6 in Meneely et al. 2012, which showed nuclear staining throughout the germline, including in the mitotic zone, and also in somatic sheath cells. Given the differences in localization between the tagged transgenes and the endogenous protein, it is important to analyze the behavior of the endogenous, untagged protein. A minor issue: a wild-type control should also be shown for HIM-5::HA in Figure S4.

      A wild-type control has been added to Figure S4.

      Evidence that XND-1 and HIM-17 form a complex is weak; it is supported by the Y2H and co-IP data but opposed by functional analysis or localization. The diversity of proteins found in the co-IP of HIM-17::GFP (Table S2) indicates that these interactions are unlikely to be specific. The independent localization of these proteins to chromatin is clear evidence that they do not form an obligate complex; additionally, they have been found to regulate distinct (although overlapping) sets of genes. The predicted structure generated by AlphaFold3 has very low confidence and should not be taken as evidence for an interaction. The newly added argument that the apparent lack of overlap between HIM-17 and XND-1 is due to the distance between the HA tag on HIM-17 and XND-1 is flawed and should be removed: the extended C-terminus in the predicted AlphaFold3 structure of HIM-17 has been interpreted as if it were a structured domain. Moreover, the predicted distance of 180 Å (18 nm) is comparable to the distance between a fluorophore on a secondary antibody and the epitope recognized by the primary antibody (~20-25 nm) and is far below the resolution limit of light microscopy.

      We appreciate the reviewer’s thoughtful comment. The evidence supporting a physical interaction between XND-1 and HIM-17 comes not only from our co-IP experiments but also from a recent independent study in which MS analyses were conducted on HIM-17::3xHA pull-downs to identify novel HIM-17 interactors (Blazickova et al.; 2025 Nature Comms). In that study, XND-1 was likewise identified as a highly enriched putative HIM-17 interactor. We acknowledge that their chromatin localization patterns are distinct and that they regulate overlapping but not identical sets of genes. However, protein–protein interactions in meiosis are often transient or context-dependent, and may not necessarily result in co-localization detectable by microscopy. In line with this, the same study reported a similar situation for BRA-2 and HIM-17, which were shown to interact biochemically despite the absence of overlapping staining patterns.

      Minor issues: 

      The images shown in Panel D in Figure 1 seem to have very different resolutions; the HTP3/HIM-17 colocalization image is particularly blurry/low-resolution and should be replaced. The contrast between blue and green cannot be seen clearly; colors with stronger contrast should be used, and grayscale images should also be shown for individual channels. High-resolution images should probably be included for all of the factors analyzed here to facilitate comparisons.

    1. Author response:

      Reviewer #1:

      We thank the reviewer for this important point. Beyond long reaction times, we did not originally exclude participants based on low EMA variability. We agree this is a relevant concern, particularly given the need to add small random noise to some EMA series for model convergence. In the revised manuscript, we will assess additional indicators of careless responding, including within-person EMA variability (e.g., standard deviation or proportion of modal responses) following Jaso et al., 2022 criteria. We will conduct sensitivity analyses excluding low-variability responses or participants and report whether these checks affect the robustness of the results. We will also clarify in the Discussion that minimal EMA variance may reflect either true affective stability or reduced engagement, and discuss how this ambiguity may affect interpretation.
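
      As an illustration only, the two within-person variability indicators named above (standard deviation and proportion of modal responses) could be computed along the following lines. This is a minimal sketch, not the authors' analysis code: the function names, the choice of population rather than sample standard deviation, and the example ratings are all assumptions made for the example, and no exclusion threshold from Jaso et al. (2022) is implied.

```python
from collections import Counter
from statistics import pstdev


def within_person_sd(responses):
    """Population standard deviation of one participant's EMA ratings.

    Assumption: population SD is used; a sample SD would serve equally well.
    """
    return pstdev(responses)


def modal_proportion(responses):
    """Share of ratings that equal the participant's most frequent rating."""
    counts = Counter(responses)
    return max(counts.values()) / len(responses)


# Hypothetical EMA series for two participants (not real data).
flat_series = [3, 3, 3, 3, 3, 3]      # no variability at all
varied_series = [1, 4, 2, 5, 3, 4]    # the rating 4 occurs twice

flat_sd = within_person_sd(flat_series)
flat_modal = modal_proportion(flat_series)
varied_modal = modal_proportion(varied_series)
```

      A series with zero standard deviation and a modal proportion of 1 would be flagged by either indicator, but, as noted above, such a pattern could reflect true affective stability as well as reduced engagement.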

      Reviewer #2:

      We thank the reviewer for raising this fundamental conceptual concern. We agree that more research is needed to fully understand the processes captured by DQRT. In the revised manuscript, we will more clearly reference and summarize prior validation work from our lab providing strong support for a cognitive characterization of DQRT as a measure of cognitive processing speed, while also explicitly acknowledging potential confounds and limitations (Teckentrup et al., 2025). We will clarify that our DQRT computation followed those validated procedures, including exclusion of extreme values above the sample-specific median + 2 SD. In addition, consistent with Reviewer #1’s comment, we will expand the Discussion of how potential careless responding and non-cognitive factors may influence DQRT. We will further tone down language implying causal inference.
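
      For readers unfamiliar with the exclusion rule mentioned above (values above the sample-specific median + 2 SD), a minimal sketch is given here. It is not the validated procedure of Teckentrup et al. (2025): the function name, the use of the sample standard deviation, and the example response times are assumptions for illustration only.

```python
from statistics import median, stdev


def exclude_extreme(values, n_sd=2.0):
    """Drop values above median + n_sd * SD of the sample.

    Assumption: SD is the sample standard deviation of the same values.
    Returns the retained values and the cutoff that was applied.
    """
    cutoff = median(values) + n_sd * stdev(values)
    kept = [v for v in values if v <= cutoff]
    return kept, cutoff


# Hypothetical response times in seconds (not real data).
times = [1.0, 1.2, 0.9, 1.1, 1.0, 10.0]
kept_times, cutoff = exclude_extreme(times)
```

      Because the cutoff is anchored on the median rather than the mean, a single very slow response shifts the center only slightly, although it still inflates the SD term.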

      References

      Jaso, B. A., Kraus, N. I., & Heller, A. S. (2022). Identification of careless responding in ecological momentary assessment research: From posthoc analyses to real-time data monitoring. Psychological Methods, 27(6), 958.

      Teckentrup, V., Rosická, A. M., Donegan, K. R., Gallagher, E., Hanlon, A. K., & Gillan, C. M. (2025). Digital questionnaire response time (DQRT): A ubiquitous and low-cost digital assay of cognitive processing speed. Behavior Research Methods, 57(7), 200.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Drosophila larval type II neuroblasts generate diverse types of neurons by sequentially expressing different temporal identity genes during development. Previous studies have shown that the transition from early temporal identity genes (such as Chinmo and Imp) to late temporal identity genes (such as Syp and Broad) depends on the activation of the expression of EcR by Seven-up (Svp) and progression through the G1/S transition of the cell cycle. In this study, Chaya and Syed examined whether the expression of Syp and EcR is regulated by cell cycle and cytokinesis by knocking down CDK1 or Pav, respectively, throughout development or at specific developmental stages. They find that knocking down CDK1 or Pav either in all type II neuroblasts throughout development or in single-type neuroblast clones after larval hatching consistently leads to failure to activate late temporal identity genes Syp and EcR. To determine whether the failure of the activation of Syp and EcR is due to impaired Svp expression, they also examined Svp expression using a Svp-lacZ reporter line. They find that Svp is expressed normally in CDK1 RNAi neuroblasts. Further, knocking down CDK1 or Pav after Svp activation still leads to loss of Syp and EcR expression. Finally, they also extended their analysis to type I neuroblasts. They find that knocking down CDK1 or Pav, either at 0 hours or at 42 hours after larval hatching, also results in loss of Syp and EcR expression in type I neuroblasts. Based on these findings, the authors conclude that the cell cycle and cytokinesis are required for the transition from early to late temporal identity genes in both types of neuroblasts. These findings add mechanistic details to our understanding of the temporal patterning of Drosophila larval neuroblasts.

      Strengths:

      The data presented in the paper are solid and largely support their conclusion. Images are of high quality. The manuscript is well-written and clear.

      We appreciate the reviewer’s detailed summary and recognition of the study’s strengths.

      Weaknesses:

      The quantifications of the expression of temporal identity genes and the interpretation of some of the data could be more rigorous.

      (1) Expression of temporal identity genes may not be just positive or negative. Therefore, it would be more rigorous to quantify the expression of Imp, Syp, and EcR based on the staining intensity rather than simply counting the number of neuroblasts that are positive for these genes, which can be very subjective. Or the authors should define clearly what qualifies as "positive" (e.g., a staining intensity at least 2x background).

      We thank the reviewer for this helpful suggestion. In the new version, we have now clarified how positive expression was defined and added more details of our quantification strategy to the Methods section (page 11, lines 380-388; lines 426-434 in tracked changes file). Fluorescence intensity for each neuroblast was normalized to the mean intensity of neighboring wild-type neuroblasts imaged in the same field. A neuroblast was considered positive for a given factor when its normalized nuclear intensity was at least 2× the local background. This scoring criterion was applied uniformly across all genotypes and time points. All quantifications were performed on the raw LSM files in Fiji prior to assembling the figure panels.
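
      The scoring rule described above can be illustrated with a short sketch. It is a hypothetical reconstruction, not the Fiji workflow used for the manuscript: the function name, argument names, and intensity values are invented, and it assumes that both the nuclear signal and the local background are divided by the same wild-type reference before the 2× comparison.

```python
from statistics import mean


def score_neuroblast(nuclear, background, wt_neighbors, fold=2.0):
    """Return (normalized nuclear intensity, positive call).

    Assumption: signal and local background are both normalized to the mean
    intensity of neighboring wild-type neuroblasts in the same field, and a
    neuroblast is positive when the signal is at least `fold` x background.
    """
    reference = mean(wt_neighbors)
    norm_signal = nuclear / reference
    norm_background = background / reference
    return norm_signal, norm_signal >= fold * norm_background


# Hypothetical raw intensities in arbitrary units (not real measurements).
wt_field = [800, 1200]
bright_norm, bright_positive = score_neuroblast(400, 100, wt_field)
dim_norm, dim_positive = score_neuroblast(150, 100, wt_field)
```

      Under this assumption the normalization cancels in the positive/negative call itself; its main use would be to make intensities comparable across fields and genotypes.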

      (2) The finding that inhibiting cytokinesis without affecting nuclear divisions by knocking down Pav leads to the loss of expression of Syp and EcR does not support their conclusion that nuclear division is also essential for the early-late gene expression switch in type II NSCs (at the bottom of the left column on page 5). No experiments were done in this study to specifically block nuclear division. This conclusion should be revised.

      We blocked both cell cycle progression and cytokinesis, and both these manipulations affected temporal gene transitions, suggesting that both cell cycle and cytokinesis are essential. To our knowledge, no mechanism/tool exists that selectively blocks nuclear division while leaving cell cycle progression intact. We have added more clarification on page 4, line 123 onwards (lines 126 onwards in tracked changes file).

      (3) Knocking down CDK1 in single random neuroblast clones does not make the CDK1 knockdown neuroblast develop in the same environment (except still in the same brain) as wild-type neuroblast lineages. It does not help address the concern whether "type 2 NSCS with cell cycle arrest failed to undergo normal temporal progression is indirectly due to a lack of feedback signaling from their progeny", as discussed (from the bottom of the right column on page 9 to the top of the left column on page 10). The CDK1 knockdown neuroblasts do not divide to produce progeny and thus do not receive a feedback signal from their progeny as wild-type neuroblasts do. Therefore, it cannot be ruled out that the loss of Syp and EcR expression in CDK1 knockdown neuroblasts is due to the lack of the feedback signal from their progeny. This part of the discussion needs clarification.

      We thank the reviewer for raising this critical point. We agree and have clarified our interpretations and the limitations of our study in the revised text on page 8, lines 278-282 (lines 296-300 in tracked changes file).

      (4) In Figure 2I, there is a clear EcR staining signal in the clone, which contradicts the quantification data in Figure 2J that EcR is absent in Pav RNAi neuroblasts. The authors should verify that the image and quantification data are consistent and correct.

      When cytokinesis is blocked using pav-RNAi, the neuroblasts become extremely large and multinucleated. In some large pav RNAi clones, we observed a weak EcR signal near the cell membrane. However, more importantly, none of the nuclear compartments showed detectable EcR staining, where EcR is typically localized. We selected a representative nuclear image for the figure panel. To clarify this observation, we have now added an explanatory note to the discussion section on page 8, lines 283-291 (lines 301-309 in tracked changes file).

      Reviewer #2 (Public review):

      Summary:

      Neural stem cells produce a wide variety of neurons during development. The regulatory mechanisms of neural diversity are based on the spatial and temporal patterning of neural stem cells. Although the molecular basis of spatial patterning is well-understood, the temporal patterning mechanism remains unclear. In this manuscript, the authors focused on the roles of cell cycle progression and cytokinesis in temporal patterning and found that both are involved in this process.

      Strengths:

      They conducted RNAi-mediated disruption on cell cycle progression and cytokinesis. As they expected, both disruptions affected temporal patterning in NSCs.

      We appreciate the reviewer’s positive assessment of our experimental results.

      Weaknesses:

      Although the authors showed clear results, they needed to provide additional data to support their conclusion sufficiently.

      For example, they need to identify type II NSCs using molecular markers (Ase/Dpn). The authors are encouraged to provide a more detailed explanation of each experiment. The current version of the manuscript is difficult for non-expert readers to understand.

      Thanks for your feedback. We have now included a detailed description of how we identify type II NSCs in both wild-type and mutant clones. We have also added a representative Asense staining to clearly distinguish type 1 (Ase+) from type 2 (Ase-) NSCs (see Figure S1). In addition, we have added a resources table explaining the genotypes associated with each figure, which was omitted due to an error in the previous version of the manuscript.

      Reviewer #3 (Public review):

      Summary:

      The manuscript by Chaya and Syed focuses on understanding the link between cell cycle and temporal patterning in central brain type II neural stem cells (NSCs). To investigate this, the authors perturb the progression of the cell cycle by delaying the entry into M phase and preventing cytokinesis. Their results convincingly show that temporal factor expression requires progression of the cell cycle in both Type 1 and Type 2 NSCs in the Drosophila central brain. Overall, this study establishes an important link between the two timing mechanisms of neurogenesis.

      Strengths:

      The authors provide solid experimental evidence for the coupling of cell cycle and temporal factor progression in Type 2 NSCs. The quantified phenotype shows an all-or-none effect of cell cycle block on the emergence of subsequent temporal factors in the NSCs, strongly suggesting that both nuclear division and cytokinesis are required for temporal progression. The authors also extend this phenotype to Type 1 NSCs in the central brain, providing a generalizable characterization of the relationship between cell cycle and temporal patterning.

      We thank the reviewer for recognizing the robustness of our data linking the cell cycle to temporal progression.

      Weaknesses:

      One major weakness of the study is that the authors do not explore the mechanistic relationship between the cell cycle and temporal factor expression. Although their results are quite convincing, they do not provide an explanation as to why Cdk1 depletion affects Syp and EcR expression but not the onset of svp. This result suggests that at least a part of the temporal cascade in NSCs is cell-cycle independent, which isn't addressed or sufficiently discussed.

      Thank you for bringing up this important point. We are equally interested in uncovering the mechanism by which the cell cycle regulates temporal gene transitions; however, such mechanistic exploration is beyond the scope of the present study. Interestingly, while the temporal switching factor Svp is expressed independently of the cell cycle, the subsequent temporal transitions are not. We have expanded our discussion on this intriguing finding (page 9, line 307-315; lines 345-355 in tracked changes file). Specifically, we propose that svp activation marks a cell-cycle–independent phase, whereas EcR/Syp induction likely depends on cell-cycle–coupled mechanisms, such as mitosis-dependent chromatin remodeling or daughter-cell feedback. Although further dissection of this mechanism lies beyond the current study, our findings establish a foundation for future work aimed at identifying how developmental timekeeping is molecularly coupled to cell-cycle progression.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      (1) Figure 1 C and D, it would be better to put a question mark to indicate that these are hypotheses to be tested. 

      We appreciate this suggestion and have added question marks in Figure 1C and 1D to clearly indicate that these panels represent hypotheses under investigation.

      (2) Figure 2A-I, Figure 4A-I, Figure 5A-I and K-S, in addition to enlarged views of single type II neuroblasts, it would be more convincing to include zoomed-out images of the entire larval brain or at least a portion of the brain to include neighboring wild-type type II neuroblasts as internal controls. Also, it would be ideal to show EcR staining from the same neuroblasts as IMP and Syp staining. 

      We thank the reviewer for this valuable input. In our imaging setup, the number of available antibody channels was limited to four (anti-Ase, anti-GFP, anti-Syp, and anti-Imp). Adding EcR in the same sample was therefore not technically possible, so we performed EcR staining separately.

      (3) The authors cited "Syed et al., 2024" (in the middle of the right column on page 5), but this reference is missing in the "References" section and should be added. 

      The missing citation has been added to the reference section.  

      (4) It would be better to include Ase staining in the relevant figure to indicate neuroblast identity as type I or type II. 

      We agree and now include representative Ase staining for both type 1 and type 2 NSC clones in Figure S1, along with corresponding text updates that describe these markers.

      Reviewer #2 (Recommendations for the authors): 

      Major comments 

      (1) The present conclusion relies on the results using Cdk1 RNAi and pav RNAi. It is still possible that Cdk1 and Pav are involved in the regulation of temporal patterning independent of the regulation of cell cycle or cytokinesis, respectively. To avoid this possibility, the authors need to inhibit cell cycle progression or cytokinesis in another alternative manner. 

      We thank the reviewer for raising this important point. While we cannot completely exclude gene-specific, cell-cycle-independent roles for Cdk1 or Pav, we observe consistent phenotypes across several independent manipulations that slow or block the cell cycle. In addition, earlier studies using orthogonal approaches that delay G1/S (Dacapo/Rbf) or impair mitochondrial OxPhos (which lengthens G1/S; van den Ameele & Brand, 2019) produced similar temporal delays. These concordant phenotypes make altered cell-cycle progression, rather than a specific role of a single gene, the most parsimonious explanation for the defect. We have clarified this reasoning in the discussion section on pages 8-9, lines 293-305 (lines 311-343 in tracked changes file).

      (2) To reach the present conclusion, the authors need to address the effects of acceleration of cell cycle progression or cytokinesis on temporal patterning. 

      We thank the reviewer for this insightful suggestion. To our knowledge, there are currently no established genetic tools that can specifically accelerate cell-cycle progression in Drosophila neuroblasts. However, our results demonstrate that blocking the cell cycle impairs the transition from early to late temporal gene expression. These findings suggest that proper cell-cycle progression is essential for the transition from early to late temporal identity in neuroblasts.

      Minor comments 

      (3) P3L2 (right), ... we blocked the NSC cell cycle...

      How did they do it? 

      Which fly lines were used?

      Why did they use the line? 

      These details are now included in the Materials and Methods and the Resource Table (pages 11-13). We used Wor-Gal4, Ase-Gal80 to drive UAS-Cdk1RNAi and UAS-pavRNAi in type 2 NSCs. 

      (4) P5L1(left), ... we used the flip-out approach...

      Why did they conduct it? 

      Probably, the authors have reasons other than "to further ensure." 

      We have clarified in the text on page 4, lines 137-139, that the flip-out approach was used to generate random single-cell clones, enabling quantitative analysis of type 2 NSCs within an otherwise wild-type brain. 

      (5) P5L8(left), ... type 2 hits were confirmed by lack of the type 1 Asense...  The authors must examine Deadpan (Dpn) expression as well, because there are many Asense (Ase)-negative cells in the brain (neurons, glial cells, and neuroepithelial cells). 

      Type II NSCs can be identified as Dpn+/Ase- cells.

      We agree that Dpn is a helpful marker. However, we reliably distinguished type II NSCs by their lack of Ase and larger cell size relative to surrounding neurons and glia, which are smaller in size and located deeper within the clone. These differences, together with established lineage patterns, allow unambiguous identification of type 2 NSCs across all genotypes. We have now added representative type I and type 2 NSC clones to the supplemental figure S1 (E-G’) with Asense stains to demonstrate how we differentiate type I from type II NSCs. 

      (6) P5L32(left), To do this, we induced... 

      This sentence should be made more concise.

      Please rephrase it. 

      The sentence has been rewritten for clarity and concision.

      (7)  P5L42(left), ...lack of EcR/Syp expression (Figure 2).  However, EcR expression is still present (Figure 2I). 

      In some large pavRNAi clones, a weak EcR signal can be observed near the cell membrane; however, none of the nuclear compartments—where EcR is typically localized—show detectable staining. We selected a representative nuclear image for the figure and addressed this observation on page 8, lines 283-291 (lines 301-309 in tracked changes file).

      (8) P7L29(left), ......had persistent Imp expression...

      Imp expression is faint compared to that in Figure 2G.

      The differences between Figures 2G and 3G should be discussed. 

      We thank the reviewer for this comment. We have added a note in the Methods section clarifying that brightness and contrast were adjusted per panel for optimal visualization; thus, apparent differences in signal intensity do not reflect biological variation. Fluorescence intensity for each neuroblast was normalized to the mean intensity of neighboring wild-type neuroblasts imaged in the same field. A neuroblast was considered Imp-positive when its normalized nuclear intensity was at least 2× the local background. This scoring criterion was applied uniformly across all genotypes and time points. All quantifications were performed on the raw LSM files in Fiji prior to assembling the figure panels.
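The scoring rule described above can be sketched in a few lines. This is only an illustration and not the authors' Fiji workflow: the function names and the 2.0 default are hypothetical labels for the stated criterion, and the sketch assumes that the local background is expressed on the same wild-type-normalized scale as the nuclear signal, which the response does not spell out.

```python
from statistics import mean


def normalize_to_wild_type(raw_intensity, wild_type_intensities):
    """Divide a raw intensity by the mean intensity of neighboring
    wild-type neuroblasts imaged in the same field (hypothetical helper)."""
    reference = mean(wild_type_intensities)
    if reference <= 0:
        raise ValueError("wild-type reference intensity must be positive")
    return raw_intensity / reference


def is_imp_positive(nuclear_raw, background_raw, wild_type_intensities,
                    fold_threshold=2.0):
    """Score a neuroblast as Imp-positive when its normalized nuclear
    intensity is at least `fold_threshold` times the local background.

    Assumption: the background is normalized with the same wild-type
    reference as the nuclear signal.
    """
    nuclear = normalize_to_wild_type(nuclear_raw, wild_type_intensities)
    background = normalize_to_wild_type(background_raw, wild_type_intensities)
    return nuclear >= fold_threshold * background
```

Under this assumption the wild-type reference cancels in the comparison, so the normalized value mainly serves to make intensities comparable between fields when they are reported.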

      (9) P8 (Figure 5)

      The Imp expression is faint compared to that in Figure 5Q.

      The difference between Figure 5G and 5Q should be discussed further. 

      As mentioned above, we have clarified our image processing approach in the Methods section to explain any differences in signal appearance between these figures.

      (10) P10 Materials and Methods

      The authors did not mention the fly lines used. This is very important for the readers. 

      We thank the reviewer for bringing this oversight to our attention. The Resource Table was inadvertently omitted from the initial submission. The complete list of fly lines and reagents used in this study is now provided in the updated Resource Table.

      Reviewer #3 (Recommendations for the authors): 

      Major points 

      (1) The authors mention that the heat-shock induction at 42ALH is well after the svp temporal window and therefore the cell cycle block independently affects Syp and EcR expression. However, Figure 3 shows svp-LacZ expression at 48ALH. If svp expression is indeed transient in Type 2 NSCs, then this must be validated using an immunostaining of the svp-LacZ line with svp antibody. This is crucial as the authors claim that cell cycle block does not affect svp expression and is required independently. 

      We thank the reviewer for bringing this important issue to our attention. As noted, Svp protein is expressed transiently and stochastically in type 2 NSCs (Syed et al., 2017), making direct antibody quantification challenging upon cell cycle block. Consistent with previous work (Syed et al., 2017), we used the svp-LacZ reporter line to visualize stabilized Svp expression, which reliably captures Svp expression in type 2 NSCs (Syed et al., 2017 https://doi.org/10.7554/eLife.26287, and Dhilon et al., 2024 https://doi.org/10.1242/dev.202504).

      (2) The authors have successfully slowed down the cell cycle and showed that it affects temporal progression. However, a converse experiment where the cell cycle is sped up in NSCs would be an important test for the direct coupling of temporal factor expression and cell cycle, wherein the expectation would be the precocious expression of late temporal factors in faster cycle NSCs. 

      We agree that such an experiment would be ideal. However, as noted above (Reviewer #2 comment 2), to our knowledge, no suitable tools currently exist to accelerate neuroblast cell-cycle progression without pleiotropic effects.

      Minor point 

      The authors must include Ray and Li (https://doi.org/10.7554/eLife.75879) in the references when describing that "...cell cycle has been shown to influence temporal patterning in some systems,...".  

      We thank the reviewer for this helpful suggestion. The cited reference (Ray and Li, eLife, 2022) has now been included and appropriately referenced in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews: 

      Reviewer #1 (Public review): 

      Petrovic et al. investigate CCR5 endocytosis via arrestin 2, with a particular focus on clathrin and AP2 contributions. The study is thorough and methodologically diverse. The NMR titration data clearly demonstrate chemical shift changes at the canonical clathrin-binding site (LIELD), present in both the 2S and 2L arrestin splice variants. 

      To assess the effect of arrestin activation on clathrin binding, the authors compare: truncated arrestin (1-393), full-length arrestin, and 1-393 incubated with CCR5 phosphopeptides. All three bind clathrin comparably, whereas controls show no binding. These findings are consistent with prior crystal structures showing peptide-like binding of the LIELD motif, with disordered flanking regions. The manuscript also evaluates a non-canonical clathrin binding site specific to the 2L splice variant. Though this region has been shown to enhance beta2-adrenergic receptor binding, it appears not to affect CCR5 internalization. 

      Similar analyses applied to AP2 show a different result. AP2 binding is activation-dependent and influenced by the presence and level of phosphorylation of CCR5-derived phosphopeptides. These findings are reinforced by cellular internalization assays. 

      In sum, the results highlight splice-variant-dependent effects and phosphorylation-sensitive arrestin-partner interactions. The data argue against a (rapidly disappearing) one-size-fits-all model for GPCR-arrestin signaling and instead support a nuanced, receptor-specific view, with one example summarized effectively in the mechanistic figure.

      We thank the referee for this positive assessment of our manuscript. Indeed, by stepping away from the common receptor models for understanding internalization (b2AR and V2R), we revealed the phosphorylation level of the receptor as a key factor in driving the sequestration of the receptor from the plasma membrane. We hope that the proposed mechanistic model will aid further studies to obtain an even more detailed understanding of forces driving receptor internalization.

      Weaknesses: 

      Figure 1 shows regions of the AlphaFold model that are intrinsically disordered without making it clear that this is not an expected stable position. The authors' NMR titration data are n=1. Many figure panels require that readers pinch and zoom to see the data.

      In the “Recommendations for the Authors” section, we addressed the reviewer’s stated weaknesses. In short, for the AlphaFold representation in Figure 1A, we added explicit labeling and revised the legend and main text to clearly state that the depicted loops are intrinsically disordered, absent from crystal structures due to flexibility, and shown only for visualization of their location. We also clarified that the NMR titration experiments inherently have n = 1 due to technical limitations, and that this is standard practice in the field, while ensuring individual data points remain visible. The supplementary NMR figures now have more vibrant coloring, allowing easier data assessment. However, we have not changed the zooming of the microscopy and NMR spectra. We believe that the presentation of microscopy data, which already show zoomed-in regions of interest, follows standard practices in the field. Furthermore, we strongly believe that we should display full NMR spectra in the supplementary figures to allow the reader to assess the overall quality and behavior. As indicated previously, the reader can zoom in to very high resolution, since the spectra are provided as vector graphics. Zoomed regions of the relevant details are provided in the main figures.

      Reviewer #2 (Public review): 

      Summary: 

      Based on extensive live cell assays, SEC, and NMR studies of reconstituted complexes, these authors explore the roles of clathrin and the AP2 protein in facilitating clathrin mediated endocytosis via activated arrestin-2. NMR, SEC, proteolysis, and live cell tracking confirm a strong interaction between AP2 and activated arrestin using a phosphorylated C-terminus of CCR5. At the same time a weak interaction between clathrin and arrestin-2 is observed, irrespective of activation. 

      These results contrast with previous observations of class A GPCRs and the more direct participation by clathrin. The results are discussed in terms of the importance of short and long phosphorylated bar codes in class A and class B endocytosis. 

      Strengths: 

      The 15N,1H and 13C,methyl TROSY NMR and assignments represent a monumental amount of work on arrestin-2, clathrin, and AP2. Weak NMR interactions between arrestin-2 and clathrin are observed irrespective of activation of arrestin. A second interface, proposed by crystallography, was suggested to be a possible crystal artifact. NMR establishes realistic information on the clathrin and AP2 affinities to activated arrestin with both kD and description of the interfaces.

      We sincerely thank the referee for this encouraging evaluation of our work and appreciate the recognition of the NMR efforts and insights into the arrestin–clathrin–AP2 interactions.

      Weaknesses: 

      This reviewer has identified only minor weaknesses with the study. 

      (1) I don't observe two overlapping spectra of Arrestin2 (1-393) +/- CLTC NTD in Supp Figure 1

      We believe the referee is referring to Figure 1 – figure supplement 2. We have now made the colors of the spectra more vibrant and used different contouring to make the differences between the two spectra clearer. The spectra are provided as vector graphics, which allows zooming in to the very fine details.

      (2) Arrestin-2 1-418 resonances all but disappear with CCR5pp6 addition. Are they recovered with Ap2Beta2 addition, and is this what is shown in Supp Fig 2D?

      We believe the reviewer is referring to Figure 3 - figure supplement 1. In this figure, panels E and F show that the resonances of arrestin2<sup>1-418</sup> (apo state shown with black outline) disappear upon the addition of CCR5pp6 (arrestin2<sup>1-418</sup>•CCR5pp6 complex spectrum in red). Panels C and D show the resonances of arrestin2<sup>1-418</sup> (apo state shown with black outline), which remain unchanged upon addition of AP2b2<sup>701-937</sup> (orange), indicating no complex formation. We also recorded a spectrum of the arrestin2<sup>1-418</sup>•CCR5pp6 complex under addition of AP2b2<sup>701-937</sup> (not shown), but the arrestin2 resonances in the arrestin2<sup>1-418</sup>•CCR5pp6 complex were already too broad for further analysis. This had already been explained in the text.

      “In agreement with the AP2b2 NMR observations, no interaction was observed in the arrestin2 methyl and backbone NMR spectra upon addition of AP2b2 in the absence of phosphopeptide (Figure 3-figure supplement 1C, D). However, the significant line broadening of the arrestin2 resonances upon phosphopeptide addition (Figure 3-figure supplement 1E, F) precluded a meaningful assessment of the effect of the AP2b2 addition on arrestin2 in the presence of phosphopeptide”.

      (3) I don't understand how methyl TROSY spectra of arrestin2 with phosphopeptide could look so broadened unless there are sample stability problems?

      We thank the referee for this comment. We would like to clarify that, in general, line broadening beyond what is expected from the rotational correlation time does not necessarily indicate sample stability problems. It is rather evidence of intermediate conformational exchange on the micro- to millisecond time scale.

      The displayed <sup>1</sup>H-<sup>15</sup>N spectra of apo arrestin2 already suffer from line broadening due to such intrinsic mobility of the protein. These spectra were recorded with acquisition times of 50 ms (<sup>15</sup>N) and 55 ms (<sup>1</sup>H) and resolution-enhanced by a 60˚-shifted sine-bell filter for <sup>15</sup>N and a 60˚-shifted squared sine-bell filter for <sup>1</sup>H, respectively, which leads to the observed resolution with still reasonable sensitivity. The <sup>1</sup>H-<sup>15</sup>N resonances in Fig. 1b (arrestin2<sup>1-393</sup>) look particularly narrow. However, this region contains a large number of flexible residues. The full spectrum, e.g. Figure 1-figure supplement 2, shows the entire situation with a clear variation of linewidths and intensities. The linewidth variation becomes stronger when omitting the resolution enhancement filters.
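For readers less familiar with the resolution-enhancement filters mentioned above, the following minimal sketch generates a phase-shifted sine-bell window. It is not the processing script used for these spectra: the function name is hypothetical, and the sketch assumes the common convention in which the window starts at the stated phase shift and ends at 180° (zero) on the last time-domain point.

```python
import math


def shifted_sine_bell(n_points, shift_degrees=60.0, power=1):
    """Return a sine-bell apodization window of length `n_points`.

    The window phase runs linearly from `shift_degrees` at the first point
    to 180 degrees at the last point (assumed end point). Use power=1 for
    a sine-bell and power=2 for a squared sine-bell.
    """
    if n_points < 2:
        raise ValueError("need at least two points")
    start = math.radians(shift_degrees)
    step = (math.pi - start) / (n_points - 1)
    return [math.sin(start + i * step) ** power for i in range(n_points)]


# Windows analogous to those named in the text: sine-bell for 15N,
# squared sine-bell for 1H, both shifted by 60 degrees.
window_15n = shifted_sine_bell(128, shift_degrees=60.0, power=1)
window_1h = shifted_sine_bell(1024, shift_degrees=60.0, power=2)
```

Multiplying the time-domain signal point by point with such a window de-emphasizes the earliest points relative to the window maximum at 90°, which narrows lines at some cost in sensitivity.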

      The addition of the CCR5pp6 phosphopeptide does not change protein stability, which we assessed by measuring the melting temperature of arrestin2<sup>1-418</sup> and arrestin2<sup>1-418</sup>•CCR5pp6 complex (Tm = 57°C in both cases). We believe that the explanation for the increased broadening of the arrestin2 resonances is that addition of the CCR5pp6, possibly due to the release of the arrestin2 strand b20, amplifies the mentioned intermediate timescale protein dynamics. This results in the disappearance of arrestin2 resonances.

      We have now included the assessment of arrestin2<sup>1-418</sup> and arrestin2<sup>1-418</sup>•CCR5pp6 stability in the manuscript:

      “The observed line broadening of arrestin2 in the presence of phosphopeptide must be a result of increased protein motions and is not caused by a decrease in protein stability, since the melting temperature of arrestin2 in the absence and presence of phosphopeptide are identical (56.9 ± 0.1 °C)”.

      (4) At one point the authors added excess fully phosphorylated CCR5 phosphopeptide (CCR5pp6). Does the phosphopeptide rescue resolution of arrestin2 (NH or methyl) to the point where interaction dynamics with clathrin (CLTC NTD) are now more evident on the arrestin2 surface?

      Unfortunately, when we titrate arrestin2 with CCR5pp6 (please see Isaikina & Petrovic et al., Mol. Cell, 2023 for more details), the arrestin2 resonances undergo fast-to-intermediate exchange upon binding. In the presence of phosphopeptide excess, very few resonances remain, the majority of which are in the disordered region, including resonances from the clathrin-binding loop. Due to the peak overlap, we could not unambiguously assign arrestin2 resonances in the bound state, which precluded our assessment of the arrestin2-clathrin interaction in the presence of phosphopeptide. We have now made this clearer in the paragraph ‘The arrestin2-clathrin interaction is independent of arrestin2 activation’:

      “Due to significant line broadening and peak overlap of the arrestin2 resonances upon phosphopeptide addition, the influence of arrestin activation on the clathrin interaction could not be detected on either backbone or methyl resonances”.

      (5) Once phosphopeptide activates arrestin-2 and AP2 binds can phosphopeptide be exchanged off? In this case, would it be possible for the activated arrestin-2 AP2 complex to re-engage a new (phosphorylated) receptor?

      This would be an interesting mechanism. In principle, this should be possible as long as the other (phosphorylated) receptor outcompetes the initial phosphopeptide with higher affinity towards the binding site. However, we do not have experiments to assess this process directly. Therefore, we rather wish not to further speculate.

      (6) I'd be tempted to move the discussion of class A and class B GPCRs and their presumed differences to the intro and then motivate the paper with specific questions. 

      We appreciate the referee’s suggestion and had a similar idea previously. However, as we do not have data on other class-A or class-B receptors, we would rather not motivate the entire manuscript by this question.

      (7) Did the authors ever try SEC measurements of arrestin-2 + AP2beta2+CCR5pp6 with and without PIP2, and with and without clathrin (CLTC NTD)? The question becomes what the active complex is and how PIP2 modulates this cascade of complexation events in class B receptors.

      We thank the referee for this question. Indeed, we tested whether PIP2 can stabilize the arrestin2•CCR5pp6•AP2 complex by SEC experiments. Unfortunately, the addition of PIP2 increased the formation of arrestin2 dimers and higher oligomers, presumably due to the presence of additional charges. The resolution of SEC experiments was not sufficient to distinguish arrestin2 in oligomeric form or in arrestin2•CCR5pp6•AP2 complex. We now mention this in the text:

      “We also attempted to stabilize the arrestin2-AP2b2-phosphopeptide complex through the addition of PIP2, which can stabilize arrestin complexes with the receptor (Janetzko et al., 2022). The addition of PIP2 increased the formation of arrestin2 dimers and higher oligomers, presumably due to the presence of additional charges. Unfortunately, the resolution of the SEC experiments was not sufficient to separate the arrestin2 oligomers from complexes with AP2b2”.

      Reviewer #3 (Public review): 

      Summary: 

      Overall, this is a well-done study, and the conclusions are largely supported by the data, which will be of interest to the field. 

      Strengths: 

      Strengths of this study include experiments with solution NMR that can resolve high-resolution interactions of the highly flexible C-terminal tail of arr2 with clathrin and AP2. Although mainly confirmatory in defining the arr2 CBL376LIELD380 as the clathrin binding site, the use of the NMR is of high interest (Fig. 1). The 15N-labeled CLTC-NTD experiment with arr2 titrations reveals a span from 39-108 that mediates an arr2 interaction, which corroborates previous crystal data, but does not reveal a second area in CLTC-NTD that in previous crystal structures was observed to interact with arr2. 

      SEC and NMR data suggest that full-length arr2 (1-418) binding with the β2-adaptin subunit of AP2 is enhanced in the presence of CCR5 phospho-peptides (Fig. 3). The pp6 peptide shows the highest degree of arr2 activation and β2-adaptin binding, compared to less phosphorylated or non-phosphorylated peptides. It is interesting that the arr2 interaction with CLTC NTD and pp6 cannot be detected using the SEC approach, further suggesting that clathrin binding is not dependent on arrestin activation. Overall, the data suggest that receptor activation promotes arrestin binding to AP2, not clathrin, suggesting the AP2 interaction is necessary for CCR5 endocytosis. 

      To validate the solid biophysical data, the authors pursue validation experiments in a HeLa cell model by confocal microscopy. This requires transient transfection of tagged receptor (CCR5-Flag) and arr2 (arr2-YFP). CCR5 displays a "class B"-like behavior in that arr2 is rapidly recruited to the receptor at the plasma membrane upon agonist activation, which forms a stable complex that internalizes onto endosomes (Fig. 4). The data suggest that complex internalization is dependent on AP2 binding not clathrin (Fig. 5). 

      The addition of the antagonist experiment/data adds rigor to the study. 

      Overall, this is a solid study that will be of interest to the field.

      We thank the referee for the careful and encouraging evaluation of our work. We appreciate the recognition of the solidity of our data and the support for our conclusions regarding the distinct roles of AP2 and clathrin in arrestin-mediated receptor internalization.

      Recommendations for the authors: 

      Reviewer #1 (Recommendations for the authors): 

      I believe that the authors have made efforts to improve the accessibility to a broader audience. In a few cases, I believe that the authors' response either did not truly address the concern or made the problem worse. I am grouping these as 'very strong opinions' and 'sticking point'. 

      Very strong opinion 1: 

      While data presentation is somewhat at the authors' discretion, there were several figures where the presentation did not make the work approachable, including microscopy insets and NMR spectra. A suggestion to 'pinch and zoom' does not really address this. For the overlapping NMR spectra in supporting Figure 1, I actually -can- see this on zooming, but I did not recognize this on first pass because the colors are almost identical for the two spectra. This is an easy fix. Changing the presentation by coloring these distinctly would alleviate this. The Supplemental figure to Fig. 2 looks strange with pinch and zoom. But at the end of the day, data presentation where the reader is to infer that they must zoom in is not very approachable and may prevent readers from being able to independently assess the data. In this case, there doesn't seem to be a strong rationale to not make these panels easier to see at 100% size. 

      We appreciate the reviewer’s thoughtful comments regarding figure accessibility and agree that data presentation should be clear and interpretable without requiring readers to zoom in extensively. However, we must note that the presentation of the microscopy data follows standard practices in the field and that the panels already include zoomed-in regions, which enable easier access to key results and observations.

      Regarding the NMR data, we have revised Figure 1—figure supplement 2 and Figure 2— figure supplement 1 to match the presentation style of Figure 3—figure supplement 1, which the reviewer apparently found more accessible. We also made the colors of the spectra more vibrant, as the referee suggested. We would like to emphasize that it is absolutely necessary to display the full NMR spectra in order to allow independent assessment of signal assignment, data quality, and overall protein behavior. Zoomed regions of the relevant details are provided in the main figures.

      Very strong opinion 2: 

      The authors' response to the lack of individual data points and error bars is that this is an n=1 experiment. I do not believe this meets the minimum standard for best practices in the field.

      We respectfully disagree with the reviewer’s assessment. The figure displays individual data points, as it already did in the initial submission. Performing NMR titrations with isotopically labeled protein samples is inherently resource-intensive, and single-sample (n = 1) experiments are widely accepted and routinely reported in the field. Numerous studies have used the same approach, including Rosenzweig et al., Science (2013); Nikolaev et al., Nat. Methods (2019); and Hobbs et al., J. Biomol. NMR (2022), as well as our own recent work (Isaikina & Petrovic et al., Mol. Cell, 2023). These studies demonstrate that such NMR-based affinity measurements, even when performed on a single sample, are highly reproducible, precise, and consistent with orthogonal evidence and across different sample conditions.

      Sticking point:

      Figure 1A - the alphaFold model of arrestin2L depicts the disordered loops as ordered. The depiction is misleading at best, and inaccurate in truth. To use an analogy, what the authors depict is equivalent to publishing an LLM hallucination in the text. Unlike LLMs, alphaFold will actually flag its hallucination with the confidence (pLDDT) in the output. Both for LLMs and for alphaFold, we are spending much time teaching our students in class how to use computation appropriately - both to improve efficiency but also to ensure accuracy by removing hallucinations.

      The original review indicated that confidences needed to be shown and that this needed to be depicted in a way that clarifies that this is NOT a structural state of the loops. The newly added description ("The model was used to visualize the clathrin-binding loop and the 344-loop of the arrestin2 C-domain, which are not detected in the available crystal structures...) worsens the concern because it even more strongly implies that a 0 confidence computational output is a likely structural state. It also indicates that these regions were 'not detected' in crystal structures. These regions of arrestin are intrinsically disordered. AlphaFold (by its nature) must put out something in terms of coordinates, even if the pLDDT suggests that the region cannot be predicted or is not in a stable position, which is the case here. In crystal structures, these regions are not associated with interpretable electron density, meaning that coordinates are omitted in these regions because adding them would imply that under the conditions used, the protein adopts a low energy structural state in this region. This region is instead intrinsically disordered. 

      A good description of why showing disordered loops in a defined position is incorrect and how to instead depict disorder correctly is in Brotzakis et al. Nat communications 16, 1632 (2025) "AlphaFold prediction of structural ensembles of disordered proteins", where figures 3A, 4A, and 5A show one AlphaFold prediction colored by confidence and 3B, 4B and 5B are more accurate depictions of the structural ensemble. 

      Coming back to the original comment "The AlphaFold model could benefit from a more transparent discussion of prediction confidence and caveats. The younger crowd (part of the presumed intended readership) tends to be more certain that computational output is 'true'...." Right now, the authors are still showing in Fig 1A a depiction of arrestin with models for the loops that are untrue. They now added text indicating that these loops are visualized in an AlphaFold prediction and 'true' but 'not detected in crystal structures'. There is no indication in the text that these are intrinsically disordered. The lack of showing the pLDDT confidence and the lack of any indication that these are disordered regions is simply incorrect. 

      We appreciate the concern of the reviewer towards AlphaFold models. As NMR spectroscopists we are highly aware of intrinsic biomolecular motions. However, our AlphaFold2 model is used as a graphical representation to display the interaction sites of loops; it is not intended to depict the loops as fixed structural states. The flexibility of the loops had been clearly described in the main text before:

      “Arrestin2 consists of two consecutive (N- and C-terminal) β-sandwich domains (Figure 1A), followed by the disordered clathrin-binding loop (CBL, residues 353–386), strand b20 (residues 386–390), and a disordered C-terminal tail after residue 393”.

      and

      “Figure 1B depicts part of a <sup>1</sup>H-<sup>15</sup>N TROSY spectrum (full spectrum in Figure 1-figure supplement 2A) of the truncated <sup>15</sup>N-labeled arrestin2 construct arrestin2<sup>1-393</sup> (residues 1-393), which encompasses the C-terminal strand β20, but lacks the disordered C-terminal tail. Due to intrinsic microsecond dynamics, the assignment of the arrestin2<sup>1-393</sup> <sup>1</sup>H-<sup>15</sup>N resonances by triple resonance methods is largely incomplete, but 16 residues (residues 367-381, 385-386) within the mobile CBL could be assigned. This region of arrestin is typically not visible in either crystal or cryo-EM structures due to its high flexibility”.

      as well as in the legend to Figure 1:

      “The model was used to visualize the clathrin-binding loop and the 344-loop of the arrestin2 C-domain, which are not detected in the available crystal structures of apo arrestin2 [bovine: PDB 1G4M (Han et al., 2001), human: PDB 8AS4 (Isaikina et al., 2023)]. In the other structured regions, the model is virtually identical to the crystal structures”.

      We have now further added a label ‘AlphaFold2 model’ to Figure 1A and amended the respective Figure legend to

      “The model was used to visualize the clathrin-binding loop and the 344-loop of the arrestin2 C-domain, which are not detected in the available crystal structures of apo arrestin2 [bovine: PDB 1G4M (Han et al., 2001), human: PDB 8AS4 (Isaikina et al., 2023)] due to flexibility. In the other structured regions, the model is virtually identical to the crystal structures”.

      Reviewer #2 (Recommendations for the authors): 

      I appreciated the response by the authors to all of my questions. I have no further comments

      We thank the referee for the raised questions, which we believe have improved the quality of the manuscript.

    1. Author response:

      The following is the authors’ response to the current reviews.

      We sincerely thank all three reviewers for their constructive comments. We deeply appreciate the reviewers’ efforts in summarizing our study, highlighting its strengths, and providing constructive suggestions. To enhance the quality and clarity of our work, we plan to address the concerns raised by the reviewers.

      First, as Reviewer #1 suggested, we will note that clearer expression patterns of Wnt10b and Fgf2 may be detectable in scRNA-seq analyses at other stages, and we will also clarify that low-level signals of epithelial and CT/fibroblast markers outside their expected clusters may reflect technical bias. In addition, we agree with the reviewer’s point that our unsuccessful ISH experiments and the low abundance detected by RT-qPCR do not demonstrate absence of expression, and that conclusions from reanalyzing the Li et al. scRNA-seq dataset can depend strongly on analytical choices; therefore, while we focused on the 7 dpa sample because our RT-qPCR data suggested that Wnt10b and Fgf2 may be most enriched around the MB stage (the original study refers to 7 dpa as MB), we will explicitly acknowledge that analyzing a single time point—especially one with a low representation of epithelial cells—may yield incomplete or stage-biased interpretations, and that inclusion of additional time points could reveal clearer and potentially different expression patterns. We will also temper our wording regarding the inferred cellular sources to avoid over-interpretation based on the current data.

      Second, to mitigate the concerns raised by Reviewer #3 regarding the generalization of our conclusions to amputation-induced (normal) limb regeneration, we will cite previous studies in which the ALM was used as an alternative experimental system for studying limb regeneration (Nacu et al., 2016, Nature, PMID: 27120163; Satoh et al., 2007, Developmental Biology, PMID: 17959163). We are confident that our ALM-based data provide a reasonable basis for understanding limb regeneration. We agree that important questions remain, such as which cell populations express Wnt10b and Fgf2 and how endogenous WNT10B and FGF2 signals induce Shh expression in normal regeneration; these should be investigated in future studies to deepen our understanding of limb regeneration.

      We also appreciate Reviewer #2’s careful evaluation of the technical rigor and quantification. We have benefited from the reviewer’s earlier feedback, which guided revisions that improved the manuscript’s rigor and presentation.

      We are grateful for the reviewers’ insights and are confident that these revisions will significantly strengthen our manuscript.


      The following is the authors’ response to the original reviews.

      Recommendations for the authors:

      Reviewing Editor Comments:

      The authors should be commended for addressing this gap - how cues from the DV axis interact with the AP axis during limb regeneration. Overall, the concept presented in this manuscript is extremely interesting and could be of high value to the field. However, the manuscript in its current form is lacking a few important data and resolution to fully support their conclusions, and the following needs to be addressed before publication:

      (1) ISH data on Wnt10b and FGF2 from various regeneration time points are essential to derive the conclusion. Preferably multiplex ISH of Wnt10b/Fgf2/Shh or at least canonical ISH on serial sections to demonstrate their expression in dermis/epidermis and order of gene expression i.e. Shh is only expressed after expression of Wnt10b/FGF2. It would certainly help if this can also be shown in regular blastema.

      We are grateful for the constructive suggestion on assessing Wnt10b and Fgf2 expression during regular regeneration, and we agree that clarifying their expression patterns in regular blastemas is important for strengthening the conclusions of our study. Because we cannot currently ensure sufficient sensitivity with multiplex FISH in our laboratory, partly due to high background, we conducted conventional ISH on serial sections of regular blastemas at several time points (Fig. S5A). However, the expression patterns of Wnt10b and Fgf2 were not clear. To complement the ISH results, we performed RT-qPCR on microdissected dorsal and ventral halves of regular blastemas at the MB stage (Fig. S5B). We found that Wnt10b and Fgf2 were expressed at significantly higher levels in the dorsal and ventral halves, respectively, compared to the opposite half. This dorsal/ventral biased expression of Wnt10b/Fgf2 is consistent with our RNA-seq data. We further quantified expression levels of Wnt10b, Fgf2, and Shh across stages (intact, EB, MB, LB, and ED) and found that Wnt10b and Fgf2 peaked at the MB stage, whereas Shh peaked at the LB stage, which addresses the editor’s request regarding the order of gene expression (Fig. S5C). This temporal offset in upregulation supports our model. These results are now included in the revised manuscript (Line 294‒306).

      To identify the cell types expressing Wnt10b or Fgf2, we analyzed published single-cell RNA-seq data (7 dpa blastema (MB), Li et al., 2021). As a result, Fgf2 expression was observed in the mesenchymal cluster, whereas Wnt10b expression was observed in both mesenchymal and epithelial clusters (Fig. S6). However, because only a small fraction of cells expressed Wnt10b, the principal cellular source of WNT10B protein remains unclear. The apparent low abundance likely contributes to the weak ISH signals and reflects current technical limitations. In addition, Wnt10b and Fgf2 expression did not follow Lmx1b expression (Fig. S6J, K), and Wnt10b and Fgf2 themselves were not exclusive (Fig. S6L). These results are now included in the revised manuscript (Line 307‒321). Together with the RT-qPCR data (Fig. S5B), these results suggest that Wnt10b and Fgf2 are not exclusively confined to purely dorsal or ventral cells at the single-cell level, even though they show dorsoventral bias when assessed in bulk tissue. We therefore interpret Wnt10b/Fgf2 expression as mediated by, rather than restricted to, dorsal/ventral cells, and the coexistence of both signals should provide a permissive environment for Shh induction. Defining the precise spatial patterns of Wnt10b and Fgf2 in regular regeneration will therefore be an important goal for future work.

      (2) Validation of the absence of gene expression via qRT PCR in the given sample will increase the rigor, as suggested by reviewers.

      We thank the editor for this important suggestion and agree that validation by qRT-PCR increases the rigor of our study. Accordingly, we performed RT-qPCR on AntBL, PostBL, DorBL, and VentBL to corroborate the ISH results. The results are now included in Fig. 2. We also verified Shh expression following electroporation by RT-qPCR, and the quantitative results are now provided in Fig. 5.

      (3) Please increase n for experiments where necessary and mention n values in the figures.

      We thank the editor for this helpful comment and agree on the importance of providing sufficient sample sizes. Accordingly, we increased the n for the relevant experiments and have indicated the n values in the corresponding figure legends.

      (4) Most comments by all three reviewers are constructive and largely focus on improving the tone and language of the manuscript, and I expect that the authors should take care of them.

      We thank the reviewers for their constructive feedback on the tone and language of the manuscript. We have carefully revised the text according to each comment, and we hope these modifications have improved both clarity and readability.

      In addition, in revising the manuscript we also refined the conceptual framework. Our new analysis of Wnt10b and Fgf2 expression during normal regeneration suggests that these genes are not expressed in a strictly dorsal- or ventral-specific manner at the single-cell level. When these observations are considered together with (i) the RNA-seq comparison of dorsally and ventrally induced ALM blastemas, (ii) RT-qPCR of microdissected dorsal and ventral halves of regenerating blastemas, and (iii) the functional electroporation experiments, our interpretation is that Wnt10b and Fgf2 act as dorsal- and ventral-mediated signals, respectively: their production is regulated by dorsal or ventral cells, and the presence of both signals is required to induce Shh expression. Given these points, we now think our conclusion can be explained without using the confusing term “positional cue”. Because the distinction between “positional cue” and “positional information” could be confusing, as noted by the reviewers, we rewrote our manuscript without using “positional cue.”

      Reviewer #1 (Recommendations for the authors):

      (1) Line 61: More explanation for what a double-half limb means is needed.

      We thank the reviewer for this suggestion. We have revised the manuscript (Line 73‒76). Specifically, we now explain that a double-dorsal limb, for example, is a chimeric limb generated by excising the ventral half and replacing it with a dorsal half from the contralateral limb while preserving the anteroposterior orientation.

      (2) Line 63-65: "Such blastemas form hypomorphic, spike-like structures or fail to regenerate entirely." This statement does not represent the breadth of work on the APDV axis in limb regeneration. The cited Bryant 1976 reference tested only double-posterior and double-anterior newt limbs, demonstrating the importance of disposition along the AP axis, not DV. Others have shown that the regeneration of double-half limbs depends upon the age of the animal and the length of time between the grafting of double-half limbs and amputation. Also, some double-dorsal or double-ventral limbs will regenerate complete AP axes with symmetrical DV duplications (Burton, Holder, and Jesani, 1986). Also, sometimes half dorsal stylopods regenerate half dorsal and half ventral, or regenerate only half ventral, suggesting there are no inductive cues across the DV axis as there are along the AP axis. Considering this is the basis of the study under question, more is needed to convince that the DV axis is necessary for the generation of the AP axis.

      We thank the reviewer for this detailed and constructive comment. We acknowledge that previous studies have reported a range of outcomes for double-half limbs. For example, Burton et al. (1986) described regeneration defects in double-dorsal (DD) and double-ventral (VV) limbs, although limb patterning did occur in some cases (Burton et al., 1986, Table 1). As the reviewer notes, regenerative outcomes depend on variables such as animal age and the interval between construction of the double-half limb and amputation, sometimes called the effect of healing time (Tank and Holder, 1978). Moreover, variability has been reported not only in DD/VV limbs but also in double-anterior (AA) and double-posterior (PP) limbs (e.g., Bryant, 1976; Bryant and Baca, 1978; Burton et al., 1986). In the revised manuscript, we have therefore modified the statement to avoid over-generalization and to emphasize that regeneration can be incomplete under these conditions (Line 76‒82). Importantly, in order to provide the additional evidence requested and to directly re-evaluate whether dorsal and ventral cells are required for limb patterning, we performed the ALM experiments shown in Fig. 1. The ALM system allows us to assess this question in a binary manner (regeneration vs. non-regeneration), thereby strengthening the rationale for our conclusions regarding the necessity of the APDV orientations. We also revised a sentence at the beginning of the Results section to emphasize this point (Line 139‒140).

      (3) Line 71: "These findings suggest that specific signals from all four positional domains must be integrated for successful limb patterning, such that the absence of any one of them leads to failure." I was under the impression that half posterior limbs can grow all elements, but half anterior can only grow anterior elements.

      We thank the reviewer for this helpful clarification. As summarized by Stocum, half-limb experiments show that while some digit formation can occur, limb patterning remains incomplete in both anterior-half and posterior-half limbs in some cases (Stocum, 2017). We see this point as closely related to the broader question of whether proper limb patterning requires the integration of signals from all four positional domains. As noted in our response above, our ALM experiments in Fig. 1 were designed to test this point directly, and our data support the interpretation that cells from all four orientations are necessary for correct limb patterning.

      (4) Line 79-81: This is stated later in lines 98-105. I suggest expanding here or removing it here.

      We thank the reviewer for this suggestion. In the original version, lines 79–81 introduced our use of the terms “positional cue” and “positional information,” and this content partially overlapped with what later appeared in lines 98–105. In the revised manuscript, we have substantially rewritten this section (Line 82‒84), including the sentences corresponding to lines 79–81 in the original version, to remove the term “positional cue,” as explained in our response to the Editor’s comment (4). Our revision reflects new analyses indicating that Wnt10b and Fgf2 appear not to be strictly restricted to dorsal or ventral cell populations, and we now describe these factors as dorsal- or ventral-mediated signals that act across dorsoventral domains to induce Shh expression. Accordingly, we no longer maintain the original use of “positional cue” and “positional information.”

      (5) Line 92 - 93: "Similarly, an ALM blastema can be induced in a position-specific manner along the limb axes. In this case, the induced ALM blastema will lack cells from the opposite side." This sentence is difficult to follow. Isn't it the same thing stated in lines 88-90?

      We thank the reviewer for this comment. We revised the sentence to improve readability and to avoid redundancy with original Lines 88–90 (Line 104‒106).

      (6) Line 107: I think the appropriate reference is McCusker et al., 2014 (Position-specific induction of ectopic limbs in non-regenerating blastemas on axolotl forelimbs), although Vieira et al., 2019 can be included here. In addition, Ludolph et al 1990 should be cited.

      We thank the reviewer for this suggestion. We have added McCusker et al. (2014) and Ludolph et al. (1990) as references in the revised manuscript (Line 120‒121).

      (7) Line 107-109: A missing point is how the ventral information is established in the amniote limb. From what I remember, it is the expression of Engrailed 1, which inhibits the ventral expression of Wnt7a, and hence Lmx1b. This would suggest that there is no secreted ventral cue. This is a relatively large omission in the manuscript.

      We thank the reviewer for this comment. We agree that ventral fate in amniotes is specified by En1 in the ventral ectoderm, which represses Wnt7a and thereby prevents induction of Lmx1b; accordingly, a secreted ventral morphogen analogous to dorsal Wnt7a has not been established. We added this point to the revised Introduction (Line 61‒64).

      By contrast, in axolotl limb regeneration, our previous work on Lmx1b expression suggests that DV identities reflect the original positional identity rather than being re-specified during regeneration (Yamamoto et al., 2022). Within this framework, our original use of the term “ventral positional cue” does not imply a ventral patterning morphogen in the amniote sense; rather, it denotes downstream signals induced by cells bearing ventral identity that are required for the blastema to form a patterned limb. This interpretation is consistent with classic studies on double-half chimeras and ectopic contacts between opposite regions (Iten & Bryant, 1975; Bryant & Iten, 1976; Maden, 1980; Stocum, 1982) as well as with our ALM data (Fig. 1). For this reason, we intentionally used the term “positional cues” to refer to signals provided by cells bearing ventral identity, which can be considered separable from the DV patterning mechanism itself, in the original text. As explained in our response to the Editor’s comment (4), we describe these signals as “signals mediated by dorsal/ventral cells,” rather than “positional cues” in the revised manuscript.

      The necessity of dorsal- and ventral-mediated signals is supported by classic studies on the double-half experiment. In the non-regenerating cases, structural patterns along the anteroposterior axis appear to be lost even though both anterior and posterior cells should, in principle, be present in a blastema induced from a double-dorsal or double-ventral limb. In limb development of amniotes, Wnt7a/Lmx1b or En-1 mutants show that limbs can exhibit anteroposterior patterning even when tissues are dorsalized or ventralized, that is, in the relative absence of ventral or dorsal cells, respectively (Riddle et al., 1995; Chen et al., 1998; Loomis et al., 1996). Taken together, axolotl limb regeneration, in which the presence of both dorsal and ventral cells plays a role in anteroposterior patterning, should differ from other model organisms. It is therefore reasonable to predict the existence of dorsal- and ventral-mediated signals in axolotl limb regeneration. We included this point in the revised manuscript (Line 82‒89). However, there is no evidence that these signals are secreted molecules. For this reason, we have carefully used the term “dorsal-/ventral-mediated signals” in the Introduction without implying secretion.

      (8) Introduction - In general, the argument is a bit misleading. It is written as if it is known that a ventral cue is necessary, but the evidence from other animal models is lacking, from what I know. I may be wrong, but further argument would strengthen the reasoning for the study.

      We thank the reviewer for this thoughtful comment. We agree that it should not read as if it is known that a ventral cue is necessary. In the revised Introduction, we have addressed this in several ways. First, as described in our response to comment (7), we now explicitly note that in amniote limb development ventral identity is specified by En1-mediated repression of Wnt7a, and that a secreted ventral morphogen equivalent to dorsal Wnt7a has not been established. Second, we removed the term “positional cue” and no longer present “ventral positional cue” as a defined entity. Instead, we use mechanistic phrasing such as “signals mediated by ventral cells” and “signals mediated by dorsal cells,” which does not assume that such signals are secreted morphogens or universally conserved. Third, we have reframed the role of dorsal- and ventral-mediated signals as a working hypothesis specific to axolotl limb regeneration, rather than as a general conclusion across model systems.

      (9) Line 129: Remove "As mentioned before".

      We thank the reviewer for this suggestion. We have removed the phrase “As mentioned before” in the revised manuscript (Line 143).

      (10) Figure 1: Are Lmx1, Fgf8, and Shh mutually exclusive? Multiplexed FISH would provide this information, and is a relatively important question considering the strong claims in the study.

      We thank the reviewer for raising this important point. As noted in our response to the editor’s comment, we cannot currently ensure sufficiently high detection sensitivity with multiplex FISH in our laboratory. However, based on previous reports (Nacu et al., 2016), Fgf8 and Shh should be mutually exclusive. In contrast, our analysis suggests that Lmx1b expression is not mutually exclusive with either Fgf8 or Shh, at least at the level of their expression domains. To confirm this, we analyzed the published scRNA-seq data and added the results to Supplementary Figure 6. Fgf8 and Shh were expressed in both Lmx1b-positive and Lmx1b-negative cells (Fig. S6H, I), but Fgf8 and Shh themselves were mutually exclusive (Fig. S6M). This point is now included in the revised manuscript (Line 314‒317).

      (11) Results section and Figure 2: More evidence is needed for the lack of Shh expression ISH in tissue sections. Demonstrating the absence of something needs some qPCR or other validation to make such a claim.

      We thank the reviewer for this suggestion. We performed qRT-PCR on ALM blastemas to complement the ISH data (Fig. 2).

      (12) Line 179: I think they are likely leucistic d/d animals and not wild-type animals based upon the images.

      We thank the reviewer for this observation. In the revised manuscript, we have corrected the description to “leucistic animals” (Line 194).

      (13) Line 183-186: I'm a bit confused about this interpretation. If Shh turns on in just a posterior blastema, wouldn't it turn on in a grafted posterior tissue into a dorsal or ventral region? Isn't this independent of environment, meaning Shh turns on if the cells are posterior, regardless of environment?

      Our interpretation is that only posterior-derived cells possess the competency to express Shh. In other words, whether a cell is capable of expressing Shh depends on its original positional identity (Iwata et al., 2020), but whether it actually expresses Shh depends on the environment in which the cell is placed. The results of Fig. 3E and G indicate that Shh activation is dependent on environment and that the posterior identity is not sufficient to activate Shh expression. We have revised the manuscript to emphasize this distinction more clearly (Line 198‒203).

      (14) Figure 4: Do the limbs have an elbow, or is it just a hand?

      We thank the reviewer for this thoughtful question. From the appearance, an elbow-like structure can occasionally be seen; however, we did not examine the skeletal pattern in detail because all regenerated limbs used for this analysis were sectioned for the purpose of symmetry evaluation, and we therefore cannot state this conclusively. While this is indeed an important point, analyzing proximodistal patterning would require a very large number of additional experiments, which falls outside the main focus of the present study. For this reason, and also to minimize animal use in accordance with ethical considerations, we did not pursue further experiments here. In response to this point, we have added a description of the skeletal morphology of ectopic limbs induced by BMP2+FGF2+FGF8 bead implantation (Fig. 6). In these experiments, multiple ectopic limbs were induced along the same host limb. In most cases, these ectopic limbs did not show fusion with the proximal host skeleton, similar to standard ALM-induced limbs, although in one case we observed fusion at the stylopod level. We now note this observation in the revised manuscript (Line 347‒354).

      We regard the relationship between APDV positional information and proximodistal patterning as an important subject for future investigation.

      (15) Line 203 - 237: I appreciate the symmetry score to estimate the DV axis. Are there landmarks that would better suggest a double-dorsal or double-ventral phenotype, like was done in the original double-half limb papers?

      We thank the reviewer for this thoughtful comment. In most cases, the limbs induced by the ALM exhibit abnormal and highly variable morphologies compared to normal limbs, making it difficult to apply consistent morphological landmarks as used in the original double-half limb studies. For this reason, we focused our analysis on “morphological symmetry” as a quantitative measure of DV axis patterning, and we have added this explanation to the manuscript (Line 232‒235). Additionally, we provided transverse sections along the proximodistal axis as supplemental figures (Figs. S2 and S4). In addition to reporting the symmetry score, we have explicitly stated in the text that symmetry was also assessed by visual inspection of these sections.

      (16) Line 245-247: The experiment was done using bulk sequencing, so both the epithelium and mesenchyme were included in the sample. The posterior (Shh) and anterior (Fgf8) patterning cues are mesenchymally expressed. In amniotes, the dorsal cue has been thought to be Wnt7a from the epithelium. Can ISH, FISH, or previous scRNAseq data be used to identify genes expressed in the mesenchyme versus epithelium? This is very important if the authors want to make the claim for defining "The molecular basis of the dorsal and ventral positional cues" as was stated by the authors.

      We thank the reviewer for highlighting this important point. As the reviewer notes, our bulk RNA-seq data do not distinguish between epithelial and mesenchymal expression domains. As noted in our response to the editor’s comment, we performed ISH and qPCR on regular blastemas. However, these approaches did not provide definitive information regarding the specific cell types expressing Wnt10b and Fgf2. To complement this, we re-analyzed publicly available single-cell RNA-seq data (from Li et al., 2021). As a result, Fgf2 was expressed mainly by the mesenchymal cells, and Wnt10b expression was observed in both mesenchymal and epithelial cells. These results are now included in the revised manuscript (Line 294‒321) and in supplemental figures (Fig. S6, S7).

      (17) Was engrailed 1, lmx1b, or Wnt7a differentially expressed along the DV axis, suggesting similar signaling between? Are these expressed in mesenchyme? Previous work suggests Wnt7a is expressed throughout the mesenchyme, but publicly available scRNAseq suggests that it is expressed in the epithelium.

      We thank the reviewer for this important comment. As noted, the reported expression patterns of DV-related genes are not consistent across studies, which likely reflects the technical difficulty of detecting these genes with high sensitivity. In our own experiments, expression of DV markers other than Lmx1b has been very weak or unclear by ISH. Whether these genes are expressed in the epithelium or mesenchyme also appears to vary depending on the detection method used. In our RNA-seq dataset, Wnt7a expression was detected at very low levels and showed no significant difference along the DV axis, while En1 expression was nearly absent. We have clarified these results in the revised manuscript (Line 437‒441). Our reanalysis of the published scRNA-seq likewise detected Wnt7a in only a very small fraction of cells. Accordingly, we consider it premature to reach a definitive conclusion—such as whether Wnt7a is broadly mesenchymal or restricted to epithelium—as suggested in prior reports. We also note that whether Wnt7a is epithelial or mesenchymal does not affect the conclusions or arguments of the present study. Although the roles of Wnt7a and En1 in axolotl DV patterning are certainly important, we feel that drawing a definitive conclusion on this issue lies beyond the scope of the present study, and we have therefore limited our description to a straightforward presentation of the data.

      (18) Line 247-249: The sentence suggests that all the ligands were tried. This should be included in the supplemental data.

      We thank the reviewer for this clarification. In fact, we tested only Wnt4, Wnt10b, Fgf2, Fgf7, and Tgfb2, and all of these results are presented in the figures. To avoid misunderstanding, we have revised the text to explicitly state that our analysis focused on these five genes (Line 272‒274).

      (19) Line 249: An n =3 seems low and qPCR would be a more sensitive means of measuring gene induction compared to ISH. The ISH would confirm the qPCR results. Figure 5C is also not the most convincing image of Shh induction without support from a secondary method.

      We have increased the sample size for these experiments (Line 277‒280). In addition, to complement the ISH results, we confirmed Shh induction by qPCR following electroporation of Wnt10b and Fgf2 (Fig. 5D, E). Furthermore, because the Shh signal in the Wnt10b-electroporated VentBL images was particularly weak and difficult to discern, we replaced that panel with a representative example in which the Shh signal is more clearly visible. These data are now included in the revised manuscript (Line 280‒282).

      (20) Line 253: It is confusing why Wnt10b, but not Wnt4 would work? As far as I know, both are canonical Wnt ligands. Was Wnt7a identified as expressed in the RNAseq, but not dorsally localized? Would electroporation of Wnt7a do the same thing as Wnt10b and hence have the same dorsalizing patterning mechanisms as amniotes?

      We thank the reviewer for raising this challenging but important question. Wnt10b was identified directly from our bulk RNA-seq analysis, as was Wnt4. The difference in the ability of Wnt10b and Wnt4 to induce Shh expression in VentBL may reflect differences in how these ligands activate downstream WNT signaling programs. WNT10B is a potent activator of the canonical WNT/β-catenin pathway (Bennett et al., 2005), although WNT10B has also been reported to trigger a β-catenin–independent pathway (Lin et al., 2021). By contrast, WNT4 can signal through both canonical and non-canonical (β-catenin–independent) pathways, and the balance between these outputs is known to depend on cellular context (Li et al., 2013; Li et al., 2019). Consistent with a requirement for canonical WNT signaling, we found that pharmacological activation of canonical WNT signaling with BIO (a GSK3 inhibitor) was also sufficient to induce Shh expression in VentBL. Nevertheless, it is still unclear why Wnt10b, but not Wnt4, was able to induce Shh under our experimental conditions. One possible explanation is that different WNT ligands can engage the same receptors (e.g., Frizzled/LRP6) yet drive distinct downstream transcriptional programs, resulting in ligand-specific outputs; this may depend on the state of the responding cells, as Voss et al. predicted (Voss et al., 2025). This point is now included in the revised discussion section (Line 402‒412). At present, we cannot distinguish between these possibilities experimentally, and we therefore refrain from making a stronger mechanistic claim.

      With respect to Wnt7a, we detected Wnt7a expression at very low levels, and without a clear dorsoventral bias, in our RNA-seq analysis of ALM blastemas (we describe this point in Line 437‒440). This is consistent with previous work suggesting that axolotl Wnt7a is not restricted to the dorsal region in regeneration. Because of this low and unbiased expression, and because our data already implicated Wnt10b as a dorsal-mediated signal that can act across dorsoventral domains to permit Shh induction, we did not prioritize Wnt7a electroporation in the present study. We therefore cannot conclude whether Wnt7a would behave similarly to Wnt10b in this context.

      Importantly, these uncertainties about ligand-specific mechanisms do not alter our main conclusion. Our data support the idea that a dorsal-mediated WNT signal (represented here by WNT10B and canonical WNT activation) and a ventral-mediated FGF signal (FGF2) must act together to permit Shh induction, and that the coexistence of these dorsal- and ventral-mediated signals is required for patterned limb formation in axolotl limb regeneration.

      (21) Is canonical Wnt signaling induced after electroporation of Wnt10b or Wnt4? qPCR of Lef1 and axin is the most common way of showing this.

      We thank the reviewer for this helpful suggestion. In addition to examining Shh expression, we also assessed canonical WNT signaling by qPCR analysis of Axin2 and Lef1 following Wnt10b electroporation. The data are now included in Fig. 5.

      (22) Line 255-256: qPCR was presented for Figure 5D, but ISH was used for everything else. Is there a technical reason that just qPCR was used for the bead experiments?

      We thank the reviewer for this helpful comment. In the original submission, our goal was to test whether treatment with commercial FGF2 protein or BIO could reproduce the results obtained by electroporation. In the revised manuscript, to avoid confusion between distinct experimental aims, we removed the FGF2–bead data from this section and instead used RT-qPCR to quantitatively corroborate Shh induction after electroporation (Fig. 5D–E). RT-qPCR provided a sensitive, whole-blastema readout and allowed a paired design (left limb: factor; right limb: GFP control) that increased statistical power while minimizing animal use. To address the reviewer’s point more directly, we additionally performed ISH for the BIO treatment and now include those results in Supplementary Figure 3 (Line 287‒288).

      (23) Line 261-263: The authors did not show where Wnt10B or Fgf2 is expressed in the limb as claimed. The RNAseq was bulk, so ISH of these genes is needed to make this claim. Where are Wnt10b and Fgf2 expressed in the amputated limb? Do they show a dorsal (Wnt10b) and ventral (Fgf2) expression pattern?

      We thank the reviewer for raising this important point. As noted in our response to the editor’s comment, we performed ISH on serial sections of regular blastemas at several time points (Fig. S5A). However, the expression patterns of Wnt10b and Fgf2 along the dorsoventral axis were not clear. To complement the ISH results, we performed RT-qPCR on microdissected dorsal and ventral halves of regular blastemas at the MB stage (Fig. S5B). We found that Wnt10b and Fgf2 were expressed at significantly higher levels in the dorsal and ventral halves, respectively, compared to the opposite half. This dorsal/ventral biased expression of Wnt10b/Fgf2 is consistent with our RNA-seq data. To identify the cell types expressing Wnt10b or Fgf2, we analyzed published single-cell RNA-seq data (7 dpa blastema (MB), Li et al., 2021). As a result, Fgf2 expression was observed in the mesenchymal cluster, whereas Wnt10b expression was observed in both mesenchymal and epithelial clusters (Fig. S6). However, because only a small fraction of cells expressed Wnt10b, the principal cellular source of WNT10B protein remains unclear. The apparent low abundance likely contributes to the weak ISH signals and reflects current technical limitations. In addition, Wnt10b and Fgf2 expression did not follow Lmx1b expression (Fig. S6J, K), and Wnt10b and Fgf2 themselves were not exclusive (Fig. S6L). Together with the RT-qPCR data (Fig. S5B), these results suggest that Wnt10b and Fgf2 are not exclusively confined to purely dorsal or ventral cells at the single-cell level, even though they show dorsoventral bias when assessed in bulk tissue, suggesting that Wnt10b/Fgf2 expression is not dorsal-/ventral-specific but mediated by dorsal/ventral cells. Defining the precise spatial patterns of Wnt10b and Fgf2 in regular regeneration will therefore be an important goal for future work. These points are now included in the revised manuscript (Line 485‒501).

      (24) Line 266-288: The formation of multiple limbs is impressive. Do these new limbs correspond to the PD location they are generated?

      We thank the reviewer for this interesting question. From our observations, there does appear to be a tendency for the induced limbs to vary in length depending on their PD location. The skeletal patterns of the induced multiple limbs are now included in Fig. 6. However, as noted earlier, the supernumerary limbs exhibit highly variable morphologies, and a rigorous analysis of PD correlation would require a large number of induced limbs. Since this lies outside the main focus of the present study, we have not pursued this point further in the manuscript.

      (25) Line 288: The minimal requirement for claiming the molecular basis for DV signaling was identified is to ISH or multiplexed FISH for Wnt10b and Fgf2 in amputated limb blastemas to show they are expressed in the mesenchyme or epithelium and are dorsally and ventrally expressed, respectively. In addition, the current understanding of DV patterning through Wnt7a, Lmx1b, and En1 shown not to be important in this model.

      We thank the reviewer for this comment and fully agree with the point raised. We would like to clarify that we are not claiming to have identified the molecular basis of DV patterning. As the reviewer notes, molecules such as Lmx1b, Wnt7a, and En1 are well identified in other animal models as key regulators of DV positional identity. There is no doubt that these molecules play central roles in DV patterning. However, in axolotl limb regeneration, clear DV-specific expression has not been demonstrated for these genes except for Lmx1b. Therefore, further studies will be required to elucidate the molecular basis of DV patterning in axolotls.

      Our focus here is more limited: we aim to identify the molecular basis for the mechanisms in which positional domain-mediated signals (FGF8, SHH, WNT10B, and FGF2) regulate the limb patterning process, rather than the molecular basis of DV patterning. In fact, our results on Wnt10b and Fgf2 suggest that these genes did not affect dorsoventral identities.

      We recognize that this distinction was not sufficiently clear in the original text, and we have revised the manuscript to describe DV patterning mechanisms in other animals and clarify that the dorsal- and ventral-mediated signals are distinct from DV patterning (Line 444‒450). In particular, we now avoid claiming that the molecular basis for DV signaling was identified.

      (26) Line 335: References are needed for this statement. From what I found, Wnt4 can be canonical or non-canonical.

      We thank the reviewer for this helpful comment. We have revised the manuscript (Line 404‒407), adding citations at the relevant location and adjusting nearby wording to avoid implying pathway exclusivity, in alignment with our response to comment (20).

      (27) Line 337-338: The authors cannot claim "that canonical, but not non-canonical, WNT signaling contributes to Shh induction" as this was not thoroughly tested and is based upon the negative result that Wnt4 electroporation did not induce Shh expression.

      We thank the reviewer for this important clarification. We agree that our data do not allow us to conclude that non-canonical WNT signaling in general does not contribute to Shh induction. Accordingly, we have removed the phrase “but not non-canonical” and revised the text to emphasize that, within the scope of our experiments, Shh induction was not observed following Wnt4 electroporation, whereas it was observed with Wnt10b.

      (28) Line 345: To claim that "WNT10B via the canonical WNT pathway...appears to regulate Shh expression", at least qPCR is needed to show that WNT10B induces canonical signaling.

      We thank the reviewer for this comment. As noted in our response to comment (21), we also assessed canonical WNT signaling by qPCR analysis of Axin2 and Lef1 following Wnt10b electroporation (Line 282‒285).

      (29) Lines 361-372: A few studies have been performed on DV patterning of the mouse digit regeneration in regards to Lmx1b and En1. It may be good to discuss how the current study aligns with these findings.

      We appreciate the reviewer’s suggestion. As the reviewer notes, several studies have been performed on dorsoventral (DV) patterning in mouse digit tip regeneration in relation to Lmx1b and En1 (e.g., Johnson et al., 2022; Castilla-Ibeas et al., 2023). However, the main conclusion of the present study differs in scope from that of the studies on mouse digit tip regeneration. We show that, in the axolotl, pre-existing dorsal and ventral identities (as reflected by dorsally derived and ventrally derived cells in the ALM blastema) are required together to induce Shh expression, and that this Shh induction in turn supports anteroposterior interaction at the limb level. This mechanism, in which dorsal-mediated and ventral-mediated signals act in combination to permit Shh expression, does not have a clear direct counterpart in the mouse digit tip literature. Moreover, even with respect to Lmx1b, the two systems behave differently. In mouse digit tip regeneration, loss of Lmx1b during regeneration does not grossly affect DV morphology of the regenerate (Johnson et al., 2022). By contrast, in our axolotl ALM system, the presence or absence of Lmx1b-positive dorsal tissue correlates with the final dorsoventral organization of the induced limb-like structures (e.g., production of double-dorsal or double-ventral symmetric structures in the absence of appropriate dorsoventral contact). Thus, the role of dorsoventral identity in our model is directly tied to patterned limb outgrowth at the whole-limb scale, whereas in the mouse digit tip it has been reported primarily in the context of digit tip regrowth and bone regeneration competence, not robust DV repatterning (Johnson et al., 2022).

      For these reasons, we believe that an extended discussion of mouse digit tip regeneration would risk implying a mechanistic equivalence between axolotl limb regeneration and mouse digit tip regeneration that is not supported by current data. Because the regenerative contexts differ, and because Lmx1b does not appear to re-establish DV patterning in the mouse regenerates (Johnson et al., 2022), we have chosen not to include an explicit discussion of mouse digit tip regeneration in the main text.

      (30) Line 408-433: Although I appreciate generating a model, this section takes some liberties to tell a narrative that is not entirely supported by previous literature or this study. For example, lines 415-416 state "Wnt10b and Fgf2 are expressed at higher levels in dorsal and the ventral blastemal cells, respectively" which were not shown in the study or other studies.

      We thank the reviewer for this important comment. We agree that the original model based on RNA-seq data overstated the evidence. To address this point experimentally, we examined Wnt10b and Fgf2 expression in regular blastemas (Supplemental Figures 5 and 6). Accordingly, our model is now framed as an inductive mechanism for Shh expression, supported by results in ALM (WNT10B in VentBL; FGF2 in DorBL) and by DV-biased expression. Concretely, the sentence previously paraphrased as “Wnt10b and Fgf2 are expressed at higher levels in dorsal and ventral blastemal cells, respectively” has been replaced with wording that (i) avoids single-cell DV specificity and (ii) emphasizes dorsal-/ventral-mediated regulation and the requirement for both signals to allow Shh induction (Line 510‒511).

      Reviewer #2 (Recommendations for the authors):

      (1) Introduction:

      The authors' definitions of positional cues vs positional information are a little hard to follow, and do not appear to be completely accurate. From my understanding of what the authors explain, "positional information" is defined as a signal that generates positional identities in the regenerating tissue. This is a somewhat different definition than what I previously understood, which is the intrinsic (likely epigenetic) cellular identity associated with specific positional coordinates. On the other hand, the authors define "positional cues" as signals that help organize the cells according to the different axes, but don't actually generate positional identities in the regenerating cells. The authors provide two examples: Wnt7a as an example of positional information, and FGF8 as a positional cue. I think that according to the authors' definitions, FGF8 (and probably Shh) are bona fide positional cues, since both signals work together to organize the regenerating limb cells - yet do not generate positional identities, because ectopic limbs formed from blastemas where these pathways have been activated do not regenerate (Nacu et al 2016). However, I am not sure Wnt7a constitutes an example of a "positional information" signal, since as far as I know, it has not been shown to generate stable dorsal limb identities (that remain after the signal has stopped) - at least yet. If it has, the authors should cite the paper that showed this. I think that some sort of diagram to help define these visually will be really helpful, especially to people who do not study regenerative patterning.

      We thank the reviewer for this thoughtful comment. We now agree with the reviewer that our use of “positional cue” and “positional information” may have been confusing. In the revision—and as noted in our response to the Editor’s comment (4)—we have removed the term “positional cue” and no longer attempt to contrast it with “positional information.” Instead, we adopt phrasing that reflects our data and hypothesis: during limb patterning, dorsal-mediated signals act on ventral cells and ventral-mediated signals act on dorsal cells to induce Shh expression. This wording avoids implying that these signals specify dorsoventral identity.

      Regarding WNT7A, we agree it has not been shown to generate a stable dorsal identity after signal withdrawal. In the revised Introduction we therefore describe WNT7A in amniote limb development as an extracellular regulator that induces Lmx1b in dorsal mesenchyme (with En1 repressing Wnt7a ventrally), rather than labeling it as “positional information” in a strict, identity-imprinting sense. We highlight this contrast because, in our axolotl experiments, WNT10B and FGF2 did not alter Lmx1b expression or dorsal–ventral limb characteristics when overexpressed, consistent with the idea that they act downstream of DV identity to enable Shh induction, not to establish DV identity.

      (2) Results:

      It would be helpful if the number of replicates per sample group were reported in the figure legends.

      We thank the reviewer for this suggestion. In accordance with the comment, we have added the number of replicates (n) for each sample group in the figure legends.

      Figure 2 shows ISH for A/P and D/V transcripts in different-positioned blastemas without tissue grafts. The images show interesting patterns, including the lack of Shh expression in all blastemas except in posterior-located blastemas, and localization of the dorsal transcript (Lmx1b) to the dorsal half of A or P located blastemas. My only concern about this data is that the expression patterns are described in only a small part of the ectopic blastema (how representative is it?) and the diagrams infer that these expression patterns are reflective of the entire blastema, which can't be determined by the limited field of view. It is okay if the expression patterns are not present in the entire blastema -in fact, that might be an important observation in terms of who is generating (and might be receiving) these signals.

      We thank the reviewer for this insightful comment. Because Fgf8 and Shh expression was detectable only in a limited subset of cells, the original submission included only high-magnification images. In response to the reviewer’s valid concern about representativeness, we have now added low-magnification overviews of the entire blastema as a supplemental figure (Fig. S1) and clarified in the figure legend that these expression patterns can be focal rather than pan-blastemal (Line 795‒796).

      In Figure 3, they look at all of these expression patterns in the grafted blastemas, showing that Shh expression is only visible when both D and V cells are present in the blastema. My only concern about this data is that the number of replicates is very low (some groups having only an N=3), and it is unclear how many sections the authors visualized for each replicate. This is especially important for the sample groups where they report no Shh expression -I agree that it is not observable in the single example sections they provide, but it is uncertain what is happening in other regions of the blastema.

      We thank the reviewer for this important comment. To increase the reliability of the results, we have increased the number of biological replicates in groups where n was previously low. For all samples, we collected serial sections spanning the entire blastema. For blastemas in which Shh expression was observed, we present representative sections showing the signal. For blastemas without detectable Shh expression, we selected a section from the central region that contains GFP-positive cells for the Figure. To make these points explicit, we have added the following clarification to the Fig. 3 legend (Line 811‒815).

      Figure 4: Shh overexpression in A/P/D/V blastemas - expression induces ectopic limbs in A/D/V locations. They analyzed the symmetry of these regenerates (assuming that D- and V-located blastemas will exhibit D/V symmetry because they only contain cells from one side of that axis). I am a little concerned about how the symmetry assay is performed, since oblique sections through the digits could look asymmetric, while they are actually symmetric. It is also unclear how the angle of the boxes that the symmetry scores were based on was decided - I imagine that the score would change depending on the angle. It also appears that the authors picked different digits to perform this analysis on the different sample groups. I also admit that the logic of the AI-based classification scheme that the authors used for their symmetry scoring analysis (both in Figures 4 and 5) is elusive to me. I think it would have been more informative if the authors had leveraged structural landmarks, like the localization of specific muscle groups (if this experiment were performed in WT animals, the authors could have used pigment cell localization), or generated more proximal sections to look at landmarks in the zeugopod.

      We thank the reviewer for these detailed comments regarding the symmetry analysis. Because reliance on a computed symmetry score alone could raise the concerns noted by the reviewer, we now provide transverse sections along the proximodistal axis as supplemental figures (Figs. S2 and S4). These include levels corresponding to the distal end of the zeugopod and the proximal end of the autopod. In addition to reporting the symmetry score, we have explicitly stated in the text that symmetry was also assessed by visual inspection of these sections.

      As also noted in our response to Reviewer #1 (comment 15), ALM-induced limbs frequently exhibit abnormal and highly variable morphologies, which makes it difficult to use consistent anatomical landmarks such as particular digits or muscle groups. For this reason, we focused our analysis on morphological symmetry rather than landmark-based metrics, and we emphasize this rationale in the revised text (Line 232‒235).

      Regarding the use of bounding boxes, this procedure was chosen to minimize the effects of curvature or fixation-induced distortion. For each section, the box angle was adjusted so that the outer contour (epidermal surface) was aligned symmetrically; this procedure was applied uniformly across all conditions to avoid bias. We analyzed multiple biological replicates in each group, which helps mitigate potential artifacts due to oblique sectioning. To further reduce bias, we increased the number of fields included in the analysis to n = 24 per group in the revised version.
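
      To make the idea of scoring symmetry inside an aligned bounding box concrete, the following is a minimal sketch of one possible mirror-symmetry measure. It is an illustrative assumption, not the scoring procedure used in the study: the function name `mirror_symmetry_score`, the grid-of-labels input, and the choice of the horizontal midline as the dorsoventral mirror axis are all hypothetical.

```python
def mirror_symmetry_score(labels):
    """Fraction of pixels whose tissue label matches its mirror partner.

    `labels` is a rectangular grid (list of rows) of tissue-class labels
    cropped to an aligned bounding box. The mirror axis is assumed to be
    the horizontal midline of the box (hypothetical choice for illustration).
    """
    n_rows = len(labels)
    matches = 0
    total = 0
    # Compare each row in the top half with its mirror row in the bottom half.
    for i in range(n_rows // 2):
        top_row = labels[i]
        bottom_row = labels[n_rows - 1 - i]
        for top_label, bottom_label in zip(top_row, bottom_row):
            total += 1
            if top_label == bottom_label:
                matches += 1
    # A box with fewer than two rows has nothing to compare; treat it as symmetric.
    return matches / total if total else 1.0


if __name__ == "__main__":
    # Hypothetical 4x3 label grid: 0 = background, 1 = muscle, 2 = cartilage.
    section = [
        [0, 1, 0],
        [1, 2, 1],
        [1, 2, 1],
        [0, 1, 0],
    ]
    print(mirror_symmetry_score(section))
```

      A score near 1 would indicate a section whose two halves carry the same tissue labels, and a lower score an asymmetric section. Because the result depends on how the box is rotated, any such measure is sensitive to the alignment step described above.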

      In addition, staining intensity varied among samples, such that a region identified as “muscle” in one sample could be assigned differently in another if classification were based solely on color. To avoid this problem, we used a machine-learning classifier trained separately for each sample, allowing us to group the same tissues consistently within that sample irrespective of intensity differences. In the context of ALM-induced limbs, where stable anatomical landmarks are not available, we consider this strategy the most appropriate. We have added this rationale to the revised manuscript for clarity (Line 239‒247).

      Figure 5: The number of replicates in sample groups is relatively low and is quite variable between groups (ranging between 3 and 7 replicates). The zoomed-in window used to visualize Shh expression is small relative to the blastema, and it is difficult to discern why the authors positioned the window where they did, and how they maintained consistency among their different sample groups. In the examples of positive Shh expression, the signal is low and hard to see. Validating these expression patterns using some sort of quantitative transcriptional assay (like qRT-PCR) would increase the rigor of this experiment, especially given that they will be able to analyze gene expression in the entire blastema as opposed to sections that might not capture localized expression.

      We thank the reviewer for this important comment. To increase the rigor of these experiments, we have increased the number of biological replicates in groups where n was previously low. In addition, because Shh signal in the Wnt10b-electroporated VentBL images was particularly weak and difficult to discern, we replaced that panel with a representative example in which Shh signal is more clearly visible. We also validated the Shh expression for Wnt10b–electroporated VentBL and Fgf2–electroporated DorBL by RT-qPCR, which assesses gene expression across the entire blastema. These results are now included in Fig. 5 and Line 280‒282. Finally, we clarified in the figure legend how the “window” for imaging was chosen: for samples with detectable Shh expression, the window was placed in the region where the signal was observed; for conditions without detectable Shh expression, the window was positioned in a comparable region containing GFP-positive cells (Line 836‒839). These revisions are included in the revised manuscript.

      Figure 6: They treat dorsal and ventral wounds with gelatin beads soaked in a combination of BMP2+FGF8 (nerve factors) and FGF2 (proposed ventral factor). Remarkably, they observe ectopic limb expression in only dorsal wounds, further supporting the idea that FGF2 provides the "ventral" signal. They show examples of this impressive phenotype on limbs with multiple ectopic structures that formed along the Pr/Di axis. Including images of tubulin staining (as they have in Figures 1 and 2) would help to ensure that the blastemas (or final regenerates) are devoid of nerves. The authors' whole-mount skeletal staining shows fusion of the ectopic humerus with the host humerus, a phenotype associated with deep wounding, which could provide an opportunity for more cellular contribution from different limb axes.

      We thank the reviewer for these constructive comments. As noted in the prior study, when beads are used to induce blastemas without surgical nerve orientation, fine nerve ingrowth can still occur (Makanae et al., 2014), and the induced blastemas are not completely devoid of nerves. While it is still uncertain whether these recruited nerves are functional after blastema induction, it is an important point, and we added sentences about this in the revised manuscript (Line 341‒345).

      Regarding the skeletal phenotype, despite careful implantation to avoid injuring deep tissues, bead-induced ectopic limbs on the dorsal side occasionally displayed fusion of the stylopod with the host humerus—a phenotype associated with deep wounding, as the reviewer notes. This observation suggests that contributions from a broader cellular population cannot be excluded. However, because fusion was observed in only 1 of 16 induced limbs analyzed, and because ectopic limbs induced at the forearm (zeugopod) level did not exhibit such fusion (n=1/6 for stylopod-level inductions; n=0/10 for zeugopod-level inductions), we believe that our main conclusion remains valid. Because fusion is not a typical outcome, we now present representative non-fusion cases—including zeugopod-origin examples—in the figure (Fig. 6L1, L2), and we report the fusion incidence explicitly in the text (Line 350‒354). We also note in the revised manuscript that stylopod fusion can occur in a minority of cases (Line 347‒349).

      Figure 7 nicely summarizes their findings and model for patterning.

      We thank the reviewer for this positive comment.

      The table is cut off in the PDF, so it cannot be evaluated at this time.

      In our copy of the PDF, the table appears in full, so this may have been a formatting issue. We have carefully checked the file and ensured that the table is completely included in the revised submission.

      There is a supplemental figure that doesn't seem to be referenced in the text.

      The supplemental figure (Fig. S1 of the original manuscript) is referenced in the text, but it may have been overlooked. To improve clarity, we have expanded the description in the manuscript so that the supplemental figure is more clearly referenced (Line 285‒291).

      (3) Materials and Methods:

      No power analysis was performed to calculate sample group sizes. The authors have used these experimental techniques in the past and could have easily used past data to inform these calculations.

      We thank the reviewer for this important comment. We did not include a power analysis in the manuscript because this was the first time we compared Shh and other gene expression levels among ALM blastemas of different positional origins using RT-qPCR in our experimental system. As we did not have prior knowledge of the expected variability under these specific conditions, it was difficult to predetermine appropriate sample sizes.

      Reviewer #3 (Recommendations for the authors):

      General:

      Congratulations - I found this an elegant and easy-to-read study with significant implications for the field! If possible, I would urge you to consider adding some more characterisation of Wnt10b and Fgf2- which cell types are they expressed in? If you can link your mechanisms to normal limb regeneration too (i.e., regenerating blastema, not ALM), this would significantly elevate the interest in your study.

      We sincerely thank the reviewer for these encouraging comments. As also noted in our response to the editor’s comment, we have analyzed the expression patterns of Wnt10b and Fgf2 in regular blastemas (Line 294‒306). Although clear specific expression patterns along the dorsoventral axis were not detected by ISH, likely due to technical limitations of sensitivity, RT-qPCR revealed significantly higher expression levels of Wnt10b in the dorsal half and Fgf2 in the ventral half of a regular blastema (Fig. S5). In addition, we analyzed published single-cell RNA-seq data (7 dpa blastema, Li et al., 2021) (Line 307‒321). As a result, Fgf2 expression was observed in the mesenchymal clusters, whereas Wnt10b expression was observed in both mesenchymal and epithelial clusters (Fig. S6). However, because only a small fraction of cells expressed Wnt10b, the principal cellular source of WNT10B protein remains unclear. Therefore, defining the precise spatial patterns of Wnt10b and Fgf2 in regular regeneration will be an important goal for future work.

      Data availability:

      I assume that the RNA-sequencing data will be deposited at a public repository.

      RNA-seq FASTQ files have been deposited in the DNA Data Bank of Japan (DDBJ; https://www.ddbj.nig.ac.jp/) under BioProject accession PRJDB38065. We have added a Data availability section to the revised manuscript.

      References

      Castilla-Ibeas, A., Zdral, S., Oberg, K. C., & Ros, M. A. (2024). The limb dorsoventral axis: Lmx1b’s role in development, pathology, evolution, and regeneration. Developmental Dynamics, 253(9), 798–814. https://doi.org/10.1002/dvdy.695

      Johnson, G. L., Glasser, M. B., Charles, J. F., Duryea, J., & Lehoczky, J. A. (2022). En1 and Lmx1b do not recapitulate embryonic dorsal-ventral limb patterning functions during mouse digit tip regeneration. Cell Reports, 41(8), 111701. https://doi.org/10.1016/j.celrep.2022.111701

      Stocum, D. (2017). Mechanisms of urodele limb regeneration. Regeneration, 4. https://doi.org/10.1002/reg2.92

      Tank, P. W., & Holder, N. (1978). The effect of healing time on the proximodistal organization of double-half forelimb regenerates in the axolotl, Ambystoma mexicanum. Developmental Biology, 66(1), 72–85. https://doi.org/10.1016/0012-1606(78)90274-9

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Because the "source" and "target" tasks are merely parameter variations of the same paradigm, it is unclear whether EIDT achieves true cross-task transfer. The manuscript provides no measure of how consistent each participant's behaviour is across these variants (e.g., two- vs three-step MDP; easy vs difficult MNIST). Without this measure, the transfer results are hard to interpret. In fact, Figure 5 shows a notable drop in accuracy when transferring between the easy and difficult MNIST conditions, compared to transfers between accuracy-focused and speed-focused conditions. Does this discrepancy simply reflect larger within-participant behavioural differences between the easy and difficult settings? A direct analysis of intra-individual similarity for each task pair and how that similarity is related to EIDT's transfer performance is needed.

      Thank you for your insightful comment. We agree that the tasks used in our study are variations of the same paradigm. Accordingly, we have revised the manuscript to consistently frame our findings as demonstrating individuality transfer "across task conditions" rather than "across distinct tasks."

      In response to your suggestion, we have conducted a new analysis to directly investigate the relationship between individual behavioural patterns and transfer performance. As shown in the new Figures 4, 11, S8, and S9, we found a clear relationship between the distance in the space of individual latent representation (called individuality index in the previous manuscript) and prediction performance. Specifically, prediction accuracy for a given individual's behaviour degrades as the latent representation of the model's source individual becomes more distant. This result directly demonstrates that our framework captures meaningful individual differences that are predictive of transfer performance across conditions.

      We have also expanded the Discussion (Lines 332--343) to address the potential for applying this framework to more structurally distinct tasks, hypothesizing that this would rely on shared underlying cognitive functions.

      Related to the previous comment, the individuality index is central to the framework, yet remains hard to interpret. It shows much greater within-participant variability in the MNIST experiment (Figure S1) than in the MDP experiment (Figure 3). Is such a difference meaningful? It is hard to know whether it reflects noisier data, greater behavioural flexibility, or limitations of the model.

      Thank you for raising this important point about interpretability. To enhance the interpretability of the individual latent representation, we have added a new analysis for the MDP task (see Figures 6 and S4). By applying our trained encoder to data from simulated Q-learning agents with known parameters, we demonstrate that the dimensions of the latent space systematically map onto the agents' underlying cognitive parameters (learning rate and inverse temperature). This analysis provides a clearer interpretation by linking our model's data-driven representation to established theoretical constructs.
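
      For readers less familiar with the two cognitive parameters named above, the following is a minimal sketch of a Q-learning agent with a learning rate and a softmax inverse temperature. It assumes a simple single-state bandit rather than the multi-step MDP used in the study, and every name in it (`softmax_choice`, `simulate_agent`, `reward_probs`) is hypothetical rather than taken from the authors' code.

```python
import math
import random


def softmax_choice(q_values, beta, rng):
    """Sample an action with probability proportional to exp(beta * Q)."""
    q_max = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp(beta * (q - q_max)) for q in q_values]
    threshold = rng.random() * sum(weights)
    cumulative = 0.0
    for action, weight in enumerate(weights):
        cumulative += weight
        if threshold < cumulative:
            return action
    return len(q_values) - 1  # guard against floating-point round-off


def simulate_agent(alpha, beta, reward_probs, n_trials, seed=0):
    """Simulate one agent on a hypothetical bandit with Bernoulli rewards.

    alpha: learning rate in [0, 1]; beta: inverse temperature (>= 0).
    Returns the action sequence, the reward sequence, and the final Q-values.
    """
    rng = random.Random(seed)
    q_values = [0.0] * len(reward_probs)
    actions, rewards = [], []
    for _ in range(n_trials):
        action = softmax_choice(q_values, beta, rng)
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        # Delta-rule update: move Q towards the observed reward.
        q_values[action] += alpha * (reward - q_values[action])
        actions.append(action)
        rewards.append(reward)
    return actions, rewards, q_values


if __name__ == "__main__":
    # Two hypothetical agents that differ only in inverse temperature.
    for beta in (0.5, 8.0):
        actions, rewards, _ = simulate_agent(0.3, beta, [0.2, 0.8], 500, seed=1)
        print(beta, sum(rewards) / len(rewards))
```

      Sweeping `alpha` and `beta` over a grid and passing the resulting action and reward sequences through a trained encoder is, in outline, the kind of analysis the response describes. The encoder itself is not sketched here.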

      Regarding the greater within-participant variability observed in the MNIST task (visualized now in Figure S7), this could be attributed to several factors, such as greater behavioural flexibility in the perceptual task. However, disentangling these potential factors is complex and falls outside the primary scope of the current study, which prioritizes demonstrating robust prediction accuracy across different task conditions.

      The authors suggest that the model's ability to generalize to new participants "likely relies on the fact that individuality indices form clusters and individuals similar to new participants exist in the training participant pool". It would be helpful to directly test this hypothesis by quantifying the similarity (or distance) of each test participant's individuality index to the individuals or identified clusters within the training set, and assessing whether greater similarity (or closer proximity) to the clusters in the training set is associated with higher prediction accuracy for those individuals in the test set.

      Thank you for this excellent suggestion. We have performed the analysis you proposed to directly test this hypothesis. Our new results, presented in Figures 4, 11, S5, S8, and S9, quantify the distance between the latent representation of a test participant and that of the source participant used to generate the prediction model.

      The results show a significant negative correlation: prediction accuracy consistently decreases as the distance in the latent space increases. This confirms that generalization performance is directly tied to the similarity of behavioural patterns as captured by our latent representation, strongly supporting our hypothesis.
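
      The relationship described above can be illustrated with a short sketch that computes latent-space distances and correlates them with prediction accuracy. The use of Euclidean distance and Pearson's r is an assumption made for illustration, since the response does not state which distance or correlation measure was used, and the function names and example values are hypothetical, not data from the study.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two latent vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def distance_accuracy_correlation(source_latent, test_latents, accuracies):
    """Distances from the source participant and their correlation with accuracy."""
    distances = [euclidean(source_latent, z) for z in test_latents]
    return distances, pearson_r(distances, accuracies)


if __name__ == "__main__":
    # Made-up two-dimensional latent vectors and accuracies, for illustration only.
    source = (0.0, 0.0)
    tests = [(0.1, 0.2), (0.5, 0.4), (1.0, 0.9), (1.5, 1.2)]
    accuracies = [0.81, 0.78, 0.70, 0.66]
    distances, r = distance_accuracy_correlation(source, tests, accuracies)
    print(distances, r)
```

      A negative coefficient from such a calculation would correspond to the pattern reported in the response, namely accuracy falling as the test participant moves away from the source participant in latent space.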

      Reviewer #2 (Public review):

      The MNIST SX baseline appears weak. RTNet isn't directly comparable in structure or training. A stronger baseline would involve training the GRU directly on the task without using the individuality index-e.g., by fixing the decoder head. This would provide a clearer picture of what the index contributes.

      We agree that a more direct baseline is crucial for evaluating the contribution of our transfer mechanism. For the Within-Condition Prediction scenario, the comparison with RTNet was intended only to validate that our task solver architecture could achieve average human-level task performance (Figure 7).

      For the critical Cross-Condition Transfer scenario, we have now implemented a stronger and more appropriate baseline, which we call "task solver (source)." This model has the same architecture as our EIDT task solver but is trained directly on the source task data of the specific test participant. As shown in revised Figure 9, our EIDT framework significantly outperforms this direct-training approach, clearly demonstrating the benefit of the individuality transfer mechanism.

      Although the focus is on prediction, the framework could offer more insight into how behaviour in one task generalizes to another. For example, simulating predicted behaviours while varying the individuality index might help reveal what behavioural traits it encodes.

      Thank you for this valuable suggestion. To provide more insight into the encoded behavioural traits, we have conducted a new analysis linking the individual latent representation to a theoretical cognitive model. As detailed in the revised manuscript (Figures 6 and S4), we applied our encoder to simulated data from Q-learning agents with varying parameters. The results show a systematic relationship between the latent space coordinates and the agents' learning rates and inverse temperatures, providing a clearer interpretation of what the representation captures.

      It's not clear whether the model can reproduce human behaviour when acting on-policy. Simulating behaviour using the trained task solver and comparing it with actual participant data would help assess how well the model captures individual decision tendencies.

      We have added the suggested on-policy evaluation (Lines 195–207). In the revised manuscript (Figure 5), we present results from simulations where the trained task solvers performed the MDP task. We compared their performance (total reward and rate of the highly-rewarding action selected) against their corresponding human participants. The strong correlations observed demonstrate that our model successfully captures and reproduces individual-specific behavioural tendencies in an on-policy setting.

      Figures 3 and S1 aim to show that individuality indices from the same participant are closer together than those from different participants. However, this isn't fully convincing from the visualizations alone. Including a quantitative presentation would help support the claim.

      We agree that the original visualizations of inter- and intra-participant distances were not sufficiently convincing. We have therefore removed that analysis. In its place, we have introduced a more direct and quantitative analysis that explicitly links the individual latent representation to prediction performance (see Figures 4, 11, S5, S8, and S9). This new analysis demonstrates that prediction error for an individual is a function of distance in the latent space, providing stronger evidence that the representation captures meaningful, individual-specific information.

      The transfer scenarios are often between very similar task conditions (e.g., different versions of MNIST or two-step vs three-step MDP). This limits the strength of the generalization claims. In particular, the effects in the MNIST experiment appear relatively modest, and the transfer is between experimental conditions within the same perceptual task. To better support the idea of generalizing behavioural traits across tasks, it would be valuable to include transfers across more structurally distinct tasks.

      We agree with this limitation and have revised the manuscript to be more precise. We now frame our contribution as "individuality transfer across task conditions" rather than "across tasks" to accurately reflect the scope of our experiments. We have also expanded the Discussion section (Lines 332–343) to address the potential and challenges of applying this framework to more structurally distinct tasks, noting that it would likely depend on shared underlying cognitive functions.

      For both experiments, it would help to show basic summaries of participants' behavioural performance. For example, in the MDP task, first-stage choice proportions based on transition types are commonly reported. These kinds of benchmarks provide useful context.

      We have added behavioral performance summaries as requested. For the MDP task, Figure 5 now compares the total reward and rate of the highly-rewarding action selected between humans and our model. For the MNIST task, Figure 7 shows the rate of correct responses for humans, RTNet, and our task solver across all conditions. These additions provide better context for the model's performance.

      For the MDP task, consider reporting the number or proportion of correct choices in addition to negative log-likelihood. This would make the results more interpretable.

      Thank you for the suggestion. To make the results more interpretable, we have added a new prediction performance metric: the rate for behaviour matched. This metric measures the proportion of trials where the model's predicted action matches the human's actual choice. This is now included alongside the negative log-likelihood in Figures 2, 3, 4, 8, 9, and 11.
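
      The "rate for behaviour matched" metric described above can be sketched as follows. This is a minimal illustration that assumes the model's predicted action on each trial is the action with the highest predicted probability; the function name and input layout are hypothetical.

```python
def behaviour_match_rate(action_probs, human_actions):
    """Proportion of trials on which the most probable predicted action equals the human's choice.

    action_probs: one list of action probabilities per trial (assumed model output).
    human_actions: the index of the action the participant actually chose on each trial.
    """
    if len(action_probs) != len(human_actions) or not human_actions:
        raise ValueError("inputs must be non-empty and have the same number of trials")
    matches = 0
    for probs, human in zip(action_probs, human_actions):
        # Assumption: the predicted action is the argmax of the predicted probabilities.
        predicted = max(range(len(probs)), key=probs.__getitem__)
        matches += int(predicted == human)
    return matches / len(human_actions)
```

      Unlike the negative log-likelihood, this quantity ignores how confident the model was, which is why the two metrics are reported side by side.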

      In Figure 5, what is the difference between the "% correct" and "% match to behaviour"? If so, it would help to clarify the distinction in the text or figure captions.

      We have clarified these terms in the revised manuscript. As defined in the Results section (Lines 116–122, 231), "% correct" (now "rate of correct responses") is a measure of task performance, whereas "% match to behaviour" (now "rate for behaviour matched") is a measure of prediction accuracy.

      For the cognitive model, it would be useful to report the fitted parameters (e.g., learning rate, inverse temperature) per individual. This can offer insight into what kinds of behavioural variability the individual latent representation might be capturing.

      We have added histograms of the fitted Q-learning parameters for the human participants in Supplementary Materials (Figure S1). This analysis revealed which parameters varied most across the population and directly informed the design of our subsequent simulation study with Q-learning agents (see response to Comment 2-2), where we linked these parameters to the individual latent representation (Lines 208–223).

      A few of the terms and labels in the paper could be made more intuitive. For example, the name "individuality index" might give the impression of a scalar value rather than a latent vector, and the labels "SX" and "SY" are somewhat arbitrary. You might consider whether clearer or more descriptive alternatives would help readers follow the paper more easily.

      We have adopted the suggested changes for clarity.

      "Individuality index" has been changed to "individual latent representation".

      "Situation SX" and "Situation SY" have been renamed to the more descriptive "Within-Condition Prediction" and "Cross-Condition Transfer", respectively.

      We have also added a table in Figure 7 to clarify the MNIST condition acronyms (EA/ES/DA/DS).

      Please consider including training and validation curves for your models. These would help readers assess convergence, overfitting, and general training stability, especially given the complexity of the encoder-decoder architecture.

      Training and validation curves for both the MDP and MNIST tasks have been added to Supplementary Materials (Figure S2 and S6) to show model convergence and stability.

      Reviewer #3 (Public review):

      To demonstrate the effectiveness of the approach, the authors compare a Q-learning cognitive model (for the MDP task) and RTNet (for the MNIST task) against the proposed framework. However, as I understand it, neither the cognitive model nor RTNet is designed to fit or account for individual variability. If that is the case, it is unclear why these models serve as appropriate baselines. Isn't it expected that a model explicitly fitted to individual data would outperform models that do not? If so, does the observed superiority of the proposed framework simply reflect the unsurprising benefit of fitting individual variability? I think the authors should either clarify why these models constitute fair control or validate the proposed approach against stronger and more appropriate baselines.

      Thank you for raising this critical point. We wish to clarify the nature of our baselines:

      For the MDP task, the cognitive model baseline was indeed designed to account for individual variability. We estimated its parameters (e.g., learning rate) from each individual's source task behaviour and then used those specific parameters to predict their behaviour in the target task. This makes it a direct, parameter-based transfer model and thus a fair and appropriate baseline for individuality transfer.

      For the MNIST task, we agree that the RTNet baseline was insufficient for evaluating individual-level transfer in the "Cross-Condition Transfer" scenario. We have now introduced a much stronger baseline, the "task solver (source)," which is trained specifically on the source task data of each test participant. Our results (Figure 9) show that the EIDT framework significantly outperforms this more appropriate, individualized baseline, highlighting the value of our transfer method over direct, within-condition fitting.

      It's not very clear in the results section what it means by having a shorter within-individual distance than between-individual distances. Related to the comment above, is there any control analysis performed for this? Also, this analysis appears to have nothing to do with predicting individual behavior. Is this evidence toward successfully parameterizing individual differences? Could this be task-dependent, especially since the transfer is evaluated on exceedingly similar tasks in both experiments? I think a bit more discussion of the motivation and implications of these results will help the reader in making sense of this analysis.

      We agree that the previous analysis on inter- and intra-participant distances was not sufficiently clear or directly linked to the model's predictive power. We have removed this analysis from the manuscript. In its place, we have introduced a new, more direct analysis (Figures 4, 11, S5, S8, and S9) that demonstrates a quantitative relationship between the distance in the latent space and prediction accuracy. This new analysis shows that prediction error for an individual increases as a function of this distance, providing much stronger and clearer evidence that our framework successfully parameterizes meaningful individual differences.

      The authors have to better define what exactly he meant by transferring across different "tasks" and testing the framework in "more distinctive tasks". All presented evidence, taken at face value, demonstrated transferring across different "conditions" of the same task within the same experiment. It is unclear to me how generalizable the framework will be when applied to different tasks.

      Conceptually, it is also unclear to me how plausible it is that the framework could generalize across tasks spanning multiple cognitive domains (if that's what is meant by more distinctive). For instance, how can an individual's task performance on a Posner task predict task performance on the Cambridge face memory test? Which part of the framework could have enabled such a cross-domain prediction of task performance? I think these have to be at least discussed to some extent, since without it the future direction is meaningless.

      We agree with your assessment and have corrected our terminology throughout the manuscript. We now consistently refer to the transfer as being "across task conditions" to accurately describe the scope of our findings.

      We have also expanded our Discussion (Lines 332–343) to address the important conceptual point about cross-domain transfer. We hypothesize that such transfer would be possible if the tasks, even if structurally different, rely on partially shared underlying cognitive functions (e.g., working memory). In such a scenario, the individual latent representation would capture an individual's specific characteristics related to that shared function, enabling transfer. Conversely, we state that transfer between tasks with no shared cognitive basis would not be expected to succeed with our current framework.

      How is the negative log-likelihood, which seems to be the main metric for comparison, computed? Is this based on trial-by-trial response prediction or probability of responses, as what usually performed in cognitive modelling?

      The negative log-likelihood is computed on a trial-by-trial basis. It is based on the probability the model assigned to the specific action that the human participant actually took on that trial. This calculation is applied consistently across all models (cognitive models, RTNet, and EIDT). We have added sentences to the Results section to clarify this point (Lines 116–122).
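
      The trial-by-trial computation described above can be sketched as follows. This is a minimal illustration; whether the manuscript reports the sum or the per-trial mean is not stated here, so the summed form and the small floor `eps` are assumptions.

```python
import math


def negative_log_likelihood(action_probs, human_actions, eps=1e-12):
    """Sum over trials of -log p(action the participant actually took).

    action_probs: one list of predicted action probabilities per trial.
    human_actions: the index of the action chosen by the participant on each trial.
    eps: floor on the probability to avoid log(0) (an illustrative safeguard).
    """
    total = 0.0
    for probs, human in zip(action_probs, human_actions):
        total -= math.log(max(probs[human], eps))
    return total
```

      For two trials in which the model assigned probabilities 0.5 and 0.75 to the chosen actions, the value is ln 2 + ln(4/3), roughly 0.98. Dividing by the number of trials would give the per-trial average.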

      None of the presented evidence is cross-validated. The authors should consider performing K-fold cross-validation on the train, test, and evaluation split of subjects to ensure robustness of the findings.

      All prediction performance results reported in the revised manuscript are now based on a rigorous leave-one-participant-out cross-validation procedure to ensure the robustness of our findings. We have updated the Methods section to reflect this (Lines 127–129 and 229).

      For some purely illustrative visualizations (e.g., plotting the entire latent space in Figures S3 and S7), we used a model trained on all participants' data to provide a single, representative example and avoid clutter. We have explicitly noted this in the relevant figure captions.
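
      The leave-one-participant-out procedure mentioned above can be sketched as a generator of train/test splits at the participant level. This is a minimal illustration with hypothetical participant identifiers; the model training and evaluation steps are deliberately left out.

```python
def leave_one_participant_out(participant_ids):
    """Yield (training participants, held-out participant) pairs.

    Each participant is held out exactly once, so no data from the test
    participant is seen during training for that fold.
    """
    unique_ids = sorted(set(participant_ids))
    for held_out in unique_ids:
        train = [p for p in unique_ids if p != held_out]
        yield train, held_out
```

      Splitting by participant rather than by trial matters here because trials from the same person are not independent.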

      The authors excluded 25 subjects (20% of the data) for different reasons. This is a substantial proportion, especially by the standards of what is typically observed in behavioral experiments. The authors should provide a clear justification for these exclusion criteria and, if possible, cite relevant studies that support the use of such stringent thresholds.

      We acknowledge the concern regarding the exclusion rate. The previous criteria were indeed empirical. We have now implemented a more systematic exclusion procedure based on the interquartile range of performance metrics, which is detailed in Section 4.2.2 (Lines 489–498). This revised, objective criterion resulted in the exclusion of 42 participants (34% of the initial sample). While this rate is high, we attribute it to the online nature of the data collection, where participant engagement can be more variable. We believe applying these strict criteria was necessary to ensure the quality and reliability of the behavioural data used for modeling.
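
      An interquartile-range criterion of the kind mentioned above can be sketched as follows. The multiplier of 1.5 (Tukey's fences), the two-sided bounds, and the single performance score per participant are all assumptions for illustration; the exact rule is the one given in Section 4.2.2 of the manuscript.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at k times the IQR beyond the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def exclude_outliers(scores, k=1.5):
    """Keep participants whose score lies within the IQR fences.

    scores: mapping from a (hypothetical) participant ID to a performance metric.
    """
    low, high = iqr_bounds(list(scores.values()), k)
    return {pid: s for pid, s in scores.items() if low <= s <= high}
```

      Because the fences are derived from the sample itself, the same rule can be applied to each performance metric without hand-tuned thresholds.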

      The authors should do a better job of creating the figures and writing the figure captions. It is unclear which specific claim the authors are addressing with the figure. For example, what is the key message of Figure 2C regarding transfer within and across participants? Why are the stats presentation different between the Cognitive model and the EIDT framework plots? In Figure 3, it's unclear what these dots and clusters represent and how they support the authors' claim that the same individual forms clusters. And isn't this experiment have 98 subjects after exclusion, this plot has way less than 98 dots as far as I can tell. Furthermore, I find Figure 5 particularly confusing, as the underlying claim it is meant to illustrate is unclear. Clearer figures and more informative captions are needed to guide the reader effectively.

      We agree that several figures and analyses in the original manuscript were unclear, and we have thoroughly revised our figures and their captions to improve clarity.

      The confusing analyses in the old Figures 2C and 5 (Original/Others comparison) have been completely removed. The unclear visualization of the latent space for the test pool (old Figure 3 showing representations only from test participants) has also been removed to avoid confusion. For visualization of the overall latent space, we now use models trained on all data (Figures S3 and S7) and have clarified this in the captions. In place of these removed analyses, we have introduced a new, more intuitive "cross-individual" analysis (presented in Figures 4, 11, S5, S8, and S9). As explained in the new, more detailed captions, this analysis directly plots prediction performance as a function of the distance in latent space, providing a much clearer demonstration of how the latent representation relates to predictive accuracy.

      I also find the writing somewhat difficult to follow. The subheadings are confusing, and it's often unclear which specific claim the authors are addressing. The presentation of results feels disorganized, making it hard to trace the evidence supporting each claim. Also, the excessive use of acronyms (e.g., SX, SY, CG, EA, ES, DA, DS) makes the text harder to parse. I recommend restructuring the results section to be clearer and significantly reducing the use of unnecessary acronyms.

      Thank you for this feedback. We have made significant revisions to improve the clarity and organization of the manuscript. We have renamed confusing acronyms: "Situation SX" is now "Within-Condition Prediction," and "Situation SY" is now "Cross-Condition Transfer." We also added a table to clarify the MNIST condition acronyms (EA/ES/DA/DS) in Figure 7.

      The Results section has been substantially restructured with clearer subheadings.

    1. Author response:

      Reviewer #1

      (1) Mechanistic insight into how Hsp70 but not Hsc70 increase PL-SF FL tau aggregation/pathology is missing. This is despite both chaperones binding to PL-SF FL tau. What species of tau does Hsp70 bind, and what cofactors are important in this process?

      We agree that explaining why Hsp70, but not Hsc70, promotes tau aggregation would strengthen the study. Although both chaperones bind tau, they diverge slightly in 1) protein sequence, 2) biochemical activity, and 3) co-chaperone engagement.

      Sequence: Hsp70 has an extra cysteine residue (Cys306) that is highly reactive to oxidation and a glycine residue that is critical for cysteine oxidation (Gly557). Both residues are specific to Hsp70 (not present in Hsc70) and may alter Hsp70 conformation or client handling (Hong et al., 2022).

      Biochemical activity: Prior studies indicate that Hsp70’s ATPase domain (NBD) is critical for tau interactions (Jinwal et al., 2009; Fontaine et al., 2015; Young et al., 2016) and can be disrupted with point mutations including K71E and E175S for ATPase and A406G/V438G for substrate binding (Fontaine et al., 2015).

      Co-chaperone engagement: Hsp70 recruits the co-chaperone and E3 ubiquitin ligase CHIP/Stub1 more strongly than Hsc70, suggesting co-chaperone engagement could lead to differences in tau processing (Jinwal et al., 2013).

      To directly test how the two closely related chaperones could differentially impact tau, we plan to perform the following experiments:

      (a) We will mutate residues responsible for cysteine reactivity in Hsp70 including the cysteine itself (Cys306) and the critical glycine that facilitates cysteine reactivity (Gly557). These residues will be deleted from Hsp70 or alternatively inserted into Hsc70 to determine whether cysteine reactivity is the reason for Hsp70’s ability to drive tau aggregation.

      (b) We will generate Hsp70 mutants lacking ATPase or substrate-binding activity to determine which Hsp70 domains are responsible for driving tau aggregation.

      (c) We will perform seeding assays in stable tau-expressing cell lines to determine whether Hsp70/Hsc70 overexpression or depletion alters seeded tau aggregation.

      (d) We will perform confocal microscopy to determine the extent of co-localization of Hsp70 or Hsc70 with phospho-tau, oligomeric tau, or Thioflavin-S (ThioS) to identify which tau species are engaged by Hsp70/Hsc70.

      (e) We will perform immunoprecipitation pull-downs followed by mass spectrometry to globally identify any relevant Hsp70/Hsc70 interacting factors that might account for the differences in tau aggregation.

      (2) The study relies heavily on densitometry of bands to draw conclusions; in several instances, the blots are overexposed to accurately quantify the signal.

      All immunoblots were acquired as 16-bit TIFFs with exposure settings chosen to prevent pixel saturation, and quantification was performed on raw, unsaturated images. Brightness and contrast adjustments were applied only for visualization and did not alter pixel values used for analysis. All quantified bands fell within the linear range of the detector, with one exception in Figure 7B, which we removed from quantification. We will add both low- and high-exposure versions of immunoblots to the revised figures to demonstrate signal linearity and dynamic range.

      Reviewer #2

      (1) Although the PL-SF model can accelerate tau aggregation, it is crucial to determine whether this aligns with the temporal progression and spatial distribution of tau pathology in the brains of patients with tauopathies.

      No single tauopathy model fully recapitulates the temporal and spatial progression of human tauopathies. The PL-SF system is not intended to model the disease course. Rather, it is an excellent model for mechanistic studies of mature tau aggregation, which is otherwise challenging to study. We note that prior studies showed that PL-SF tau expression in transgenic mice (Xia et al., 2022 and Smith et al., 2025) and rhesus monkeys (Beckman et al., 2021) led to prion-like tau seeding and aggregation in hippocampal and cortical regions. Indeed, the spatial and temporal tau aggregation patterns aligned with features of human tauopathies. So far, these findings all support PL-SF as a valid accelerated model of tauopathy that can be used to interrogate pathogenic mechanisms that impact tau processing, degradation, and/or aggregation.

      (2) The authors did not elucidate the specific molecular mechanism by which Hsp70 promotes tau aggregation.

      We agree that a deeper understanding of the molecular mechanism is needed. The revision experiments outlined above (Reviewer #1, point #1) will define how Hsp70 promotes tau aggregation by testing sequence contributions, dissecting ATPase and substrate-binding domain requirements, and mapping Hsp70/Hsc70 interactors to directly address this mechanistic question.

      (3) Some figures in this study show large error bars in the quantitative data (some statistical analysis figures, MEA recordings, etc.), indicating significant inter-sample variability. It is recommended to label individual data points in all quantitative figures and clearly indicate them in figure legends.

      We acknowledge the inter-sample variability in some of the quantitative datasets. This level of variability can occur in primary neuronal cultures (e.g., MEA recordings) that are sensitive to growth and surface adhesion conditions, leading to many technical considerations. To improve transparency and interpretation, we will revise all quantitative figures to display individual data points overlaid on summary statistics and will update figure legends to clearly indicate sample sizes and statistical tests used.

      References

      Hong Z, Gong W, Yang J, Li S, Liu Z, Perrett S, Zhang H. Exploration of the cysteine reactivity of human inducible Hsp70 and cognate Hsc70. J Biol Chem. 2023 Jan;299(1):102723. doi: 10.1016/j.jbc.2022.102723. Epub 2022 Nov 19. PMID: 36410435; PMCID: PMC9800336.

      Jinwal UK, Miyata Y, Koren J 3rd, Jones JR, Trotter JH, Chang L, O'Leary J, Morgan D, Lee DC, Shults CL, Rousaki A, Weeber EJ, Zuiderweg ER, Gestwicki JE, Dickey CA. Chemical manipulation of hsp70 ATPase activity regulates tau stability. J Neurosci. 2009 Sep 30;29(39):12079-88. doi: 10.1523/JNEUROSCI.3345-09.2009. PMID: 19793966; PMCID: PMC2775811.

      Fontaine SN, Rauch JN, Nordhues BA, Assimon VA, Stothert AR, Jinwal UK, Sabbagh JJ, Chang L, Stevens SM Jr, Zuiderweg ER, Gestwicki JE, Dickey CA. Isoform-selective Genetic Inhibition of Constitutive Cytosolic Hsp70 Activity Promotes Client Tau Degradation Using an Altered Co-chaperone Complement. J Biol Chem. 2015 May 22;290(21):13115-27. doi: 10.1074/jbc.M115.637595. Epub 2015 Apr 11. PMID: 25864199; PMCID: PMC4505567.

      Young ZT, Rauch JN, Assimon VA, Jinwal UK, Ahn M, Li X, Dunyak BM, Ahmad A, Carlson G, Srinivasan SR, Zuiderweg ERP, Dickey CA, Gestwicki JE. Stabilizing the Hsp70‑Tau Complex Promotes Turnover in Models of Tauopathy. Cell Chem Biol. 2016 Aug 4;23(8):992–1001. doi:10.1016/j.chembiol.2016.04.014.

      Jinwal UK, Akoury E, Abisambra JF, O'Leary JC 3rd, Thompson AD, Blair LJ, Jin Y, Bacon J, Nordhues BA, Cockman M, Zhang J, Li P, Zhang B, Borysov S, Uversky VN, Biernat J, Mandelkow E, Gestwicki JE, Zweckstetter M, Dickey CA. Imbalance of Hsp70 family variants fosters tau accumulation. FASEB J. 2013 Apr;27(4):1450-9. doi: 10.1096/fj.12-220889. Epub 2012 Dec 27. PMID: 23271055; PMCID: PMC3606536.

      Xia, Y., Prokop, S., Bell, B.M. et al. Pathogenic tau recruits wild-type tau into brain inclusions and induces gut degeneration in transgenic SPAM mice. Commun Biol 5, 446 (2022). https://doi.org/10.1038/s42003-022-03373-1.

      Smith ED, Paterno G, Bell BM, Gorion KM, Prokop S, Giasson BI. Tau from SPAM Transgenic Mice Exhibit Potent Strain-Specific Prion-Like Seeding Properties Characteristic of Human Neurodegenerative Diseases. Neuromolecular Med. 2025 May 30;27(1):44. doi: 10.1007/s12017-025-08850-4. PMID: 40447946; PMCID: PMC12125038.

      Beckman D, Chakrabarty P, Ott S, Dao A, Zhou E, Janssen WG, Donis-Cox K, Muller S, Kordower JH, Morrison JH. A novel tau-based rhesus monkey model of Alzheimer's pathogenesis. Alzheimers Dement. 2021 Jun;17(6):933-945. doi: 10.1002/alz.12318. Epub 2021 Mar 18. PMID: 33734581; PMCID: PMC8252011.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review): 

      Summary: 

      This study presents convincing findings that oligodendrocytes play a regulatory role in spontaneous neural activity synchronisation during early postnatal development, with implications for adult brain function. Utilising targeted genetic approaches, the authors demonstrate how oligodendrocyte depletion impacts Purkinje cell activity and behaviours dependent on cerebellar function. Delayed myelination during critical developmental windows is linked to persistent alterations in neural circuit function, underscoring the lasting impact of oligodendrocyte activity. 

      Strengths: 

      (1) The research leverages the anatomically distinct olivocerebellar circuit, a well-characterized system with known developmental timelines and inputs, strengthening the link between oligodendrocyte function and neural synchronization. 

      (2) Functional assessments, supported by behavioral tests, validate the findings of in vivo calcium imaging, enhancing the study's credibility. 

      (3) Extending the study to assess the long-term effects of early-life myelination disruptions adds depth to the implications for both circuit function and behavior.

      We appreciate this positive evaluation.

      Weaknesses: 

      (1) The study would benefit from a closer analysis of myelination during the periods when synchrony is recorded. Direct correlations between myelination and synchronized activity would substantiate the mechanistic link and clarify if observed behavioral deficits stem from altered myelination timing. 

      We appreciate the reviewer’s thoughtful suggestion and have expanded the manuscript to clarify how oligodendrocyte maturation relates to the development of Purkinje-cell synchrony. The developmental trajectory of Purkinje-cell synchrony has already been comprehensively characterized by Good et al. (2017, Cell Reports 21: 2066–2073): synchrony drops from a high level at P3–P5 to adult-like values by P8. We found that myelin in the cerebellum first appears at P5–P7 (Figure S1A, B), indicating that the timing of Purkinje cell desynchronization coincides with the initial appearance of oligodendrocytes and myelin in the cerebellum. To determine whether myelin growth could nevertheless modulate this process, we quantified ASPA-positive oligodendrocyte density and MBP-positive bundle thickness and area at P10, P14, P21 and adulthood (Fig. 1J, K, Fig. S1E). Both metrics increase monotonically and clearly lag behind the rapid drop in synchrony, indicating that myelination may not be the primary trigger for the desynchronization. When oligodendrocytes were ablated during the second postnatal week, synchrony was reduced (new Fig. 2). Thus, once myelination is underway, oligodendrocytes become critical for maintaining synchrony, acting not as the initiators but as the stabilizers and refiners of the mature network state.

      We have now added a new subsection to the Discussion (lines 451–467) in which we propose a two-phase model. Phase I (P3–P8): high early synchrony is generated by non-myelin mechanisms (e.g. transient gap junctions, shared climbing-fiber input). Phase II (P8 onward): as oligodendrocytes proliferate and ensheath axons, they fine-tune conduction velocity and stabilize the mature, low-synchrony network state.

      We believe these additions fully address the reviewer’s concerns.

      (2) Although the study focuses on Purkinje cells in the cerebellum, neural synchrony typically involves cross-regional interactions. Expanding the discussion on how localized Purkinje synchrony affects broader behaviors - such as anxiety, motor function, and sociality - would enhance the findings' functional significance.

      We appreciate the reviewer’s helpful suggestion and have expanded the Discussion (lines 543–564) to clarify how localized Purkinje-cell synchrony can influence broader behavioral domains. In the revised text we note that changes in PC synchrony propagate into thalamic, prefrontal, limbic, and parietal targets, thereby impacting distributed networks involved in motor coordination, affect, and social interaction. Our optogenetic rescue experiments further support this framework, as transient resynchronization of PCs normalized sociability and motor coordination while leaving anxiety-like behavior impaired. This dissociation highlights that different behavioral domains rely to varying degrees on precise cerebellar synchrony and underscores how even localized perturbations in Purkinje timing can acquire system-level significance.

      (3) The authors discuss the possibility of oligodendrocyte-mediated synapse elimination as a possible mechanism behind their findings, drawing from relevant recent literature on oligodendrocyte precursor cells. However, there are no data presented supporting this assumption. The authors should explain why they think the mechanism behind their observation extends beyond the contribution of myelination or remove this point from the discussion entirely.

      We thank the reviewer for pointing out that our original discussion of oligodendrocyte-mediated synapse elimination was not directly supported by data in the present manuscript. Because we are actively analyzing this question in a separate, follow-up study, we have deleted the speculative passage to keep the current paper focused on the demonstrated, myelination-dependent effects. We believe this change sharpens the mechanistic narrative and fully addresses the reviewer’s concern.

      (4) It would be valuable to investigate the secondary effects of oligodendrocyte depletion on other glial cells, particularly astrocytes or microglia, which could influence long-term behavioral outcomes. Identifying whether the lasting effects stem from developmental oligodendrocyte function alone or also involve myelination could deepen the study's insights. 

      We thank the reviewer for raising this point and have performed the requested analyses. Using IBA1 immunostaining for microglia and S100b for Bergmann glia, we quantified cell density and marker signal intensity at P14 and P21. Neither microglial nor Bergmann-glial measures differed between control and oligodendrocyte-ablated mice at either time point (new Figure S2). These results indicate that the behavioral phenotypes we report are unlikely to arise from secondary activation or loss of other glial populations.

      We have now added these results (lines 275–286) and also discuss myelination and other oligodendrocyte functions (lines 443–450). It remains difficult to disentangle conduction-related effects from myelination-independent trophic roles of oligodendrocytes. We therefore note explicitly that future work employing stage-specific genetic tools or acute metabolic manipulations will be required to parse these contributions more definitively.

      (5) The authors should explore the use of different methods to disturb myelin production for a longer time, in order to further determine if the observed effects are transient or if they could have longer-lasting effects.

      We agree that distinguishing transient from enduring effects is critical. Importantly, our original submission already included data demonstrating a persistent deficit of PC population synchrony (Fig. 4, previous Fig. 3): (i) at P14—the early age after oligodendrocyte ablation—population synchrony is reduced, and (ii) the same deficit is still present in adults (P60–P70) despite full recovery of ASPA-positive cell density and MBP-area and -thickness (Fig. 2H-K, Fig. S1E, and Fig. 4). We also performed the ablation of oligodendrocytes after the third postnatal week. Despite a similar acute drop in ASPA-positive cells, neither population synchrony nor anxiety-, motor-, or social behaviors differed from littermate controls. Thus, extending myelin disruption beyond the developmental window does not exacerbate or prolong the phenotype, whereas a short perturbation within that window leaves a permanent timing defect. These findings strengthen our conclusion that it is the developmental oligodendrocyte/myelination program itself—rather than ongoing adult myelin production—that is essential for establishing stable network synchrony. We now highlight this point explicitly in the revised Discussion (lines 507–522).

      (6) Throughout the paper, there are concerns about statistical analyses, particularly on the use of the Mann-Whitney test or using fields of view as biological replicates.

      We appreciate the reviewer’s guidance on appropriate statistical treatment. To address these concerns we have re-analyzed all datasets that contained multiple measurements per animal (e.g., fields of view, lobules, or trials) using nested statistics with animal as the higher-order unit. Specifically, we applied a two-level nested ANOVA when more than two groups were compared and a nested t-test when two conditions were present. The re-analysis confirmed all original conclusions. Because the nested models yielded comparable effect sizes to the Mann–Whitney tests, we have retained the mean ± SEM for ease of comparison with prior literature but now also report all values for each mouse in Table 1. In cases where a single measurement per mouse was compared between two groups, we used the Mann–Whitney test and present the results in the graphs as median values.
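
      To make the unit-of-analysis point concrete, the following is a minimal sketch of a per-animal summary analysis, the simpler alternative the reviewer mentions in which each animal counts as one biological replicate. It is not the nested ANOVA or nested t-test reported above. All animal IDs and synchrony values are hypothetical, and the helper names `animal_means` and `welch_t` are introduced here for illustration only.

```python
from math import sqrt
from statistics import mean, stdev


def animal_means(measurements):
    """Collapse repeated measurements to one mean per animal.

    measurements: dict mapping animal ID -> list of values
    (for example, one value per field of view).
    """
    return [mean(values) for values in measurements.values()]


def welch_t(group_a, group_b):
    """Welch's t statistic and approximate degrees of freedom
    for two lists of animal-level means."""
    va = stdev(group_a) ** 2 / len(group_a)
    vb = stdev(group_b) ** 2 / len(group_b)
    t = (mean(group_a) - mean(group_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (
        va ** 2 / (len(group_a) - 1) + vb ** 2 / (len(group_b) - 1)
    )
    return t, df


# Hypothetical synchrony values: several fields of view per mouse.
control = {"c1": [0.42, 0.40, 0.44], "c2": [0.38, 0.36], "c3": [0.45, 0.47, 0.43]}
ablated = {"d1": [0.25, 0.27], "d2": [0.30, 0.28, 0.32], "d3": [0.22, 0.24]}

# The sample size is the number of animals, not the number of fields of view.
t_stat, dof = welch_t(animal_means(control), animal_means(ablated))
print(f"n = {len(control)} vs {len(ablated)} animals, t = {t_stat:.2f}, df = {dof:.1f}")
```

      Averaging within each animal discards the within-animal distribution, which is the cost the reviewer notes; a nested or mixed-effects model keeps that information while still treating the animal as the higher-order unit.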

      Major

      (1) The authors present compelling evidence that early loss of myelination disrupts synchronous firing prematurely. However, synchronous neuronal firing does not equate to circuit synchronization. It is improbable that myelination directly generates synchronous firing in Purkinje cells (PCs). For instance, Foran et al. (1992) identified that cerebellar myelination begins around postnatal day 6 (P6), while Good et al. (2017) recorded a developmental decline in PC activity correlation from P5-P11. To clarify myelin's role, we recommend detailed myelin imaging through light microscopy (MBP staining at higher magnification) to assess the extent of myelin removal accurately. Myelin sheaths, as shown by Snaidero et al. (2020), can persist after oligodendrocyte (OL) death, particularly following DTA induction (Pohl et al. 2011). Quantification of MBP+ area, rather than mean MBP intensity, is necessary to accurately measure myelin coverage.

      We appreciate the reviewer’s concern that residual sheaths might remain after oligodendrocyte ablation and have therefore re-examined myelin at higher spatial resolution. Two independent metrics were extracted: MBP⁺ area fraction in the white matter and MBP⁺ bundle thickness (new Figure 1J, K, and Fig. S1E). We confirm a robust, transient loss of myelin at P10 and P14, as shown by the reduction of MBP⁺ area and MBP⁺ bundle thickness. Both parameters recovered to control values by P21 and adulthood, indicating effective remyelination. These data demonstrate that, in our paradigm, oligodendrocyte ablation is accompanied by substantial sheath loss rather than the persistent myelin reported after acute toxin exposure. We have added these data to the Results (lines 266–271).

      The results reinforce the view that myelin removal and/or loss of trophic support during a narrow developmental window drive the long-term hyposynchrony and behavioral phenotypes we report. We have added a new subsection to the Discussion (lines 443–450) in which we propose a two-phase model. Phase I (P3–P8): High early synchrony is generated by non-myelin mechanisms (e.g., transient gap junctions, shared climbing-fiber input). Phase II (P8 onward): As oligodendrocytes proliferate and ensheath axons, they fine-tune conduction velocity and stabilize the mature, low-synchrony network state. We believe these additions fully address the reviewer’s concerns.

      (2) Surprisingly, the authors speculate about oligodendrocyte-mediated synaptic pruning without supportive data, shifting the focus away from the potential impact of myelination. Even if OLs perform synaptic pruning, OL depletion would likely maintain synchrony, yet the opposite was observed. Further characterisation of the model and the potential source of the effect is needed. 

      We thank the reviewer for pointing out that our original discussion of oligodendrocyte-mediated synapse elimination was not directly supported by data in the present manuscript. Because we are actively analyzing this question in a separate, follow-up study, we have deleted the speculative passage to keep the current paper focused on the demonstrated, myelination-dependent effects. We believe this change sharpens the mechanistic narrative and fully addresses the reviewer’s concern.

      (3) Improved characterization of the DTA model would add clarity. Although almost all infected cells are reported as OLs, quantification of infected OL-lineage cells (e.g., via Olig2 staining) would verify this. It remains possible that observed activity changes are driven by OL-independent demyelination effects. We suggest cross-staining with Iba1 and GFAP to rule out inflammation or gliosis. 

      We thank the reviewer for this important suggestion and have expanded our histological characterization accordingly. First, to verify that DTA expression is confined to mature oligodendrocytes, we co-stained cerebellar sections collected 7 days after AAV-hMAG-mCherry injection with Olig2 (pan-OL lineage) and ASPA (mature OL marker), as shown in Figure S1C-D. Quantitative analysis revealed that 100% of mCherry⁺ cells were Olig2⁺/ASPA⁺, whereas mCherry signal was virtually absent in Olig2⁺/ASPA⁻ immature oligodendrocytes. These data confirm that our DTA manipulation targets mature myelinating OLs rather than earlier lineage stages. We have added these data to the Results (lines 260–262).

      Second, to examine indirect effects mediated by other glia, we performed cross-staining with IBA1 (microglia) and S100β (Bergmann glia). Cell density and fluorescence intensity for each marker were indistinguishable between control and DTA groups at P14 and P21 (Figure S2A-H). Thus, neither inflammation nor astro-/microgliosis accompanies OL ablation. We have added these data to the Results (lines 275–286).

      Collectively, these results demonstrate that the observed desynchronization and behavioral phenotypes arise from specific loss of mature oligodendrocytes and their myelin, rather than from off-target viral expression or secondary glial responses.

      (4) The use of an independent model of myelin loss, such as the inducible Myrf knockout mouse with a MAG promoter, to assess if oligodendrocyte loss causes temporary or sustained impacts, employing an extended knockout model like Myrf cKO with MAG-Cre viral methods would be advantageous.

      We agree that distinguishing transient from enduring effects is critical. Importantly, our original submission already included data demonstrating a persistent deficit of PC population synchrony (Fig. 4, previous Fig. 3): (i) at P13-15—the early age after oligodendrocyte ablation—population synchrony is reduced, and (ii) the same deficit is still present in adults (P60–P70) despite full recovery of ASPA-positive cell density and MBP-area and -thickness (Fig. 2H-K, Fig. S1E, and Fig. 4). We also performed the ablation of oligodendrocytes after the third postnatal week. Despite a similar acute drop in ASPA-positive cells, neither population synchrony nor anxiety-, motor-, or social behaviors differed from littermate controls. Thus, extending myelin disruption beyond the developmental window does not exacerbate or prolong the phenotype, whereas a short perturbation within that window leaves a permanent timing defect. These findings strengthen our conclusion that it is the developmental oligodendrocyte/myelination program itself—rather than ongoing adult myelin production—that is essential for establishing stable network synchrony. We now highlight this point explicitly in the revised Discussion (lines 507–522).

      (5) For statistical robustness, the use of non-parametric tests (Mann-Whitney) necessitates reporting the median instead of the mean as the authors do. Furthermore, as repeated measurements within the same animal are not independent, the authors should ideally use nested ANOVA (or nested t-test comparing two conditions) to validate their findings (Aarts et al., Nat. Neuroscience 2014). Alternatively use one-way ANOVA with each animal as a biological replicate, although this means that the distribution in the data sets per animal is lost.

      We appreciate the reviewer’s guidance on appropriate statistical treatment. To address these concerns we have re-analyzed all datasets that contained multiple measurements per animal (e.g., fields of view, lobules, or trials) using nested statistics with animal as the higher-order unit. Specifically, we applied a two-level nested ANOVA when more than two groups were compared and a nested t-test when two conditions were present. The re-analysis confirmed all original conclusions. Because the nested models yielded comparable effect sizes to the Mann–Whitney tests, we have retained the mean ± SEM for ease of comparison with prior literature but now also report all values for each mouse in Table 1. In cases where a single measurement per mouse was compared between two groups, we used the Mann–Whitney test and present the results in the graphs as median values.

      Minor Points 

      (1) In all figures, please specify the ages at which each procedure was conducted, as demonstrated in Figure 2A.

      All main and supplementary figures now specify the exact postnatal age.

      (2) Clarify the selection criteria for regions of interest (ROI) in calcium imaging, and provide representative ROIs.

      We appreciate the reviewer’s guidance. We have clarified that our ROI detection followed the protocol reported in our previous paper (Tanigawa et al., 2024, Communications Biology) (lines 177-178), and representative Purkinje cell ROIs are now shown in Fig. 2B.

      (3) Include data on the proportion of climbing fiber or inferior olive neurons expressing Kir and the total number of neurons transfected, which would help contextualize the observed effects on PC synchronization and its broader implications for cerebellar circuit function.

      We appreciate the reviewer’s guidance. New Fig. 7C summarizes the efficiency of AAV-GFP and AAV-Kir2.1-GFP injections into the inferior olive. Across 4 mice, PCs with GFP-labeled CFs accounted for 19.3 ± 11.9% of PCs in the control group and 26.2 ± 11.8% of PCs in the Kir2.1 group (mean ± S.D.). These numbers are reported in the Results (lines 373–375).

      (4) Higher magnification images in Figures 1 and S3 would improve visual clarity. 

      We have addressed the request for higher-magnification images in two ways. First, all panels in Figure S3 were placed on a larger canvas. Second, in Figure 1 we adjusted panel sizes to emphasize fine structure: panel 1C already represents an enlargement of the RFP-positive cells shown in 1B, and panels 1H and 1J now occupy a wider span so that every ASPA-positive cell body can be distinguished. Should the reviewer still require an even closer view, we have additional images ready for upload.

      (5) Consider language editing to enhance overall clarity and readability.

      The entire manuscript was edited to improve flow, consistency, and readability.

      (6) Refine the discussion to align with the presented data.

      We have refined the discussion.

      Thank you once again for your constructive suggestions and comments. We believe these changes have improved the clarity and readability of our manuscript.

      Reviewer #2 (Public review):

      We appreciate Reviewer #2’s positive evaluation of our work and thank him/her for the constructive suggestions and comments. We followed these suggestions and comments and have conducted additional experiments. We have rewritten the manuscript and revised the figures according to the points Reviewer #2 mentioned. Our point-by-point responses to the comments are as follows.

      Summary:

      In this manuscript, the authors use genetic tools to ablate oligodendrocytes in the cerebellum during postnatal development. They show that the oligodendrocyte numbers return to normal post-weaning. Yet, the loss of oligodendrocytes during development seems to result in decreased synchrony of calcium transients in Purkinje neurons across the cerebellum. Further, there were deficits in social behaviors and motor coordination. Finally, they suppress activity in a subset of climbing fibers to show that it results in similar phenotypes in the calcium signaling and behavioral assays. They conclude that the behavioral deficits in the oligodendrocyte ablation experiments must result from loss of synchrony. 

      Strengths:

      Use of genetic tools to induce perturbations in a spatiotemporally specific manner.

      We appreciate this positive evaluation.

      Weaknesses: 

      The main weakness in this manuscript is the lack of a cohesive causal connection between the experimental manipulation performed and the phenotypes observed. Though they have taken great care to induce oligodendrocyte loss specifically in the cerebellum and at specific time windows, the subsequent experiments do not address specific questions regarding the effect of this manipulation.

      Calcium transients in Purkinje neurons are caused to a large extent by climbing fibers, but there is evidence for simple spikes to also underlie the dF/F signatures (Ramirez and Stell, Cell Reports, 2016).

      We thank the reviewer for drawing attention to the work of Ramirez & Stell (2016), which showed that simple-spike bursts can elicit Ca²⁺ rises, but only in the soma and proximal dendrites of adult Purkinje cells. In our study, Regions of Interest were restricted to the dendritic arbor, where SS-evoked signals are essentially undetectable (Ramirez & Stell, 2016), whereas climbing-fiber complex spikes generate large, all-or-none transients (Good et al., 2017). Accordingly, even if a rare SS-driven event reached threshold it would likely fall outside our ROIs.

      In addition, we directly imaged CF population activity by expressing GCaMP7f in inferior-olive neurons. Correlation analysis of CF boutons revealed that DTA ablation lowers CF–CF synchrony at P14 (new Fig. 3A–D). Because CF synchrony is a principal driver of Purkinje-cell co-activation, this reduction provides a mechanistic link between oligodendrocyte loss and the hyposynchrony we observe among Purkinje cells. Consistent with this interpretation, electrophysiological recordings showed that parallel-fiber EPSCs and inhibitory synaptic inputs onto Purkinje cells were unchanged by DTA treatment (Fig. 3E-H), which makes it less likely that the reduced synchrony simply reflects changes in other excitatory or inhibitory synaptic drive.
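
      As an illustration of what a correlation-based synchrony measure can look like, the sketch below computes the mean pairwise Pearson correlation across a set of activity traces. The exact metric used in the manuscript is not specified in this response, so the choice of Pearson correlation, the function names, and the toy traces are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length traces."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_pairwise_correlation(traces):
    """Average Pearson correlation over all unique pairs of traces."""
    pairs = list(combinations(traces, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Hypothetical dF/F traces (one list per bouton or cell, one value per frame).
synchronous = [
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0.9, 0, 0.1, 1, 0],
]
desynchronized = [
    [0, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1],
]

print(f"synchronous:    {mean_pairwise_correlation(synchronous):.2f}")
print(f"desynchronized: {mean_pairwise_correlation(desynchronized):.2f}")
```

      A population whose events coincide in time yields a value near 1, whereas events that never coincide pull the index toward or below zero.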

      That said, SS-dependent somatic Ca²⁺ signals could still influence downstream plasticity and long-term cerebellar function. In future work we therefore plan to combine somatic imaging with stage-specific oligodendrocyte manipulations to test whether SS-evoked Ca²⁺ dynamics are themselves modulated by oligodendrocyte support. We have added these descriptions in the Results (lines 288–294) and Discussion (lines 423–434).

      Also, it is erroneous to categorize these calcium signals as signatures of "spontaneous activity" of Purkinje neurons as they can have dual origins.

      Thank you for pointing out the potential ambiguity. In the revised manuscript we have clarified how we use the term “spontaneous activity” in the context of our measurements (lines 289-290). Our calcium imaging was restricted to the dendritic arbor of Purkinje cells, where calcium transients are dominated by climbing-fiber (CF) inputs (Ramirez & Stell, 2016; Good et al., 2017). Thus, the synchrony values reported here primarily reflect CF-driven complex spikes rather than mixed signals of dual origin. We have revised the Results section accordingly (lines 289–293) to make this measurement-specific limitation explicit.

      Further, the effect of developmental oligodendrocyte ablation on the cerebellum has been previously reported by Mathis et al., Development, 2003. They report very severe effects such as the loss of molecular layer interneurons, stunted Purkinje neuron dendritic arbors, abnormal foliations, etc. In this context, it is hardly surprising that one would observe a reduction of synchrony in Purkinje neurons (perhaps due to loss of synaptic contacts, not only from CFs but also from granule cells).

      We appreciate the reviewer’s comparison to Mathis et al. (2003). Mathis et al. used MBP–HSV-TK transgenic mice in which systemic FIAU treatment eliminates oligodendrocytes. When ablation began at P1, they observed severe dysmorphology—loss of molecular-layer interneurons, Purkinje-cell (PC) dendritic stunting, and abnormal foliation. Crucially, however, the same study reports that starting the ablation later (FIAU from P6-P20) left cerebellar cyto-architecture entirely normal.

      Our AAV MAG-DTA paradigm resembles this later window. Our temporally restricted DTA protocol produces the same ‘late-onset’ profile: robust yet reversible hypomyelination with no loss of Purkinje cells, interneurons, dendritic length, or synaptic input (new Fig. S1–S2, Fig. 3E-H). The enduring hyposynchrony we report therefore cannot be attributed to the dramatic anatomical defects seen after ablation from P1, but instead reveals a specific requirement for early-postnatal myelin in stabilizing PC synchrony, especially affecting CF-CF synchrony.

      This clarification shows that we have carefully considered the Mathis model and that our findings not only replicate but also extend the earlier work. We have added this description to the Results (lines 273-286).

      The last experiment with the expression of Kir2.1 in the inferior olive is hardly convincing.

      We appreciate the reviewer’s concern and have reinforced the causal link between Purkinje-cell synchrony and behavior. To test whether restoring PC synchrony is sufficient to rescue behavior, we introduced a red-shifted opsin (AAV-L7-rsChrimine) into PCs of DTA mice raised to adulthood. During testing we delivered 590-nm light pulses (10 ms, 1 Hz) to the vermis, driving brief, population-wide spiking (new Fig. 8). This periodic re-synchronization left anxiety measures unchanged (open-field center time remained low) but rescued both motor coordination (rotarod latency normalized to control levels) and sociability (time spent with a novel mouse restored). The dissociation implies that distinct behavioral domains differ in their sensitivity to PC timing precision and confirms that reduced synchrony, not cell loss or gross circuit damage (Fig. S1F, S2), is the primary driver of the motor and social deficits. Together, the optogenetic rescue establishes a bidirectional, mechanistic link between PC synchrony and behavior, addressing the reviewer’s reservations about the original experiment. We have added these descriptions to the Results (lines 394-415).

      In summary, while the authors used a specific tool to probe the role of developmental oligodendrocytes in cerebellar physiology and function, they failed to answer specific questions regarding this role, which they could have done with more fine-grained experimental analysis.

      Thank you once again for your constructive suggestions and comments. We believe these changes have improved the clarity and readability of our manuscript.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (1) Show that ODC loss is specific to the cerebellum.

      We thank the reviewer for requesting additional evidence. To verify that oligodendrocyte ablation was confined to the cerebellum, we injected an AAV carrying mCherry under the human MAG promoter (AAV-hMAG-mCherry) into the cerebellum and screened the whole brain one week later. As shown in the new Figure 1E–G, mCherry-positive cells were present throughout the injected cerebellar cortex (Fig. 1E), but no fluorescent cells were detected in extracerebellar regions, including cerebral cortex, medulla, pons, and midbrain. These data demonstrate that our viral approach is specific to the cerebellum, ruling out off-target demyelination elsewhere in the CNS as a contributor to the behavioral and synchrony phenotypes. We have added these descriptions to the Results (lines 262-264).

      (2) Characterize the gross morphology of the cerebellum at different developmental stages. Are major cell types all present? Major pathways preserved? 

      We thank the reviewer for requesting additional evidence. To ensure that the developmental loss of oligodendrocytes did not globally disturb cerebellar architecture, we performed a comprehensive histological and electrophysiological survey during development. New data are presented (new Fig. S1–S2, Fig. 3E-H).

      (1) Overall morphology. Low-magnification parvalbumin counterstaining revealed similar cerebellar area in DTA versus control mice at every age (Fig. S1F, G).

      (2) Major neuronal classes. Quantification of parvalbumin-positive Purkinje cells and interneurons showed no differences in density between control and DTA (Fig. S2E, H, M, N, P). Purkinje dendritic arbors were not different between control and DTA (Fig. S2G, O).

      (3) Excitatory and inhibitory synaptic inputs. Miniature IPSCs and parallel-fiber EPSCs onto Purkinje cells were quantified; neither differed between groups (Fig. 3E-G).

      (4) Glial populations. IBA1-positive microglia and S100β-positive astrocytes exhibited normal density and marker intensity (Fig. S2).

      Taken together, these analyses show that all major cell types are present at normal density, synaptic inputs from excitatory and inhibitory neurons are preserved, and gross cerebellar morphology is intact after DTA-mediated oligodendrocyte ablation.

      (3) Recording of PNs to see whether the lack of synchrony is due to CFs or simple spikes.

      We thank the reviewer for drawing attention to the work of Ramirez & Stell (2016), which showed that simple-spike bursts can elicit Ca²⁺ rises, but only in the soma and proximal dendrites of adult Purkinje cells. In our study, Regions of Interest were restricted to the dendritic arbor, where SS-evoked signals are essentially undetectable (Ramirez & Stell, 2016), whereas climbing-fiber complex spikes generate large, all-or-none transients (Good et al., 2017). Accordingly, even if a rare SS-driven event reached threshold it would likely fall outside our ROIs.

      In addition, we directly imaged CF population activity by expressing GCaMP7f in inferior-olive neurons. Correlation analysis of CF boutons revealed that DTA ablation lowers CF–CF synchrony at P14 (new Fig. 3A–D). Because CF synchrony is a principal driver of Purkinje-cell co-activation, this reduction provides a mechanistic link between oligodendrocyte loss and the hyposynchrony we observe among Purkinje cells. Consistent with this interpretation, electrophysiological recordings showed that parallel-fiber EPSCs and inhibitory synaptic inputs onto Purkinje cells were unchanged by DTA treatment (Fig. 3E-H), which makes it less likely that the reduced synchrony simply reflects changes in other excitatory or inhibitory synaptic drive.

      That said, SS-dependent somatic Ca²⁺ signals could still influence downstream plasticity and long-term cerebellar function. In future work we therefore plan to combine somatic imaging with stage-specific oligodendrocyte manipulations to test whether SS-evoked Ca²⁺ dynamics are themselves modulated by oligodendrocyte support. We have added these descriptions in the Results (lines 301–312) and Discussion (lines 423–434).

      (4) Is CF synapse elimination altered? Test using evoked EPSCs or staining methods.

      We agree that directly testing whether oligodendrocyte loss disturbs climbing-fiber synapse elimination would provide a full mechanistic picture. We are already quantifying climbing fiber terminal number with vGluT2 immunostaining and recording evoked CF-EPSCs in the same DTA model; these data, together with an analysis of how population synchrony is involved in synapse elimination, will form the basis of a separate manuscript now in preparation. To keep the present paper focused on the phenomena we have rigorously documented—transient oligodendrocyte loss and the resulting long-lasting hyposynchrony and abnormal behaviors—we have removed the speculative sentence on oligodendrocyte-mediated synapse elimination. We believe this revision meets the reviewer’s request without over-extending the current dataset.

      Thank you once again for your constructive suggestions and comments. We believe these changes have improved the clarity and readability of our manuscript.

    1. Author response:

      Reviewer #1

      (1) The main weakness is that the study is wholly in vitro, using cultured hippocampal neurons.

      We appreciate this reviewer's concern about the limitation of cultured hippocampal neurons in extracting disease-related spine phenotypes. While we fully recognize this limitation, we consider that this in vitro system has several advantages that contribute to translational research on mental disorders.

      First, our culture system has been shown to support the development of spine morphology similar to that of the hippocampal CA1 excitatory synapse in vivo. High-resolution imaging techniques confirmed that the in vitro spine structure was highly preserved compared with in vivo preparations (Kashiwagi et al., Nature Communications, 2019). The present study used the same culture system and SIM imaging. Therefore, the difference we detected in samples derived from disease models is likely to reflect impairment of molecular mechanisms underlying native structural development in vivo.

      Second, super-resolution imaging of thousands of spines in tissue preparations under precisely controlled conditions cannot be practically applied using currently available techniques. The advantage of our imaging and analytical pipeline is its reproducibility, which enabled us to compare the spine population data from eight different mouse models without normalization.

      Third, a reduced culture system can demonstrate the direct effects of gene mutations on synapse phenotypes, independent of environmental influences. This property is highly advantageous for screening chemical compounds that rescue spine phenotypes. Neuronal firing patterns and receptor functions can also be easily controlled in a culture system. The difference in spine structure between ASD and schizophrenia mouse models is valuable information to establish a drug screening system.

      Fourth, establishing an in vitro system for evaluating synapse phenotypes could reduce the need for animal experiments. Researchers should be aware of the 3Rs principles. In the future, combined with differentiation techniques for human iPS cells, our in vitro approach will enable the evaluation of disease-related spine phenotypes without the need for animal experiments. The effort to establish a reliable culture system should not be eliminated.

      (2) Another weakness is that CaMKIIαK42R/K42R mutant mice are presented as a schizophrenia model.

      We agree with this reviewer that CAMK2A mutations in humans are linked to multiple mental disorders, including developmental disorders, ASD, and schizophrenia. Association of gene mutations with the categories of mental disorders is not straightforward, as the symptoms of these disorders also overlap with each other. For the CaMKIIα K42R/K42R mutant, we considered the following points in its characterization as a model of mental disorder. Analysis of CaMKIIα +/- mice in Dr. Tsuyoshi Miyakawa's lab has provided evidence for the reduced CaMKIIα in schizophrenia-related phenotypes (Yamasaki et al., Mol Brain 2008; Frankland et al., Mol Brain Editorial 2008). It is also known that the CaMKIIα R8H mutation in the kinase domain is linked to schizophrenia (Brown et al., 2021). Both CaMKIIα R8H and CaMKIIα K42R mutations are located in the N-terminal domain and eliminate kinase activity. On the other hand, the representative CaMKIIα E183V mutation identified in ASD patients exhibits unique characteristics, including reduced kinase activity, decreased protein stability and expression levels, and disrupted interactions with ASD-associated proteins such as Shank3 (Stephenson et al., 2017). Importantly, reduced dendritic spines in neurons expressing CaMKIIα E183V is a property opposite to that of the CaMKIIα K42R/K42R mutant, which showed increased spine density (Koeberle et al. 2017).

      Different CAMK2A mutations likely cause distinct phenotypes observed in the broad spectrum of mental disorders. In the revised manuscript, we will include a discussion of the relevant literature to categorize this mouse model appropriately.

      References related to this discussion.

      (1) Yamasaki et al., Mol Brain. 2008 DOI: 10.1186/1756-6606-1-6

      (2) Frankland et al. Mol Brain. 2008 DOI: 10.1186/1756-6606-1-5

      (3) Stephenson et al., J Neurosci. 2017 DOI: 10.1523/JNEUROSCI.2068-16.2017

      (4) Koeberle et al. Sci Rep. 2017 DOI: 10.1038/s41598-017-13728-y

      (5) Brown et al., iScience. 2021 DOI: 10.1016/j.isci.2021.103184

      Reviewer #2

      We recognize the reviewer's comments as important for improving our manuscript. We outline our general approach to addressing major concerns. Detailed responses to each point, along with additional data, will be provided in a formal revised manuscript.

      (1) Demonstrating the robustness of statistical analyses

      We appreciate this reviewer's concern about our strategies for the quantitative analysis of the large spine population. For the PCA analysis (Point 2), our preliminary results indicated that including all parameters or the selected five parameters did not make a significant difference in the relative placement of spines with specific morphologies in the feature space defined by the principal components. This point will be discussed in the revised manuscript. The potential problem of selecting a particular region within a feature space for spine shape analysis (Point 1) can be addressed by using alternative simulation-based approaches, such as bootstrap or permutation tests. These analyses will be included in the revised manuscript. The use of sample numbers in statistical analyses should align with the analysis's purpose (Point 3). When analyzing the distribution of samples in the feature space, it is necessary to use spine numbers for statistical assessment. We will recheck the statistical methods and apply the appropriate method for each analysis. The spine population data in Figures 2 and 8 cannot be directly compared, as the spine visualization methods differ (Figure 2 with membrane DiI labeling; Figure 8 with cytoplasmic GFP labeling) (Point 9). Spine populations of the same size are inevitably plotted in different feature spaces. This point will be discussed more clearly in the revised manuscript.
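
A permutation test of the kind mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis planned for the revised manuscript: each spine is reduced to a hypothetical 0/1 flag for whether it falls inside a chosen region of the feature space, group labels are shuffled at the level of individual spines, and the function names and counts are invented for the example.

```python
import random


def fraction_in_region(flags):
    """Fraction of spines flagged (1) as lying inside the region of interest."""
    return sum(flags) / len(flags)


def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in fractions.

    Group labels are shuffled n_perm times; the p-value is the share of
    shuffles whose absolute difference is at least as large as observed.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(fraction_in_region(group_a) - fraction_in_region(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(fraction_in_region(pooled[:n_a]) - fraction_in_region(pooled[n_a:]))
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical counts: 10 of 100 control spines vs 30 of 100 model spines in the region.
control = [1] * 10 + [0] * 90
model = [1] * 30 + [0] * 70

p = permutation_p_value(control, model)
print(f"permutation p-value: {p:.4f}")
```

Shuffling individual spines treats every spine as independent. If the intended unit of analysis is the culture or the animal, the same procedure can be applied by shuffling labels at that level instead.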

      (2) Clarification of experimental conditions and data reliability

      Per this reviewer's suggestion, we will provide more information on the genetic background of mice and the differences in spine structure from DIV 18-22 (Points 4 and 5). We will also provide additional validation data for the functional analyses using knockdown and overexpression methods, for which we already have preliminary data (Point 7). Concerns about the interpretation of data obtained from in vitro culture (Point 12), raised by this reviewer, are also noted by reviewer #1. As explained in the response to reviewer #1, we intentionally selected an in vitro culture system to analyze multiple samples derived from mouse models of mental disorders for several reasons. Nevertheless, we will revise the discussion and incorporate the points this reviewer raised regarding the disadvantages of in vitro systems.

      (3) Validation of biological mechanisms and interpretation

      In the computational modeling (Point 6), we started from the data of spine turnover (excluding the data of spine volume increase/decrease), fitted the model with the data, and found that the best-fit model showed three features: fast spine turnover, lower spine density, and smaller size of transient spines in schizophrenia mouse models. As the reviewer noted, information about spine turnover is already present in the input data. However, the other two properties are generated independently of the input data, indicating the value of this model. We plan to add additional confirmatory analyses to this model in the revised manuscript.

      In response to Point 8, we will provide supporting data on the functional role of Ecgr4 in synapse regulation. We will also refine our discussion on the ASD and Schizophrenia phenotypes based on the suggested literature (Points 10 and 11). Quantification of the initial growth of spines is technically demanding, as it requires higher imaging frequency and longer time-lapse recordings to capture rare events. It is difficult to conclude which of the two possibilities, slow spine growth or initial size differences, is correct, based on our available data. This point will be discussed in the revised manuscript (Point 13).

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This study examines a valuable question regarding the developmental trajectory of neural mechanisms supporting facial expression processing. Leveraging a rare intracranial EEG (iEEG) dataset including both children and adults, the authors reported that facial expression recognition mainly engaged the posterior superior temporal cortex (pSTC) among children, while both pSTC and the prefrontal cortex were engaged among adults. However, the sample size is relatively small, with analyses appearing incomplete to fully support the primary claims. 

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study investigates how the brain processes facial expressions across development by analyzing intracranial EEG (iEEG) data from children (ages 5-10) and post-childhood individuals (ages 13-55). The researchers used a short film containing emotional facial expressions and applied AI-based models to decode brain responses to facial emotions. They found that in children, facial emotion information is represented primarily in the posterior superior temporal cortex (pSTC) - a sensory processing area - but not in the dorsolateral prefrontal cortex (DLPFC), which is involved in higher-level social cognition. In contrast, post-childhood individuals showed emotion encoding in both regions. Importantly, the complexity of emotions encoded in the pSTC increased with age, particularly for socially nuanced emotions like embarrassment, guilt, and pride. The authors claim that these findings suggest that emotion recognition matures through increasing involvement of the prefrontal cortex, supporting a developmental trajectory where top-down modulation enhances understanding of complex emotions as children grow older.

      Strengths:

      (1) The inclusion of pediatric iEEG makes this study uniquely positioned to offer high-resolution temporal and spatial insights into neural development compared to non-invasive approaches, e.g., fMRI, scalp EEG, etc.

      (2) Using a naturalistic film paradigm enhances ecological validity compared to static image tasks often used in emotion studies.

      (3) The idea of using state-of-the-art AI models to extract facial emotion features allows for high-dimensional and dynamic emotion labeling in real time.

      Weaknesses:

      (1) The study has notable limitations that constrain the generalizability and depth of its conclusions. The sample size was very small, with only nine children included and just two having sufficient electrode coverage in the posterior superior temporal cortex (pSTC), which weakens the reliability and statistical power of the findings, especially for analyses involving age.

      We appreciate the reviewer’s point regarding the constrained sample size.

      Because iEEG is an invasive method, recordings can only be obtained from patients undergoing electrode implantation for clinical purposes. Thus, iEEG data from young children are extremely rare, and rapidly increasing the sample size within a few years is not feasible. However, we are confident in the reliability of our main conclusions. Specifically, 8 children (53 recording contacts in total) and 13 control participants (99 recording contacts in total) with electrode coverage in the DLPFC are included in our DLPFC analysis. This sample size is comparable to other iEEG studies with similar experimental designs [1-3].

      For pSTC, we returned to the data set and found another two children who had pSTC coverage. After including these children’s data, the group-level analysis using a permutation test showed that children’s pSTC significantly encodes facial emotion in naturalistic contexts (Figure 3B). Notably, the responses of the two new children (S33 and S49) were highly consistent with our previous observations. Moreover, the averaged prediction accuracy in children’s pSTC (r<sub>speech</sub>=0.1565) was highly comparable to that in the post-childhood group (r<sub>speech</sub>=0.1515).

      (1) Zheng, J. et al. Multiplexing of Theta and Alpha Rhythms in the Amygdala-Hippocampal Circuit Supports Pattern Separation of Emotional Information. Neuron 102, 887-898.e5 (2019).

      (2) Diamond, J. M. et al. Focal seizures induce spatiotemporally organized spiking activity in the human cortex. Nat. Commun. 15, 7075 (2024).

      (3) Schrouff, J. et al. Fast temporal dynamics and causal relevance of face processing in the human temporal cortex. Nat. Commun. 11, 656 (2020).

      (2) Electrode coverage was also uneven across brain regions, with not all participants having electrodes in both the dorsolateral prefrontal cortex (DLPFC) and pSTC, and most coverage limited to the left hemisphere, hindering within-subject comparisons and limiting insights into lateralization.

      The electrode coverage in each patient is determined entirely by clinical needs. Only a few patients have electrodes in both DLPFC and pSTC because these two regions are far apart, so it is rare for a single patient’s suspected seizure network to span such a large territory. However, this does not affect our results, as most iEEG studies combine data from multiple patients to achieve sufficient electrode coverage in each target brain area. Because our data come mainly from the left hemisphere (owing to clinical needs), this study was not designed to examine whether emotion encoding differs between hemispheres. Nevertheless, lateralization remains an interesting question that should be addressed in future research, and we have noted this limitation in the Discussion (Page 8, in the last paragraph of the Discussion).

      (3) The developmental differences observed were based on cross-sectional comparisons rather than longitudinal data, reducing the ability to draw causal conclusions about developmental trajectories.  

      In the context of pediatric intracranial EEG, longitudinal data collection is not feasible due to the invasive nature of electrode implantation. We have added this point to the Discussion to acknowledge that while our results reveal robust age-related differences in the cortical encoding of facial emotions, longitudinal studies using non-invasive methods will be essential to directly track developmental trajectories (Page 8, in the last paragraph of the Discussion). In addition, we revised our manuscript to avoid emphasizing causal conclusions about developmental trajectories in the current study (for example, we use “imply” instead of “suggest” in the fifth paragraph of the Discussion).

      (4) Moreover, the analysis focused narrowly on DLPFC, neglecting other relevant prefrontal areas such as the orbitofrontal cortex (OFC) and anterior cingulate cortex (ACC), which play key roles in emotion and social processing.

      We agree that both OFC and ACC are critically involved in emotion and social processing. However, we have no recordings from these areas because ECoG rarely covers the ACC or OFC due to technical constraints. We have noted this limitation in the Discussion (Page 8, in the last paragraph of the Discussion). Future follow-up studies using sEEG or non-invasive imaging methods could examine developmental patterns in these regions.

      (5) Although the use of a naturalistic film stimulus enhances ecological validity, it comes at the cost of experimental control, with no behavioral confirmation of the emotions perceived by participants and uncertain model validity for complex emotional expressions in children. A nonfacial music block that could have served as a control was available but not analyzed. 

      The facial emotion features used in our encoding models were extracted by Hume AI models, which were trained on human intensity ratings of large-scale, experimentally controlled emotional expression data [1-2]. Thus, the outputs of the Hume AI model reflect what typical facial expressions convey, that is, the presented facial emotion. The goal of the present study was to examine how facial emotions presented in the videos are encoded in the human brain at different developmental stages. We agree that children’s interpretation of complex emotions may differ from that of adults, resulting in a different perceived emotion (i.e., the emotion that the observer subjectively interprets). Behavioral ratings are necessary to study the encoding of subjectively perceived emotion, which is a very interesting direction but beyond the scope of the present work. We have added a paragraph in the Discussion (see Page 8) to explicitly note that our study focused on the encoding of presented emotion.

      We appreciate the reviewer’s point regarding the value of non-facial music blocks. However, although there are segments in the music condition in which no faces are presented, these cannot be used as a control condition to test whether the encoding model’s prediction accuracy in pSTC or DLPFC drops to chance when no facial emotion is present. This is because, in the absence of faces, no extracted emotion features are available for constructing the encoding model (see Author response image 1 below). Thus, we chose a different control analysis for the present work. For children’s pSTC, we shuffled the facial emotion features in time to generate a null distribution, which was then used to test the statistical significance of the encoding models (see Methods/Encoding model fitting for details).
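
      The time-shuffling control described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the synthetic data, and the use of an in-sample ordinary least-squares fit (rather than whatever regularized, cross-validated model the Methods specify) are all hypothetical choices made for brevity.

```python
import numpy as np

def encoding_accuracy(features, response):
    """Pearson r between an OLS fit of `response` on `features` and the response.

    Assumption: an in-sample fit is used only to keep the sketch short.
    """
    X = np.column_stack([np.ones(len(features)), features])
    beta, *_ = np.linalg.lstsq(X, response, rcond=None)
    return np.corrcoef(X @ beta, response)[0, 1]

def time_shuffle_test(features, response, n_perm=1000, seed=0):
    """Null distribution obtained by shuffling the feature rows in time."""
    rng = np.random.default_rng(seed)
    observed = encoding_accuracy(features, response)
    null = np.empty(n_perm)
    for i in range(n_perm):
        # Permuting time points breaks the feature-response alignment.
        shuffled = features[rng.permutation(len(features))]
        null[i] = encoding_accuracy(shuffled, response)
    # One-sided p-value with the usual +1 correction.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, null, p_value

# Hypothetical data: 200 time points, 5 emotion features, one neural response.
rng = np.random.default_rng(1)
features = rng.normal(size=(200, 5))
weights = np.array([0.5, -0.3, 0.0, 0.2, 0.1])
response = features @ weights + rng.normal(scale=1.0, size=200)
observed, null, p_value = time_shuffle_test(features, response, n_perm=500)
```

      Fully permuting time points ignores temporal autocorrelation in both signals; circular shifts or block shuffles are common alternatives when that matters.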

      (1) Brooks, J. A. et al. Deep learning reveals what facial expressions mean to people in different cultures. iScience 27, 109175 (2024).

      (2) Brooks, J. A. et al. Deep learning reveals what vocal bursts express in different cultures. Nat. Hum. Behav. 7, 240–250 (2023).

      Author response image 1.

      Time courses of facial expression features extracted by Hume AI for the first block of the music condition. Only the top 5 facial expressions are shown here due to space limitations.

      (6) Generalizability is further limited by the fact that all participants were neurosurgical patients, potentially with neurological conditions such as epilepsy that may influence brain responses. 

      We appreciate the reviewer’s point. However, iEEG data can only be obtained from clinical populations (usually epilepsy patients) who have undergone electrode implantation. Given current knowledge about focal epilepsy and its potential effects on brain activity, researchers believe that epilepsy-affected brains can serve as a reasonable proxy for normal human brains when confounding influences are minimized through rigorous procedures [1]. In our study, we took several steps to ensure data quality: (1) all data segments containing epileptiform discharges were identified and removed at the very beginning of preprocessing, and (2) patients were asked to participate in the experiment several hours outside the window of seizures. Please see the Methods for a description of the data quality checks (Page 9/ Experimental procedures and iEEG data processing).

      (1) Parvizi J, Kastner S. 2018. Promises and limitations of human intracranial electroencephalography. Nat Neurosci 21:474–483. doi:10.1038/s41593-018-0108-2

      (7) Additionally, the high temporal resolution of intracranial EEG was not fully utilized, as data were down-sampled and averaged in 500-ms windows.  

      We agree that one of the major advantages of iEEG is its millisecond-level temporal resolution. In our case, the main reason for down-sampling was that the time series of facial emotion features extracted from the videos, which were used to model neural responses, had a temporal resolution of 2 Hz. In naturalistic contexts, facial emotion features do not change on a millisecond timescale, so a 500 ms window is sufficient to capture the relevant dynamics. Another advantage of iEEG is its tolerance to motion, which is excessive in young children (e.g., 5-year-olds). This makes our dataset uniquely valuable, suggesting robust representation in the pSTC but not in the DLPFC in young children. Moreover, because our methodological framework (Figure 1) does not rely on high temporal resolution, it can be transferred to non-invasive modalities such as fMRI, enabling future studies to test these developmental patterns in larger populations.
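
      To make the 500 ms averaging concrete, the sketch below bins a neural time series into non-overlapping windows so that it matches a 2 Hz feature time series. It is a minimal sketch: the function name `average_in_windows`, the 1000 Hz sampling rate, and the ramp signal are hypothetical and are not taken from the manuscript.

```python
import numpy as np

def average_in_windows(signal, fs, window_s=0.5):
    """Average a 1-D signal in non-overlapping windows of `window_s` seconds."""
    samples_per_window = int(round(fs * window_s))
    n_windows = len(signal) // samples_per_window
    # Drop any trailing samples that do not fill a complete window.
    trimmed = signal[: n_windows * samples_per_window]
    return trimmed.reshape(n_windows, samples_per_window).mean(axis=1)

fs = 1000                              # hypothetical sampling rate in Hz
hfb = np.arange(3000, dtype=float)     # hypothetical 3 s signal (a simple ramp)
binned = average_in_windows(hfb, fs)   # one value per 500 ms, i.e. 2 Hz
```

      With these settings, each output sample summarizes 500 input samples, so a 3 s recording yields six values aligned to the 2 Hz feature time series.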

      (8) Finally, the absence of behavioral measures or eye-tracking data makes it difficult to directly link neural activity to emotional understanding or determine which facial features participants attended to.  

      We appreciate this point. Part of our rationale for the absence of behavioral measures is presented in our response to (5). Following the same rationale, identifying which facial features participants attended to is not necessary for testing our main hypotheses because our analyses examined responses to the overall emotional content of the faces. However, we agree that future studies of subjective emotional understanding should use eye-tracking and corresponding behavioral measures.

      Reviewer #2 (Public review):

      Summary:

      In this paper, Fan et al. aim to characterize how neural representations of facial emotions evolve from childhood to adulthood. Using intracranial EEG recordings from participants aged 5 to 55, the authors assess the encoding of emotional content in high-level cortical regions. They report that while both the posterior superior temporal cortex (pSTC) and dorsolateral prefrontal cortex (DLPFC) are involved in representing facial emotions in older individuals, only the pSTC shows significant encoding in children. Moreover, the encoding of complex emotions in the pSTC appears to strengthen with age. These findings lead the authors to suggest that young children rely more on low-level sensory areas and propose a developmental shift from reliance on lower-level sensory areas in early childhood to increased top-down modulation by the prefrontal cortex as individuals mature.

      Strengths: 

      (1) Rare and valuable dataset: The use of intracranial EEG recordings in a developmental sample is highly unusual and provides a unique opportunity to investigate neural dynamics with both high spatial and temporal resolution. 

      (2) Developmentally relevant design: The broad age range and cross-sectional design are well-suited to explore age-related changes in neural representations. 

      (3) Ecological validity: The use of naturalistic stimuli (movie clips) increases the ecological relevance of the findings. 

      (4) Feature-based analysis: The authors employ AI-based tools to extract emotion-related features from naturalistic stimuli, which enables a data-driven approach to decoding neural representations of emotional content. This method allows for a more fine-grained analysis of emotion processing beyond traditional categorical labels. 

      Weaknesses: 

      (1) The emotional stimuli included facial expressions embedded in speech or music, making it difficult to isolate neural responses to facial emotion per se from those related to speech content or music-induced emotion. 

      We thank the reviewer for raising this important point. We agree that in naturalistic settings, faces often co-occur with speech, and that these sources of emotion can overlap. However, emotions induced by background music have distinct temporal dynamics that are separable from facial emotion (see Author response image 2 (A) and (B) below). In addition, faces can convey a wide range of emotions (48 categories in the Hume AI model), whereas music conveys far fewer (13 categories reported by a recent study [1]). Thus, when facial emotion feature time series are used as regressors (with 48 emotion categories and rapid temporal dynamics), the model performance will reflect neural encoding of facial emotion in the music condition, rather than the slower and lower-dimensional emotion from music. 

      For the speech condition, we acknowledge that it is difficult to fully isolate neural responses to facial emotion from those to speech when the emotional content from faces and speech highly overlaps. However, in our study, (1) the time courses of emotion features from face and voice are still different (Author response image 2 (C) and (D)), and (2) our main finding that DLPFC encodes facial expression information in post-childhood individuals but not in young children was found in both the speech and music conditions (Figure 2B and 2C). In the music condition, neural responses to facial emotion are not affected by speech. Thus, we have included the DLPFC results from the music condition in the revised manuscript (Figure 2C), and we acknowledge that this issue should be carefully considered in future studies using videos with speech, as we have indicated in the future directions in the last paragraph of the Discussion.

      (1) Cowen, A. S., Fang, X., Sauter, D. & Keltner, D. What music makes us feel: At least 13 dimensions organize subjective experiences associated with music across different cultures. Proc Natl Acad Sci USA 117, 1924–1934 (2020).

      Author response image 2.

      Time courses of amusement. (A) and (B) Amusement conveyed by face or music in a 30-s music block. Facial emotion features were extracted by Hume AI. For emotion from music, we approximated the amusement time course using a weighted combination of low-level acoustic features (RMS energy, spectral centroid, MFCCs), which capture intensity, brightness, and timbre cues linked to amusement. Notice that music continues when no faces are presented. (C) and (D) Amusement conveyed by face or voice in a 30-s speech block. From 0 to 5 seconds, a girl is introducing her friend to a stranger. The camera focuses on the friend, who appears nervous, while the girl’s voice sounds cheerful. This mismatch explains why the shapes of the two time series differ at the beginning. Such situations occur frequently in naturalistic movies.

      (2) While the authors leveraged Hume AI to extract facial expression features from the video stimuli, they did not provide any validation of the tool's accuracy or reliability in the context of their dataset. It remains unclear how well the AI-derived emotion ratings align with human perception, particularly given the complexity and variability of naturalistic stimuli. Without such validation, it is difficult to assess the interpretability and robustness of the decoding results based on these features.  

      Hume AI models were trained and validated on human intensity ratings of large-scale, experimentally controlled emotional expression data [1-2]. The training process used both manual annotations from human raters and deep neural networks. Over 3000 human raters categorized facial expressions into emotion categories and rated them on a 1-100 intensity scale. Thus, the outputs of the Hume AI model reflect what typical facial expressions convey (based on how people actually interpret them), that is, the presented facial emotion. The goal of the present study was to examine how facial emotions presented in the videos are encoded in the human brain at different developmental stages. We agree that the interpretation of facial emotions may differ across individual participants, resulting in a different perceived emotion (i.e., the emotion that the observer subjectively interprets). Behavioral ratings are necessary to study the encoding of subjectively perceived emotion, which is a very interesting direction but beyond the scope of the present work. We have added text in the Discussion to explicitly note that our study focused on the encoding of presented emotion (second paragraph on Page 8).

      (1) Brooks, J. A. et al. Deep learning reveals what facial expressions mean to people in different cultures. iScience 27, 109175 (2024).

      (2) Brooks, J. A. et al. Deep learning reveals what vocal bursts express in different cultures. Nat. Hum. Behav. 7, 240–250 (2023).

      (3) Only two children had relevant pSTC coverage, severely limiting the reliability and generalizability of results.  

      We appreciate this point and agree with both reviewers, who raised it as a significant concern. As described in our response to reviewer 1 (comment 1), we have added data from another two children who have pSTC coverage. Group-level analysis using a permutation test showed that children’s pSTC significantly encodes facial emotion in naturalistic contexts (Figure 3B). Because iEEG data from young children are extremely rare, rapidly increasing the sample size within a few years is not feasible. However, we are confident in the reliability of our conclusion that children’s pSTC can encode facial emotion. First, the two new children’s responses (S33 and S49) from pSTC were highly consistent with our previous observations (see individual data in Figure 3B). Second, the averaged prediction accuracy in children’s pSTC (r<sub>speech</sub>=0.1565) was highly comparable to that in the post-childhood group (r<sub>speech</sub>=0.1515).

      (4) The rationale for focusing exclusively on high-frequency activity for decoding emotion representations is not provided, nor are results from other frequency bands explored.   

      We focused on high-frequency broadband (HFB) activity because it is widely considered to reflect the responses of local neuronal populations near the recording electrode, whereas low-frequency oscillations in the theta, alpha, and beta ranges are thought to serve as carrier frequencies for long-range communication across distributed networks [1-2]. Since our study aimed to examine the representation of facial emotion in localized cortical regions (DLPFC and pSTC), HFB activity provides the most direct measure of the relevant neural responses. We have added this rationale to the manuscript (Page 3).

      (1) Parvizi, J. & Kastner, S. Promises and limitations of human intracranial electroencephalography. Nat. Neurosci. 21, 474–483 (2018).

      (2) Buzsaki, G. Rhythms of the Brain. (Oxford University Press, Oxford, 2006).

      (5) The hypothesis of developmental emergence of top-down prefrontal modulation is not directly tested. No connectivity or co-activation analyses are reported, and the number of participants with simultaneous coverage of pSTC and DLPFC is not specified.  

      Directional connectivity analysis results were not shown because only one child has simultaneous coverage of pSTC and DLPFC. However, the Granger causality results from the post-childhood group (N=7) clearly showed that the influence in the alpha/beta band from DLPFC to pSTC (top-down) gradually increases after the onset of face presentation (Author response image 3, below left, plotted in red). By comparison, the influence in the alpha/beta band from pSTC to DLPFC (bottom-up) gradually decreases after the onset of face presentation (Author response image 3, below left, blue curve). The influence in the alpha/beta band from DLPFC to pSTC was significantly increased at 750 and 1250 ms after face presentation (face vs nonface, paired t-test, Bonferroni-corrected P=0.005, 0.006), suggesting enhanced top-down modulation in the post-childhood group while watching emotional faces. Interestingly, this top-down influence appears very different in the 8-year-old child at 1250 ms after face presentation (Author response image 3, below left, black curve).

      As we cannot draw direct conclusions from the single-subject sample presented here, the top-down hypothesis is introduced only as a possible explanation for our current results. We have removed potentially misleading statements, and we plan to test this hypothesis directly using MEG in the future.

      Author response image 3.

      Difference of Granger causality indices (face – nonface) in the alpha/beta and gamma bands for both directions. We identified a series of face onsets in the movie that participants watched. Each trial was defined as -0.1 to 1.5 s relative to the onset. For the non-face control trials, we used houses, animals, and scenes. Granger causality was calculated for the 0-0.5 s, 0.5-1 s, and 1-1.5 s time windows. For the post-childhood group, GC indices were averaged across participants. Error bars indicate SEM.

      (6) The "post-childhood" group spans ages 13-55, conflating adolescence, young adulthood, and middle age. Developmental conclusions would benefit from finer age stratification.  

      We appreciate this insightful comment. Our current sample size does not allow such stratification, but we plan to address this important issue in future MEG studies with larger cohorts.

      (7) The so-called "complex emotions" (e.g., embarrassment, pride, guilt, interest) used in the study often require contextual information, such as speech or narrative cues, for accurate interpretation, and are not typically discernible from facial expressions alone. As such, the observed age-related increase in neural encoding of these emotions may reflect not solely the maturation of facial emotion perception, but rather the development of integrative processing that combines facial, linguistic, and contextual cues. This raises the possibility that the reported effects are driven in part by language comprehension or broader social-cognitive integration, rather than by changes in facial expression processing per se.  

      We agree with this interpretation. Indeed, our results already show that speech influences the encoding of facial emotion in the DLPFC differently in the childhood and post-childhood groups (Figure 2D), suggesting that children’s ability to integrate multiple cues is still developing. Future studies are needed to systematically examine how linguistic cues and prior experiences contribute to the understanding of complex emotions from faces, which we have added to our future directions section (last paragraph of the Discussion, Pages 8-9).

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      In the introduction: "These neuroimaging data imply that social and emotional experiences shape the prefrontal cortex's involvement in processing the emotional meaning of faces throughout development, probably through top-down modulation of early sensory areas." Aren't these supposed to be iEEG data instead of neuroimaging? 

      Corrected.

      Reviewer #2 (Recommendations for the authors):

      This manuscript would benefit from several improvements to strengthen the validity and interpretability of the findings:

      (1) Increase the sample size, especially for children with pSTC coverage. 

      We added data from another two children who have pSTC coverage. Please see our response to reviewer 2’s comment 3 and reviewer 1’s comment 1.

      (2) Include directional connectivity analyses to test the proposed top-down modulation from DLPFC to pSTC. 

      Thanks for the suggestion. Please see our response to reviewer 2’s comment 5.

      (3) Use controlled stimuli in an additional experiment to separate the effects of facial expression, speech, and music. 

      This is an excellent point. However, iEEG data collection from children is an exceptionally rare opportunity and typically requires many years, so we are unable to add a controlled-stimulus experiment to the current study. We plan to use controlled stimuli to study the processing of complex emotions with non-invasive methods in the future. In addition, please see our response to reviewer 2’s comment 1 for a description of how neural responses to facial expression and music are separated in our study.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary

      This is a strong paper that presents a clear advance in multi-animal tracking. The authors introduce an updated version of idtracker.ai that reframes identity assignment as a contrastive learning problem rather than a classification task requiring global fragments. This change leads to gains in speed and accuracy. The method eliminates a known bottleneck in the original system, and the benchmarking across species is comprehensive and well executed. I think the results are convincing and the work is significant.

      Strengths

      The main strengths are the conceptual shift from classification to representation learning, the clear performance gains, and the fact that the new version is more robust. Removing the need for global fragments makes the software more flexible in practice, and the accuracy and speed improvements are well demonstrated. The software appears thoughtfully implemented, with GUI updates and integration with pose estimators.

      Weaknesses

      I don't have any major criticisms, but I have identified a few points that should be addressed to improve the clarity and accuracy of the claims made in the paper.

      (1) The title begins with "New idtracker.ai," which may not age well and sounds more promotional than scientific. The strength of the work is the conceptual shift to contrastive representation learning, and it might be more helpful to emphasize that in the title rather than branding it as "new."

      We considered using “Contrastive idtracker.ai”. However, we thought that readers might then assume that either the old idtracker.ai or this contrastive version could be used. We want to convey that the new version is the one to use, as it is better in both accuracy and tracking time. We think “New idtracker.ai” communicates more clearly that this is the version we recommend.

      (2) Several technical points regarding the comparison between TRex (a system evaluated in the paper) and idtracker.ai should be addressed to ensure the evaluation is fair and readers are fully informed.

      (2.1) Lines 158-160: The description of TRex as based on "Protocol 2 of idtracker.ai" overlooks several key additions in TRex, such as posture image normalization, tracklet subsampling, and the use of uniqueness feedback during training. These features are not acknowledged, and it's unclear whether TRex was properly configured - particularly regarding posture estimation, which appears to have been omitted but isn't discussed. Without knowing the actual parameters used to make comparisons, it's difficult to assess how the method was evaluated.

      We added the information about the key additions of TRex in the section “The new idtracker.ai uses representation learning”, lines 153-157. Posture estimation in TRex was not explicitly used but neither disabled during the benchmark; we clarified this in the last paragraph of “Benchmark of accuracy and tracking time”, lines 492-495.

      (2.2) Lines 162-163: The paper implies that TRex gains speed by avoiding Protocol 3, but in practice, idtracker.ai also typically avoids using Protocol 3 due to its extremely long runtime. This part of the framing feels more like a rhetorical contrast than an informative one.

      We removed this, see new lines 153-157.

      (2.3) Lines 277-280: The contrastive loss function is written using the label l, but since it refers to a pair of images, it would be clearer and more precise to write it as l_{I,J}. This would help readers unfamiliar with contrastive learning understand the formulation more easily.

      We added this change in lines 613-620.
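
      To illustrate the pair-label notation discussed in this point, here is a minimal sketch of a pairwise contrastive loss in which l_{I,J} = 1 marks a positive pair and l_{I,J} = 0 a negative pair. The hinge form, the function name `pair_loss` and the default values of `d_pos` and `d_neg` are assumptions made for illustration only; the exact formula is the one given in the manuscript and is not reproduced in this response.

```python
import math


def pair_loss(emb_i, emb_j, l_ij, d_pos=1.0, d_neg=10.0):
    """Illustrative loss for one image pair (I, J); not the manuscript's exact formula.

    l_ij = 1: positive pair, penalized only while farther apart than d_pos.
    l_ij = 0: negative pair, penalized only while closer than d_neg.
    d_pos and d_neg are hypothetical values chosen for this sketch.
    """
    d = math.dist(emb_i, emb_j)  # Euclidean distance between the two embeddings
    if l_ij == 1:
        return max(0.0, d - d_pos) ** 2
    return max(0.0, d_neg - d) ** 2
```

      Under these assumptions, a positive pair stops contributing once its distance falls to `d_pos`, and a negative pair stops contributing once its distance reaches `d_neg`, so the label of the pair alone decides which of the two terms is active.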

      (2.4) Lines 333-334: The manuscript states that TRex can fail to track certain videos, but this may be inaccurate depending on how the authors classify failures. TRex may return low uniqueness scores if training does not converge well, but this isn't equivalent to tracking failure. Moreover, the metric reported by TRex is uniqueness, not accuracy. Equating the two could mislead readers. If the authors did compare outputs to human-validated data, that should be stated more explicitly.

      We observed TRex crashing without outputting any trajectories on some occasions (Appendix 1—figure 1), and this is what we labeled as “failure”. These failures happened in the most difficult videos of our benchmark, that’s why we treated them the same way as idtracker.ai going to P3. We clarified this in new lines 464-469.

      The accuracy measured in our benchmark is not estimated; it is human-validated (see section Computation of tracking accuracy in Appendix 1). Both software packages report quality estimators at the end of a tracking run (“estimated accuracy” for idtracker.ai and "uniqueness” for TRex), but these were not used in the benchmark.

      (2.5) Lines 339-341: The evaluation approach defines a "successful run" and then sums the runtime across all attempts up to that point. If success is defined as simply producing any output, this may not reflect how experienced users actually interact with the software, where parameters are iteratively refined to improve quality.

      Yes, our benchmark was designed to be agnostic to the level of experience of the user. It also models users who do not inspect the trajectories before choosing parameters again, so as to leave no room for subjectivity.

      (2.6) Lines 344-346: The simulation process involves sampling tracking parameters 10,000 times and selecting the first "successful" run. If parameter tuning is randomized rather than informed by expert knowledge, this could skew the results in favor of tools that require fewer or simpler adjustments. TRex relies on more tunable behavior, such as longer fragments improving training time, which this approach may not capture.

      We precisely used the TRex parameter track_max_speed to elongate fragments for optimal tracking. Rather than randomized parameter tuning, we defined the “valid range” for this parameter so that all values in it would produce a decent fragment structure. We used this procedure to avoid worsening those methods that use more parameters.

      (2.7) Line 354 onward: TRex was evaluated using two varying parameters (threshold and track_max_speed), while idtracker.ai used only one (intensity_threshold). With a fixed number of samples, this asymmetry could bias results against TRex. In addition, users typically set these parameters based on domain knowledge rather than random exploration.

      idtracker.ai and TRex have several parameters. Some of them have a single correct value (e.g. number of animals) or the default value that the system computes is already good (e.g. minimum blob size). For a second type of parameters, the system finds a value that is in general not as good, so users need to modify them. In general, users find that for this second type of parameter there is a valid interval of possible values, from which they need to choose a single value to run the system. idtracker.ai has intensity_threshold as the only parameter of this second type and TRex has two: threshold and track_max_speed. For these parameters, choosing one value or another within the valid interval can give different tracking results. Therefore, when we model a user that wants to run the system once except if it goes to P3 (idtracker.ai) or except if it crashes (TRex), it is these parameters we sample from within the valid interval to get a different value for each run of the system. We clarify this in lines 452-469 of the section “Benchmark of accuracy and tracking time”.

      Note that if we chose to simply run old idtracker.ai (v4 or v5) or TRex a single time, this would benefit the new idtracker.ai (v6). This is because old idtracker.ai can enter the very slow Protocol 3 and TRex can fail to track. So running old idtracker.ai or TRex up to 5 times, until old idtracker.ai does not use Protocol 3 and TRex does not fail, makes them as good as they can be with respect to the new idtracker.ai.
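
      The modelled user described above can be sketched as follows. This is a simplified illustration, not the benchmark code: `run_once` is a hypothetical callable standing in for one run of a tracker, "failure" stands for entering Protocol 3 (old idtracker.ai) or crashing (TRex), and uniform sampling within the valid interval is an assumption.

```python
import random


def simulate_user(run_once, valid_range, max_runs=5, rng=None):
    """Re-run with a freshly sampled parameter until a run succeeds.

    run_once(value) is a hypothetical callable returning (succeeded, seconds).
    Returns (succeeded, total_seconds, runs_used); time is summed over all attempts.
    """
    rng = rng or random.Random()
    low, high = valid_range
    total = 0.0
    succeeded = False
    runs = 0
    for runs in range(1, max_runs + 1):
        value = rng.uniform(low, high)  # one value from the valid interval
        succeeded, seconds = run_once(value)
        total += seconds
        if succeeded:
            break
    return succeeded, total, runs


# Toy stand-in for a tracker: the first two runs "fail", the third succeeds.
outcomes = iter([(False, 120.0), (False, 90.0), (True, 30.0)])
result = simulate_user(lambda value: next(outcomes), (0.1, 0.9), rng=random.Random(1))
```

      In this toy case the reported tracking time is the sum over the three attempts. For a tool with two parameters of the second type, each would be sampled from its own valid interval on every attempt.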

      (2.8) Figure 2-figure supplement 3: The memory usage comparison lacks detail. It's unclear whether RAM or VRAM was measured, whether shared or compressed memory was included, or how memory was sampled. Since both tools dynamically adjust to system resources, the relevance of this comparison is questionable without more technical detail.

      We modified the text in the caption (new Figure 1-figure supplement 2) adding the kind of memory we measured (RAM) and how we measured it. We already have a disclaimer for this plot saying that memory management depends on the machine's available resources. We agree that this is a simple analysis of the usage of computer resources.

      (3) While the authors cite several key papers on contrastive learning, they do not use the introduction or discussion to effectively situate their approach within related fields where similar strategies have been widely adopted. For example, contrastive embedding methods form the backbone of modern facial recognition and other image similarity systems, where the goal is to map images into a latent space that separates identities or classes through clustering. This connection would help emphasize the conceptual strength of the approach and align the work with well-established applications. Similarly, there is a growing literature on animal re-identification (ReID), which often involves learning identity-preserving representations across time or appearance changes. Referencing these bodies of work would help readers connect the proposed method with adjacent areas using similar ideas, and show that the authors are aware of and building on this wider context.

      We have now added a new section in Appendix 3, “Differences with previous work in contrastive/metric learning” (lines 792-841) to include references to previous work and a description of what we do differently.

      (4) Some sections of the Results text (e.g., lines 48-74) read more like extended figure captions than part of the main narrative. They include detailed explanations of figure elements, sorting procedures, and video naming conventions that may be better placed in the actual figure captions or moved to supplementary notes. Streamlining this section in the main text would improve readability and help the central ideas stand out more clearly.

      Thank you for pointing this out. We have rewritten the Results, for example streamlining the old lines 48-74 (new lines 42-48) by moving the comments about names, files and order of videos to the caption of Figure 1.

      Overall, though, this is a high-quality paper. The improvements to idtracker.ai are well justified and practically significant. Addressing the above comments will strengthen the work, particularly by clarifying the evaluation and comparisons.

      We thank the reviewer for the detailed suggestions. We believe we have taken all of them into consideration to improve the ms.

      Reviewer #2 (Public review):

      Summary:

      This work introduces a new version of the state-of-the-art idtracker.ai software for tracking multiple unmarked animals. The authors aimed to solve a critical limitation of their previous software, which relied on the existence of "global fragments" (video segments where all animals are simultaneously visible) to train an identification classifier network, in addition to addressing concerns with runtime speed. To do this, the authors have both re-implemented the backend of their software in PyTorch (in addition to numerous other performance optimizations) as well as moving from a supervised classification framework to a self-supervised, contrastive representation learning approach that no longer requires global fragments to function. By defining positive training pairs as different images from the same fragment and negative pairs as images from any two co-existing fragments, the system cleverly takes advantage of partial (but high-confidence) tracklets to learn a powerful representation of animal identity without direct human supervision. Their formulation of contrastive learning is carefully thought out and comprises a series of empirically validated design choices that are both creative and technically sound. This methodological advance is significant and directly leads to the software's major strengths, including exceptional performance improvements in speed and accuracy and a newfound robustness to occlusion (even in severe cases where no global fragments can be detected). Benchmark comparisons show the new software is, on average, 44 times faster (up to 440 times faster on difficult videos) while also achieving higher accuracy across a range of species and group sizes. 
This new version of idtracker.ai is shown to consistently outperform the closely related TRex software (Walter & Couzin, 2021), which, together with the engineering innovations and usability enhancements (e.g., outputs convenient for downstream pose estimation), positions this tool as an advancement on the state-of-the-art for multi-animal tracking, especially for collective behavior studies.
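
      The pair definition summarized above (positive pairs from the same fragment, negative pairs from two co-existing fragments) can be sketched as follows. The dictionary layout, the function names and the uniform sampling are assumptions for illustration; they are not taken from the idtracker.ai code.

```python
import random


def coexist(a, b):
    """True when two fragments share at least one frame."""
    return a["start"] <= b["end"] and b["start"] <= a["end"]


def sample_positive(fragments, rng):
    """Two different images drawn from one fragment (same animal)."""
    candidates = [f for f in fragments if len(f["images"]) >= 2]
    frag = rng.choice(candidates)
    return tuple(rng.sample(frag["images"], 2))


def sample_negative(fragments, rng):
    """One image from each of two co-existing fragments (different animals)."""
    pairs = [(a, b) for i, a in enumerate(fragments)
             for b in fragments[i + 1:] if coexist(a, b)]
    a, b = rng.choice(pairs)
    return rng.choice(a["images"]), rng.choice(b["images"])


# Toy data: fragments A and B overlap in time; C starts after both have ended.
fragments = [
    {"start": 0, "end": 9, "images": ["A0", "A1", "A2"]},
    {"start": 5, "end": 14, "images": ["B0", "B1"]},
    {"start": 20, "end": 29, "images": ["C0", "C1"]},
]
rng = random.Random(0)
pos = sample_positive(fragments, rng)
neg = sample_negative(fragments, rng)
```

      Fragment C never appears in a negative pair in this toy example: it does not overlap in time with A or B, so nothing can be said about whether it shows the same animal. No frame with all animals visible is needed to build either kind of pair.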

      Despite these advances, we note a number of weaknesses and limitations that are not well addressed in the present version of this paper:

      Weaknesses

      (1) The contrastive representation learning formulation. Contrastive representation learning using deep neural networks has long been used for problems in the multi-object tracking domain, popularized through ReID approaches like DML (Yi et al., 2014) and DeepReID (Li et al., 2014). More recently, contrastive learning has become more popular as an approach for scalable self-supervised representation learning for open-ended vision tasks, as exemplified by approaches like SimCLR (Chen et al., 2020), SimSiam (Chen et al., 2020), and MAE (He et al., 2021) and instantiated in foundation models for image embedding like DINOv2 (Oquab et al., 2023). Given their prevalence, it is useful to contrast the formulation of contrastive learning described here relative to these widely adopted approaches (and why this reviewer feels it is appropriate):

      (1.1) No rotations or other image augmentations are performed to generate positive examples. These are not necessary with this approach since the pairs are sampled from heuristically tracked fragments (which produces sufficient training data, though see weaknesses discussed below) and the crops are pre-aligned egocentrically (mitigating the need for rotational invariance).

      (1.2) There is no projection head in the architecture, like in SimCLR. Since classification/clustering is the only task that the system is intended to solve, the more general "nuisance" image features that this architectural detail normally affords are not necessary here.

      (1.3) There is no stop gradient operator like in BYOL (Grill et al., 2020) or SimSiam. Since the heuristic tracking implicitly produces plenty of negative pairs from the fragments, there is no need to prevent representational collapse due to class asymmetry. Some care is still needed, but the authors address this well through a pair sampling strategy (discussed below).

      (1.4) Euclidean distance is used as the distance metric in the loss rather than cosine similarity as in most contrastive learning works. While cosine similarity coupled with L2-normalized unit hypersphere embeddings has proven to be a successful recipe to deal with the curse of dimensionality (with the added benefit of bounded distance limits), the authors address this through a cleverly constructed loss function that essentially allows direct control over the intra- and inter-cluster distance (D_pos and D_neg). This is a clever formulation that aligns well with the use of K-means for the downstream assignment step.

      No concerns here, just clarifications for readers who dig into the review. Referencing the above literature would enhance the presentation of the paper to align with the broader computer vision literature.

      Thank you for this detailed comparison. We have now added a new section in Appendix 3, “Differences with previous work in contrastive/metric learning” (lines 792-841) to include references to previous work and a description of what we do differently, including the points raised by the reviewer.

      (2) Network architecture for image feature extraction backbone. As most of the computations that drive up processing time happen in the network backbone, the authors explored a variety of architectures to assess speed, accuracy, and memory requirements. They land on ResNet18 due to its empirically determined performance. While the experiments that support this choice are solid, the rationale behind the architecture selection is somewhat weak. The authors state that: "We tested 23 networks from 8 different families of state-of-the-art convolutional neural network architectures, selected for their compatibility with consumer-grade GPUs and ability to handle small input images (20 × 20 to 100 × 100 pixels) typical in collective animal behavior videos."

      (2.1) Most modern architectures have variants that are compatible with consumer-grade GPUs. This is true of, for example, HRNet (Wang et al., 2019), ViT (Dosovitskiy et al., 2020), SwinT (Liu et al., 2021), or ConvNeXt (Liu et al., 2022), all of which report single GPU training and fast runtime speeds through lightweight configuration or subsequent variants, e.g., MobileViT (Mehta et al., 2021). The authors may consider revising that statement or providing additional support for that claim (e.g., empirical experiments) given that these have been reported to outperform ResNet18 across tasks.

      Following the recommendation of the reviewer, we tested the architectures SwinT, ConvNeXt and ViT. We found out that none of them outperformed ResNet18 since they all showed a slower learning curve. This would result in higher tracking times. These tests are now included in the section “Network architecture” (lines 550-611).

      (2.2) The compatibility of different architectures with small image sizes is configurable. Most convolutional architectures can be readily adapted to work with smaller image sizes, including 20x20 crops. With their default configuration, they lose feature map resolution through repeated pooling and downsampling steps, but this can be readily mitigated by swapping out standard convolutions with dilated convolutions and/or by setting the stride of pooling layers to 1, preserving feature map resolution across blocks. While these are fairly straightforward modifications (and are even compatible with using pretrained weights), an even more trivial approach is to pad and/or resize the crops to the default image size, which is likely to improve accuracy at a possibly minimal memory and runtime cost. These techniques may even improve the performance with the architectures that the authors did test out.

      The only two tested architectures that require a minimum image size are AlexNet and DenseNet. DenseNet proved to underperform ResNet18 in the videos where the images are sufficiently large. We have tested AlexNet with padded images to see that it also performs worse than ResNet18 (see Appendix 3—figure 1).

      We also tested the initialization of ResNet18 with pre-trained weights from ImageNet (in Appendix 3—figure 2) and it proved to bring no benefit to the training speed (added in lines 591-592).

      (2.3) The authors do not report whether the architecture experiments were done with pretrained or randomly initialized weights.

      We adapted the text to make it clear that the networks are always randomly initialized (lines 591-592, lines 608-609 and the captions of Appendix 3—figure 1 and 2).

      (2.4) The authors do not report some details about their ResNet18 design, specifically whether a global pooling layer is used and whether the output fully connected layer has any activation function. Additionally, they do not report the version of ResNet18 employed here, namely, whether the BatchNorm and ReLU are applied after (v1) or before (v2) the conv layers in the residual path.

      We use ResNet18 v1 with no activation function nor bias in its last layer (this has been clarified in the lines 606-608). Also, by design, ResNet has a global average pool right before the last fully connected layer which we did not remove. In response to the reviewer, Resnet18 v2 was tested and its performance is the same as that of v1 (see Appendix 3—figure 1 and lines 590-591).

      (3) Pair sampling strategy. The authors devised a clever approach for sampling positive and negative pairs that is tailored to the nature of the formulation. First, since the positive and negative labels are derived from the co-existence of pretracked fragments, selection has to be done at the level of fragments rather than individual images. This would not be the case if one of the newer approaches for contrastive learning were employed, but it serves as a strength here (assuming that fragment generation/first pass heuristic tracking is achievable and reliable in the dataset). Second, a clever weighted sampling scheme assigns sampling weights to the fragments that are designed to balance "exploration and exploitation". They weigh samples both by fragment length and by the loss associated with that fragment to bias towards different and more difficult examples.

      (3.1) The formulation described here resembles and uses elements of online hard example mining (Shrivastava et al., 2016), hard negative sampling (Robinson et al., 2020), and curriculum learning more broadly. The authors may consider referencing this literature (particularly Robinson et al., 2020) for inspiration and to inform the interpretation of the current empirical results on positive/negative balancing.

      Following this recommendation, we added references of hard negative mining in the new section “Differences with previous work in contrastive/metric learning”, lines 792-841. Regarding curriculum learning, even though in spirit it might have parallels with our sampling method in the sense that there is a guided training of the network, we believe the approach is more similar to an exploration-exploitation paradigm.
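
      A weighted sampling of fragments that combines fragment length with the current loss, as discussed in this point, could look like the sketch below. The linear mixture, the coefficient `alpha` and the function name are hypothetical; the actual weighting used in idtracker.ai is the one described in Appendix 3 of the manuscript.

```python
import random


def fragment_weights(lengths, losses, alpha=0.5):
    """Mix a length-based term with a loss-based term into sampling weights.

    alpha is a hypothetical mixing coefficient; the weights sum to 1.
    """
    total_len = sum(lengths)
    total_loss = sum(losses)
    weights = []
    for n, loss in zip(lengths, losses):
        by_length = n / total_len
        # Fall back to a uniform share when every fragment has zero loss.
        by_loss = loss / total_loss if total_loss > 0 else 1.0 / len(losses)
        weights.append(alpha * by_length + (1.0 - alpha) * by_loss)
    return weights


lengths = [100, 100, 50]   # images per fragment (toy values)
losses = [0.1, 0.9, 0.5]   # current mean loss per fragment (toy values)
weights = fragment_weights(lengths, losses)
rng = random.Random(0)
picked = rng.choices(range(len(lengths)), weights=weights, k=5)
```

      With equal lengths, the fragment with the higher loss is drawn more often, while the length term keeps every fragment in play in proportion to the data it holds.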

      (4) Speed and accuracy improvements. The authors report considerable improvements in speed and accuracy of the new idTracker (v6) over the original idTracker (v4?) and TRex. It's a bit unclear, however, which of these are attributable to the engineering optimizations (v5?) versus the representation learning formulation.

      (4.1) Why is there an improvement in accuracy in idTracker v5 (L77-81)? This is described as a port to PyTorch and improvements largely related to the memory and data loading efficiency. This is particularly notable given that the progression went from 97.52% (v4; original) to 99.58% (v5; engineering enhancements) to 99.92% (v6; representation learning), i.e., most of the new improvement in accuracy owes to the "optimizations" which are not the central emphasis of the systematic evaluations reported in this paper.

      V5 was a two-year effort designed to improve the time efficiency of v4. It was also a surprise to us that accuracy was higher, but that likely comes from the fact that the code replaced from v4 contained one or more small bugs. The improvements in v5 are retained in v6 (contrastive learning), and v6 has higher accuracy and shorter tracking times. The extra accuracy and shorter tracking times in v6 are due to contrastive learning.

      (4.2) What about the speed improvements? Relative to the original (v4), the authors report average speed-ups of 13.6x in v5 and 44x in v6. Presumably, the drastic speed-up in v6 comes from a lower Protocol 2 failure rate, but v6 is not evaluated in Figure 2 - figure supplement 2.

      idtracker.ai v5 runs an optimized Protocol 2 and, sometimes, Protocol 3. But v6 doesn’t run either of them. While P2 is still present in v6 as a fallback protocol when contrastive learning fails, in our v6 benchmark P2 was never needed. So the v6 speedup comes from replacing both P2 and P3 with the contrastive algorithm.

      (5) Robustness to occlusion. A major innovation enabled by the contrastive representation learning approach is the ability to tolerate the absence of a global fragment (contiguous frames where all animals are visible) by requiring only co-existing pairs of fragments owing to the paired sampling formulation. While this removes a major limitation of the previous versions of idtracker.ai, its evaluation could be strengthened. The authors describe an ablation experiment where an arc of the arena is masked out to assess the accuracy under artificially difficult conditions. They find that the v6 works robustly up to significant proportions of occlusions, even when doing so eliminates global fragments.

      (5.1) The experiment setup needs to be more carefully described.

      (5.1.1) What does the masking procedure entail? Are the pixels masked out in the original video or are detections removed after segmentation and first pass tracking is done?

      The mask is defined as a region of interest in the software. This means that it is applied at the segmentation step where the video frame is converted to a foreground-background binary image. The region of interest is applied here, converting to background all pixels not inside of it. We clarified this in the newly added section Occlusion tests, lines 240-244.

      (5.1.2) What happens at the boundary of the mask? (Partial segmentation masks would throw off the centroids, and doing it after original segmentation does not realistically model the conditions of entering an occlusion area.)

      Animals at the boundaries of the mask are partially detected. This can change the location of their detected centroid. That’s why, when computing the ground-truth accuracy for these videos, only the ground-truth centroids that were at least 15 pixels away from the mask were considered. We clarified this in the newly added section Occlusion tests, lines 248-251.

      (5.1.3) Are fragments still linked for animals that enter and then exit the mask area?

      No artificial fragment linking was added in these videos. Detected fragments are linked the usual way. If an animal moves under the mask, it disappears, so the fragment breaks. We clarified this in the newly added section Occlusion tests, lines 245-247.

      (5.1.4) How is the evaluation done? Is it computed with or without the masked region detections?

      The groundtruth used to validate these videos contains the positions of all animals at all times. But only the positions outside the mask at each frame were considered to compute the tracking accuracy. We clarified this in the newly added section Occlusion tests, lines 248-251.
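
      The evaluation rule described in points 5.1.2 and 5.1.4 can be sketched geometrically. The sketch assumes the mask is a sector spanning angles from 0 to `theta` about the arena center, unbounded in radius, with angles measured by `atan2` in the coordinate system of the centroids; these conventions and both function names are illustrative assumptions, not the software's implementation.

```python
import math


def distance_to_sector(point, center, theta):
    """Distance from point to a masked sector spanning angles [0, theta] about center.

    Returns 0.0 for points inside the mask. The sector is treated as unbounded
    in radius, which assumes the mask circle is at least as large as the arena.
    """
    dx, dy = point[0] - center[0], point[1] - center[1]
    r = math.hypot(dx, dy)
    phi = math.atan2(dy, dx) % (2.0 * math.pi)
    if r == 0.0 or phi <= theta:
        return 0.0
    # Angular offset to the nearer of the two straight edges of the sector.
    delta = min(phi - theta, 2.0 * math.pi - phi)
    return r * math.sin(delta) if delta < math.pi / 2.0 else r


def centroids_to_evaluate(centroids, center, theta, margin=15.0):
    """Keep ground-truth centroids at least `margin` pixels away from the mask."""
    return [c for c in centroids if distance_to_sector(c, center, theta) >= margin]


center = (0.0, 0.0)
theta = math.pi / 2.0  # mask covers the quadrant with x >= 0 and y >= 0
centroids = [(10.0, 10.0), (10.0, -5.0), (-20.0, 0.0)]
kept = centroids_to_evaluate(centroids, center, theta)
```

      In this toy example the first centroid lies under the mask, the second is only 5 pixels from its edge, and the third is 20 pixels away, so only the third would enter the accuracy computation.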

      (5.2) The circular masking is perhaps not the most appropriate for the mouse data, which is collected in a rectangular arena.

      We wanted to show the same proof of concept in different videos. For that reason, we used a mask covering a sector of the arena, parametrized by an angle. In the rectangular arena, the circular mask is built on an external circle, so it likewise covers a sector of the rectangle parametrized by an angle.

      (5.3) The number of co-existing fragments, which seems to be the main determinant of performance that the authors derive from this experiment, should be reported for these experiments. In particular, a "number of co-existing fragments" vs accuracy plot would support the use of the 0.25(N-1) heuristic and would be especially informative for users seeking to optimize experimental and cage design. Additionally, the number of co-existing fragments can be artificially reduced in other ways other than a fixed occlusion, including random dropout, which would disambiguate it from potential allocentric positional confounds (particularly relevant in arenas where egocentric pose is correlated with allocentric position).

      We included the requested analysis about the fragment connectivity in Figure 3-figure supplement 1. We agree that there can be additional ways of reducing co-existing fragments, but we think the occlusion tests have the additional value that there are many real experiments similar to this test.

      (6) Robustness to imaging conditions. The authors state that "the new idtracker.ai can work well with lower resolutions, blur and video compression, and with inhomogeneous light (Figure 2 - figure supplement 4)." (L156). Despite this claim, there are no speed or accuracy results reported for the artificially corrupted data, only examples of these image manipulations in the supplementary figure.

      We added this information in the same image, new Figure 1 - figure supplement 3.

      (7) Robustness across longitudinal or multi-session experiments. The authors reference idmatcher.ai as a compatible tool for this use case (matching identities across sessions or long-term monitoring across chunked videos), however, no performance data is presented to support its usage. This is relevant as the innovations described here may interact with this setting. While deep metric learning and contrastive learning for ReID were originally motivated by these types of problems (especially individuals leaving and entering the FOV), it is not clear that the current formulation is ideally suited for this use case. Namely, the design decisions described in point 1 of this review are at times at odds with the idea of learning generalizable representations owing to the feature extractor backbone (less scalable), low-dimensional embedding size (less representational capacity), and Euclidean distance metric without hypersphere embedding (possible sensitivity to drift). It's possible that data to support point 6 can mitigate these concerns through empirical results on variations in illumination, but a stronger experiment would be to artificially split up a longer video into shorter segments and evaluate how generalizable and stable the representations learned in one segment are across contiguous ("longitudinal") or discontiguous ("multi-session") segments.

      We have now added a test to prove the reliability of idmatcher.ai in v6. In this test, 14 videos are taken from the benchmark and each is split into two non-overlapping parts (with a 200-frame gap in between). idmatcher.ai is run between the two parts, yielding 100% identity-matching accuracy across all of them (see section “Validity of idmatcher.ai in the new idtracker.ai”, lines 969-1008).

      We thank the reviewer for the detailed suggestions. We believe we have taken all of them into consideration to improve the ms.

      Reviewer #3 (Public review):

      Summary

      The authors propose a new version of idTracker.ai for animal tracking. Specifically, they apply contrastive learning to embed cropped images of animals into a feature space where clusters correspond to individual animal identities.

      Strengths

      By doing this, the new software alleviates the requirement for so-called global fragments - segments of the video, in which all entities are visible/detected at the same time - which was necessary in the previous version of the method. In general, the new method reduces the tracking time compared to the previous versions, while also increasing the average accuracy of assigning the identity labels.

      Weaknesses

      The general impression of the paper is that, in its current form, it is difficult to disentangle the old from the new method and understand the method in detail. The manuscript would benefit from a major reorganization and rewriting of its parts. There are also certain concerns about the accuracy metric and reducing the computational time.

      We have made the following modifications in the presentation:

      (1) We have added section titles to the main text so it is clearer what tracking system we are referring to. For example, we now have sections “Limitation of the original idtracker.ai”, “Optimizing idtracker.ai without changes in the learning method” and “The new idtracker.ai uses representation learning”.

      (2) We have completely rewritten all the text of the ms until we start with contrastive learning. Old L20-89 is now L20-L66, much shorter and easier to read.

      (3) We have rewritten the first 3 paragraphs in the section “The new idtracker.ai uses representation learning” (lines 68-92).

      (4) We have now expanded Appendix 3 to discuss the details of our approach (lines 539-897). It discusses in detail the steps of the algorithm, the network architecture, the loss function, the sampling strategy, the clustering and identity assignment, and the stopping criteria in training.

      (5) To cite previous work in detail and explain what we do differently, we have now added in Appendix 3 the new section “Differences with previous work in contrastive/metric learning” (lines 792-841).

      Regarding accuracy metrics, we have replaced our accuracy metric with the standard metric IDF1. IDF1 is the standard metric that is applied to systems in which the goal is to maintain consistent identities across time. See also the section in Appendix 1 "Computation of tracking accuracy” (lines 414-436) explaining IDF1 and why this is an appropriate metric for our goal.

      Using IDF1 we obtain slightly higher accuracies for the idtracker.ai systems. This is the comparison of mean accuracy over all our benchmark for our previous accuracy score and the new one for the full trajectories:

      v4:   97.42% -> 98.24%

      v5:   99.41% -> 99.49%

      v6:   99.74% -> 99.82%

      TRex: 97.89% -> 97.89%
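
      For readers unfamiliar with IDF1, the sketch below computes it as 2·IDTP / (2·IDTP + IDFP + IDFN) under the one-to-one identity matching that maximizes IDTP. It assumes detections have already been associated with ground-truth positions and brute-forces the matching, so it is only practical for small groups; it is an illustration of the metric, not the code used in the benchmark.

```python
from collections import Counter
from itertools import permutations


def idf1(pairs):
    """IDF1 from per-detection (gt_id, pred_id) pairs; None marks a missing side.

    Brute-forces the one-to-one identity matching, so only suitable for small groups.
    """
    n_gt = sum(1 for g, _ in pairs if g is not None)
    n_pred = sum(1 for _, p in pairs if p is not None)
    if n_gt + n_pred == 0:
        raise ValueError("no detections")
    overlap = Counter((g, p) for g, p in pairs if g is not None and p is not None)
    gt_ids = sorted({g for g, _ in pairs if g is not None})
    pred_ids = sorted({p for _, p in pairs if p is not None})
    # Pad with None so that a ground-truth identity may stay unmatched.
    slots = pred_ids + [None] * len(gt_ids)
    idtp = max(
        sum(overlap[(g, p)] for g, p in zip(gt_ids, perm) if p is not None)
        for perm in permutations(slots, len(gt_ids))
    )
    idfp = n_pred - idtp  # predicted detections not explained by the matching
    idfn = n_gt - idtp    # ground-truth detections not explained by the matching
    return 2 * idtp / (2 * idtp + idfp + idfn)


# Two animals over 10 frames: animal 0 is always labelled 1 (a consistent relabelling),
# animal 1 is labelled 0 in 8 frames and wrongly labelled 1 in 2 frames.
pairs = [(0, 1)] * 10 + [(1, 0)] * 8 + [(1, 1)] * 2
score = idf1(pairs)
```

      The consistent relabelling of animal 0 costs nothing, because the metric first finds the best global correspondence between predicted and true identities; only the two frames in which animal 1 receives the wrong label lower the score.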

      We thank the reviewer for the suggestions about presentation and about the use of more standard metrics.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (1) Figure 1a: A graphical legend inset would make it more readable since there are multiple colors, line styles, and connecting lines to parse out.

      Following this recommendation, we added a graphical legend in the old Figure 1 (new Figure 2).

      (2) L46: "have images" → "has images".

      We applied this correction. Line 35.

      (3) L52: "videos start with a letter for the species (z,**f**,m)", but "d" is used for fly videos.

      We applied this correction in the caption of Figure 1.

      (4) L62: "with Protocol 3 a two-step process" → "with Protocol 3 being a two-step process".

      We rewrote this paragraph without mentioning Protocol 3, lines 37-41.

      (5) L82-89: This is the main statement of the problems that are being addressed here (speed and relaxing the need for global fragments). This could be moved up, emphasized, and made clearer without the long preamble and results on the engineering optimizations in v5. This lack of linearity in the narrative is also evident in the fact that after Figure 1a is cited, inline citations skip to Figure 2 before returning to Figure 1 once the contrastive learning is introduced.

      We have rewritten all the text until the contrastive learning, (old lines 20-89 are now lines 20-66). The text is shorter, more linear and easier to read.

      (6) L114: "pairs until the distance D_{pos}" → "pairs until the distance approximates D_{pos}".

      We rewrote this as “pairs until the distance 𝐷pos (or 𝐷neg) is reached” in line 107.

      (7) L570: Missing a right parenthesis in the equation.

      We no longer have this equation in the ms.

      (8) L705: "In order to identify fragments we, not only need" → "In order to identify fragments, we not only need".

      We applied this correction, Line 775.

      (9) L819: "probably distribution" → "probability distribution".

      We applied this correction, Line 776.

      (10) L833: "produced the best decrease the time required" → "produced the best decrease of the time required".

      We applied this correction, Line 746.
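The reply to point (6) above describes pairs of images being moved until the distance 𝐷pos (or 𝐷neg) is reached. A generic double-margin contrastive loss behaves this way, and the sketch below illustrates it under that assumption; it is not necessarily the loss used in idtracker.ai, which is described in Appendix 3 of the manuscript. The function name `pair_loss` and the default margin values are hypothetical.

```python
import math


def pair_loss(emb_a, emb_b, same_identity, d_pos=1.0, d_neg=10.0):
    """Double-margin contrastive loss for one pair of embeddings.

    Positive pairs (same animal) are penalised only while they are farther
    apart than d_pos; negative pairs only while they are closer than d_neg.
    The margin values are placeholders, not taken from the manuscript.
    """
    distance = math.dist(emb_a, emb_b)  # Euclidean distance
    if same_identity:
        return max(0.0, distance - d_pos) ** 2
    return max(0.0, d_neg - distance) ** 2


# A positive pair already within d_pos contributes nothing to the loss.
print(pair_loss((0.0, 0.0), (0.5, 0.0), same_identity=True))
# A negative pair closer than d_neg is still pushed apart.
print(pair_loss((0.0, 0.0), (3.0, 4.0), same_identity=False))
```

Once a pair satisfies its margin the loss, and therefore the gradient, is zero, which is what "until the distance is reached" means in this formulation.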

      Reviewer #3 (Recommendations for the authors):

      (1) We recommend rewriting and restructuring the manuscript. The paper includes a detailed explanation of the previous approaches (idTracker and idTracker.ai) and their limitations. In contrast, the description of the proposed method is short and unstructured, which makes it difficult to distinguish between the old and new methods as well as to understand the proposed method in general. Here are a few examples illustrating the problem. 

      (1.1) Only in line 90 do the authors start to describe the work done in this manuscript. The previous 3 pages list limitations of the original method.

      We have now divided the main text into sections, so it is clearer what is the previous method (“Limitation of the original idtracker.ai”, lines 28-51), the new optimization we did of this method (“Optimizing idtracker.ai without changes in the learning method”, lines 52-66) and the new contrastive approach that also includes the optimizations (“The new idtracker.ai uses representation learning”, lines 66-164). Also, the new text has now been streamlined until the contrastive section, following your suggestion. In the new writing the three sections are 25, 15 and 99 lines long. The most detailed section describes the new system; the other two are needed as reference, to describe the problem we are solving and the additional optimizations.

      (1.2) The new method does not have a distinct name, and it is hard to follow which idtracker.ai is a specific part of the text referring to. Not naming the new method makes it difficult to understand.

      We use the name new idtracker.ai (v6) so it becomes the current default version. v5 is now obsolete, as well as v4. And from the point of view of the end user, no new name is needed since v6 is just an evolution of the same software they have been using. Also, we added sections in the main text to clarify the ideas in there and indicate the version of idtracker.ai we are referring to.

      (1.3) There are "Protocol 2" and "Protocol 3" mixed with various versions of the software scattered throughout the text, which makes it hard to follow. There should be some systematic naming of approaches and a listing of results introduced.

      Following this recommendation, we no longer talk about the specific protocols of the old version of idtracker.ai in the main text. We have rewritten the explanation of these versions in a clearer and more straightforward way, lines 29-36.

      (2) To this end, the authors leave some important concepts either underexplained or only referenced indirectly via prior work. For example, the explanation of how the fragments are created (line 15) is only explained by the "video structure" and the algorithm that is responsible for resolving the identities during crossings is not detailed (see lines 46-47, 149-150). Including summaries of these elements would improve the paper's clarity and accessibility.

      We listed the specific sections from our previous publication where the reader can find information about the entire tracking pipeline (lines 539-549). This way, we keep the ms clear and focused on the new identification algorithm while indicating where to find such information.

      (3) Accuracy metrics are not clear. In line 319, the authors define it as based on "proportion of errors in the trajectory". This proportion is not explained. How is the error calculated if a trajectory is lost or there are identity swaps? Multi-object tracking has a range of accuracy metrics that account for such events but none of those are used by the authors. Estimating metrics that are common for MOT literature, for example, IDF1, MOTA, and MOTP, would allow for better method performance understanding and comparison.

      In the new ms, we replaced our accuracy metric with the standard metric IDF1. IDF1 is the standard metric that is applied to systems in which the goal is to maintain consistent identities across time. See also the section in Appendix 1 "Computation of tracking accuracy” explaining why IDF1 and not MOTA or MOTP is the adequate metric for a system that wants to give correct tracking by identification in time. See lines 416-436.

      Using IDF1 we obtain slightly higher accuracies for the idtracker.ai systems. This is the comparison of mean accuracy for our previous accuracy score and the new one for the full trajectories:

      v4:   97.42% -> 98.24%

      v5:   99.41% -> 99.49%

      v6:   99.74% -> 99.82%

      trex: 97.89% -> 97.89%

      (4) Additionally, the authors distinguish between tracking with and without crossings, but do not provide statistics on the frequency of crossings per video. It is also unclear how the crossings are considered for the final output. Including information such as the frame rate of the videos would help to better understand the temporal resolution and the differences between consecutive frames of the videos.

      We added this information in Appendix 1 “Benchmark of accuracy and tracking time”, lines 445-451. The frame rate in our benchmark videos goes from 25 to 60 fps (average of 37 fps). On average 2.6% of the blobs are crossings (1.1% for zebrafish, 0.7% for drosophila, 9.4% for mice).

      (5) In the description of the dataset used for evaluation (lines 349-365), the authors describe the random sampling of parameter values for each tracking run. However, it is unclear whether the same values were used across methods. Without this clarification, comparisons between the proposed method, older versions, and TRex might be biased due to lucky parameter combinations. In addition, the ranges from which the values were randomly sampled were also not described.

      Only one parameter is shared between idtracker.ai and TRex: intensity_threshold (in idtracker.ai) and threshold (in TRex). Both are conceptually equivalent but differ in their numerical values since they affect different algorithms. V4, v5, and TRex each required the same process of independent expert visual inspection of the segmentation to select the valid value range. Since versions 5 and 6 use exactly the same segmentation algorithm, they share the same parameter ranges.

      All the ranges of valid values used in our benchmark are public here https://drive.google.com/drive/folders/1tFxdtFUudl02ICS99vYKrZLeF28TiYpZ as stated in the section “Data availability”, lines 227-228.

      (6) Lines 122-123, Figure 1c. "batches" - is an imprecise metric of training time as there is no information about the batch size.

      We clarified the Figure caption, new Figure 2c.

      (7) Line 145 - "we run some steps... For example..." leaves the method description somewhat unclear. It would help if you could provide more details about how the assignments are carried out and which metrics are being used.

      Following this recommendation, we listed the specific sections from our previous publication where the reader can find information about the entire tracking pipeline (lines 539-549). This way, we keep the ms clear and focused on the new identification algorithm while indicating where to find such information.

      (8) Figure 3. How is tracking accuracy assessed with occlusions? Are the individuals correctly recognized when they reappear from the occluded area?

      The groundtruth for this video contains the positions of all animals at all times. Only the groundtruth points inside the region of interest are taken into account when computing the accuracy. When the tracking reaches high accuracy, it means that animals are successfully relabeled every time they enter the non-masked region. Note that this software works all the time by identification of animals, so crossings and occlusion are treated the same way. What is new here is that the occlusions are so large that there are no global fragments. We clarified this in the new section “Occlusion tests” in Methods, lines 239-251.

      (9) Lines 185-187 this part of the sentence is not clear.

      We rewrote this part in a clearer way, lines 180-182.

      (10) The authors also highlight the improved runtime performance. However, they do not provide a detailed breakdown of the time spent on each component of the tracking/training pipeline. A timing breakdown would help to compare the training duration with the other components. For example, the calculation of the Silhouette Score alone can be time-consuming and could be a bottleneck in the training process. Including this information would provide a clearer picture of the overall efficiency of the method.

      We measured that the training of ResNet takes on average in our benchmark 47% of the tracking time (we added this information line 551 section “Network Architecture”). In this training stage the bottleneck becomes the network forward and backward pass, limited by the GPU performance. All other processes happening during training have been deeply optimized and parallelized when needed so their contribution to the training time is minimal. Apart from the training, we also measured 24.4% of the total tracking time spent in reading and segmenting the video files and 11.1% in processing the identification images and detecting crossings.

      (11) An important part of the computational cost is related to model training. It would be interesting to test whether a model trained on one video of a specific animal type (e.g., zebrafish_5) generalizes to another video of the same type (e.g., zebrafish_7). This would assess the model's generalizability across different videos of the same species and spare a lot of compute. Alternatively, instead of training a model from scratch for each video, the authors could also consider training a base model on a superset of images from different videos and then fine-tuning it with a lower learning rate for each specific video. This could potentially save time and resources while still achieving good performance.

      Already before v6, there was the possibility for the user to start training the identification network by copying the final weights from another tracking session. This knowledge transfer feature is still present in v6 and it still decreases the training times significantly. This information has been added in Appendix 4, lines 906-909.

      We have already begun working on the interesting idea of a general base model but it brings some complex challenges. It could be a very useful new feature for future idtracker.ai releases.

      We thank the reviewer for the many suggestions. We have implemented all of them.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #2 (Public review):

      (1) Vglut2 isn't a very selective promoter for the STN. Did the authors verify every injection across brain slices to ensure the para-subthalamic nucleus, thalamus, lateral hypothalamus, and other Vglut2-positive structures were never infected?

      The STN is anatomically well-confined, with its borders and the overlying zona incerta (composed of GABAergic neurons) providing protection against off-target expression in most neighboring forebrain regions. All viral injections were histologically verified and did not extend into thalamic or hypothalamic areas. As described in the Methods, we employed an app we developed (Brain Atlas Analyzer, available on OriginLab) that aligns serial histological sections with the Allen Brain Atlas to precisely assess viral spread and confirm targeting accuracy. The experiments included in the revised manuscript now focus on optogenetic inhibition and irreversible lesion approaches, three complementary methods that consistently targeted the STN and yielded similar behavioral effects.

      (2) The authors say in the methods that the high vs low power laser activation for optogenetic experiments was defined by the behavioral output. This is misleading, and the high vs low power should be objectively stated and the behavioral results divided according to the power used, not according to the behavioral outcome.

      Optogenetic excitation is no longer part of the study.

      (3) In the fiber photometry experiments exposing mice to the range of tones, it is impossible to separate the STN response to the tone from the STN response to the movement evoked by the tone. The authors should expose the mouse to the tones in a condition that prevents movement, such as anesthetized or restrained, to separate out the two components.

      The new mixed-effects modeling approach clearly differentiates sensory (auditory) from motor contributions during tone-evoked STN activation. In prior work (see Hormigo et al, 2023, eLife), we explored experimental methods such as head restraint or anesthesia to reduce movement, but we concluded that these approaches are unsuitable for addressing this question. Mice exhibit substantial residual movement even when head-fixed, and anesthesia profoundly alters neural excitability and behavioral state, introducing major confounds. To fully eliminate movement would require paralysis and artificial ventilation, which would again disrupt physiological network dynamics and raise ethical concerns. Therefore, the current modeling approach—incorporating window-specific covariates for movement—is the most appropriate and rigorous way to dissociate tone-evoked sensory activity from motor activity in behaving animals.
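As a reading aid, one generic way to write a mixed-effects model of this kind is given below. This is a sketch under stated assumptions, not the authors' actual specification: the symbols, the single movement covariate per analysis window, and the random-intercept structure (mouse, and session within mouse) are chosen for illustration only.

```latex
% Hypothetical notation: trial k, session j, mouse i, analysis window w.
y^{(w)}_{ijk} = \beta_0
  + \beta_{\mathrm{tone}}\,\mathrm{tone}_{ijk}
  + \beta_{\mathrm{mov}}\,\mathrm{mov}^{(w)}_{ijk}
  + \beta_{\mathrm{base}}\,\mathrm{base}_{ijk}
  + u_i + v_{ij} + \varepsilon_{ijk},
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2),\quad
v_{ij} \sim \mathcal{N}(0, \sigma_v^2),\quad
\varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)
```

Here $y^{(w)}_{ijk}$ is the STN signal in window $w$, $\mathrm{mov}^{(w)}_{ijk}$ is the movement measured in the same window, and $\mathrm{base}_{ijk}$ is the pre-stimulus baseline. In such a model, a tone coefficient $\beta_{\mathrm{tone}}$ that remains different from zero with the movement term included reflects activity not explained by movement.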

      (4) The claim 'STN activation is ideally suited to drive active avoids' needs more explanation. This claim comes after the fiber photometry experiments during active avoidance tasks, so there has been no causality established yet.

      Text adjusted. 

      (5) The statistical comparisons in Figure 7E need some justification and/or clarification. The 9 neuron types are originally categorized based on their response during avoids, then statistics are run showing that they respond differently during avoids. It is no surprise that they would have significantly different responses, since that is how they were classified in the first place. The authors must explain this further and show that this is not a case of circular reasoning.

      Statistically verifying the clustering is useful to ensure that the selected number of clusters reflects distinct classes. It is also necessary when different measurements are used to classify (movement time series classified the avoids) and to compare neuronal types within each avoid mode/class (now called “mode”). Moreover, the new modeling approach goes beyond the prior statistical limitations related to considering movement and neuronal variables separately.

      (6) The authors show that neurons that have strong responses to orientation show reduced activity during avoidance. What are the implications of this? The author should explain why this is interesting and important.

      The new modeling approach goes beyond the prior analysis limitations. For instance, it shows that most of the prior orienting related activations closely reflect the orienting movement, and only in a few cases (noted and discussed in the results) orienting activations are related to the behavioral contingencies or behavioral outcomes in the task. 

      (7) It is not clear which conditions each mouse experienced in which order. This is critical to the interpretation of Figure 9 and the reduction of passive avoids during STN stimulation. Did these mice have the CS1+STN stimulation pairing or the STN+US pairing prior to this experiment? If they did, the stimulation of the STN could be strongly associated with either punishment or with the CS1 that predicts punishment. If that is the case, stimulating the STN during CS2 could be like presenting CS1+CS2 at the same time and could be confusing.

      Optogenetic excitation is no longer part of the study. 

      (8) The experiments in Figure 10 are used to say that STN stimulation is not aversive, but they only show that STN stimulation cannot be used as punishment in place of a shock. This doesn't mean that it is not aversive; it just means it is not as aversive as a shock. The authors should do a simpler aversion test, such as conditioned or real-time place preference, to claim that STN stimulation is not aversive. This is particularly surprising as previous work (Serra et al., 2023) does show that STN stimulation is aversive.

      Optogenetic excitation is no longer part of the study.

      (9) In the discussion, the idea that the STN encodes 'moving away' from contralateral space is pretty vague and unsupported. It is puzzling that the STN activates more strongly to contraversive turns, but when stimulated, it evokes ipsiversive turns; however, it seems a stretch to speculate that this is related to avoidance. In the last experiments of the paper, the axons from the STN to the GPe and to the midbrain are selectively stimulated. Do these evoke ipsiversive turns similarly?

      Optogenetic excitation is no longer part of the study. 

      (10) In the discussion, the authors claim that the STN is essential for modulating action timing in response to demands, but their data really only show this in one direction. The STN stimulation reliably increases the speed of response in all conditions (except maximum speed conditions such as escapes). It seems to be over-interpreting the data to say this is an inability to modulate the speed of the task, especially as clear learning and speed modulation do occur under STN lesion conditions, as shown in Figure 12B. The mice learn to avoid and increase their latency in AA2 vs AA1, though the overall avoids and latency are different from controls. The more parsimonious conclusion would be that STN stimulation biases movement speed (increasing it) and that this is true in many different conditions.

      Optogenetic excitation is no longer part of the study.

      (11)  In the discussion, the authors claim that the STN projections to the midbrain tegmentum directly affect the active avoidance behavior, while the STN projections to the SNr do not affect it. This seems counter to their results, which show STN projections to either area can alter active avoidance behavior. What is the laser power used in these terminal experiments? If it is high (3mW), the authors may be causing antidromic action potentials in the STN somas, resulting in glutamate release in many brain areas, even when terminals are only stimulated in one area. The authors could use low (0.25mW) laser power in the terminals to reduce the chance of antidromic activation and spatially restrict the optical stimulation.

      Optogenetic excitation is no longer part of the study. 

      (12) Was normality tested for data prior to statistical testing?

      Yes, although we now use mixed models.

      (13) Why are there no error bars on Figure 5B, black circles and orange triangles?

      When error bars are not visible, they are smaller than the trace thickness or bar line—for example, in Figure 5B, the black circles and orange triangles include error bars, but they are smaller than the symbol size.

      Reviewer #3 (Public review):

      (1) I really don't understand or accept this idea that delayed movement is necessarily indicative of cautious movements. Is the distribution of responses multi-modal in a way that might support this idea, or do the authors simply take a normal distribution and assert that the slower responses represent 'caution'? Even if responses are multi-modal and clearly distinguished by 'type', why should readers think that delayed responses imply cautious responding instead of, say, habituation or sensitization to cue/shock; variability in attention, motivation, or stress; or merely uncertainty, which seems plausible given what I understand of the task design where the same mice are repeatedly tested in changing conditions? This relates to a major claim (i.e., in the work's title).

      In our study, “caution” is defined operationally as the tendency to delay initiation of an avoidance response in demanding situations (e.g., taking more time or care before crossing a busy street). The increase in avoidance latency with task difficulty is highly robust, as we have shown previously through detailed analyses of timing distributions and direct comparisons with appetitive behaviors (e.g., Zhou et al., 2022 JNeurosci). Moreover, we used the tracked movement time series to statistically classify responses into cautious modes, which is likely novel. This definition can dissociate cautious responding from broader constructs listed by a reviewer, such as attention, motivation, or stress, which must be explicitly defined to be rigorously considered in this context, including the likelihood that they covary with caution without being equivalent to it. 

      Cue-evoked orienting responses at CS onset are directly measured, and their habituation and sensitization have been characterized in our prior work (e.g., Zhou et al., 2023 JNeurosci). US-evoked escapes are also measured in the present study and directly compared with avoidance responses. Together, these analyses provide a rigorous and consistent framework for defining and quantifying caution within our behavioral procedures.

      Importantly, mice exhibit cautious responding as defined here across different tasks, making it more informative to classify avoidance responses by behavioral mode rather than by task alone. Accordingly, in the miniscope, single-neuron, and mixed-effects model analyses, we classified active avoids into distinct modes reflecting varying levels of caution. Although these modes covary with task contingencies, their explicit classification improves model predictability and interpretability with respect to cautious responding.

      (2) Related to the last, I'm struggling to understand the rationale for dividing cells into 'types' based on their physiological responses in some experiments (e.g., Figure 7).

      This section has now been expanded into 3 figures (Fig. 7-9) with new modeling approaches that should make the rationale more straightforward.

      By emphasizing the mixed-effects modeling results and integrating these analyses directly into the figures, the revised manuscript now more clearly delineates what is encoded at the population and single-neuron levels. Including movement and baseline covariates allowed us to dissociate motor-related modulation from other neural signals, substantially clarifying the distinction between movement encoding and other task-related variables, which we focus on in the paper. These analyses confirm the strong role of the STN in representing movement while revealing additional signals related to aversive stimulation and cautious responding that persist after accounting for motor effects. These signals arise from distinct neuronal populations that can be differentiated by their movement sensitivity and activation patterns across avoidance modes, reflecting varying levels of caution. At the same time, several effects that initially reflected orienting-related activity at CS-onset (note that our movement tracking captures both head position and orientation as a directional vector) dissipated once movement and baseline covariates were included in the models, emphasizing the utility of the analytical improvements in the revision.

      (3) The description and discussion of orienting head movements were not well supported, but were much discussed in the avoidance datasets. The initial speed peaks to cue seem to be the supporting data upon which these claims rest, but nothing here suggests head movement or orientation responses.

      As described in the methods (and noted above), we track the head and decompose the movement into rotational and translational components. With the new approach, several effects that initially reflected orienting-related activity at CS-onset (note that our movement tracking captures both head position and orientation as a directional vector) dissipated once movement and baseline covariates were included in the models, emphasizing the utility of the analytical improvements in the revision.

      (4) Similar to the last, the authors note in several places, including abstract, the importance of STN in response timing, i.e., particularly when there must be careful or precise timing, but I don't think their data or task design provides a strong basis for this claim.

      The avoidance modes and the measured latencies directly support the relation to action timing. The portion of the previous manuscript on optogenetic excitation, apparently the main source of this criticism, is no longer part of the present study.

      (5) I think that other reports show that STN calcium activity is recruited by inescapable foot shock as well. What do these authors see? Is shock, independent of movement, contributing to sharp signals during escapes?

      The question, “Is shock, independent of movement, contributing to sharp signals during escapes?” is now directly addressed in the revised analyses. By incorporating movement and baseline covariates into the mixed-effects models, we dissociate STN activity related to aversive stimulation from that associated with motor output. The results show that shock-evoked STN activation persists even after controlling for movement within defined neuronal populations, supporting a specific nociceptive contribution independent of motor dynamics—a dissociation that appears to be new in this field.

      (6) In particular, and related to the last point, the following work is very relevant and should be cited:  Note that the focus of this other paper is on a subset of VGLUT2+ Tac1 neurons in paraSTN, but using VGLUT2-Cre to target STN will target both STN and paraSTN.

      We appreciate the reviewer’s reference to the recent preprint highlighting the role of the para-subthalamic nucleus in avoidance learning. However, our study focused specifically on performance in well-trained mice rather than on learning processes. Behavioral learning is inherently more variable and can be disrupted by less specific manipulations, whereas our experiments targeted the stable execution of learned avoidance behaviors. Future work will extend these findings to the learning phase and examine potential contributions of subthalamic subdivisions, which our current Vglut2-based manipulations do not dissociate. We will consider this and related work more closely in those studies.

      (7) In multiple other instances, claims that were more tangential to the main claims were made without clearly supporting data or statistics. E.g., claim that STN activation is related to translational more than rotational movement; claim that GCaMP and movement responses to auditory cues were small; claims that 'some animals' responded differently without showing individual data.

      We have adjusted the text accordingly.

      (8) In several figures, the number of subjects used was not described. This is necessary. Also necessary is some assessment of the variability across subjects. The only measure of error shown in many figures relates to trial-to-trial or event variability, which is minimal because, in many cases, it appears that hundreds of trials may have been averaged per animal, but this doesn't provide a strong view of biological variability. When bar/line plots are used to display data, I recommend showing individual animals where feasible.

      All experiments report number of mice and sessions. Wherever feasible, we display individual data points (e.g., Figures 1 and 2) to convey variability directly. However, in cases where figures depict hundreds of paired (repeated-measures) data points, showing all points without connecting them would not be appropriate, while linking them would make the figures visually cluttered and uninterpretable. All plots and traces include measures of variability (SEM), and the raw data will be shared on Dryad. When error bars are not visible, they are smaller than the trace thickness or bar line—for example, in Figure 5B, the black circles and orange triangles include error bars, but they are smaller than the symbol size.

      Also, to minimize visual clutter, only a subset of relevant comparisons is highlighted with asterisks, whereas all relevant statistical results, comparisons, and mouse/session numbers are fully reported in the Results section, with statistical analyses accounting for the clustering of data within subjects and sessions.

      (9) Can the authors consider the extent to which calcium imaging may be better suited to identify increases compared to decreases and how this may affect the results, particularly related to the GRIN data when similar numbers of cells show responses in both directions (e.g., Figure 3)?

      This is an interesting issue related to a widely used technique beyond the scope of our study.

      (10) Raw example traces are not provided.

      We do not think raw traces are useful here. All figures contain average traces to reflect the activity of the estimated population.

      (11) The timeline of the spontaneous movement and avoidance sessions was not clear, nor was the number of events or sessions per animal nor how this was set. It is not clear if there was pre-training or habituation, if many or variable sessions were combined per animal, or what the time gaps between sessions were, or if or how any of these parameters might influence interpretation of the results.

      We have enhanced the description of the sessions, including the number of animals and sessions, which are daily and always equal per animal in each group of experiments. As noted, the sessions are part of the random effects in the model.

      (12) It is not clear if or how the spread of expression outside of the target STN was evaluated, and if or how many mice were excluded due to spread or fiber placements.

      The STN is anatomically well-confined, with its borders and the overlying zona incerta (composed of GABAergic neurons) providing protection against off-target expression in most neighboring forebrain regions. All viral injections were histologically verified and did not extend into thalamic or hypothalamic areas. As described in the Methods, we employed an app we developed (Brain Atlas Analyzer, available on OriginLab) that aligns serial histological sections with the Allen Brain Atlas to precisely assess viral spread and confirm targeting accuracy. The experiments included in the revised manuscript now focus on optogenetic inhibition and irreversible lesion approaches, three complementary methods that consistently targeted the STN and yielded similar behavioral effects.

      Recommendations for the authors:

      Reviewing Editor Comments:

      The primary feedback agreed upon by all the reviewers was that the manuscript requires significant streamlining as it is currently overly long and convoluted.

      We thank the reviewers and editors for their thoughtful and constructive feedback. In response to the primary comment that “the manuscript requires significant streamlining as it is currently overly long and convoluted,” we have substantially revised and refocused the paper. Specifically, we streamlined the included data and enhanced the analyses to emphasize the central findings: the encoding of movement, cautious responding, and punishment in the STN during avoidance behavior. We also focused the causal component of the study by including only the loss-of-function experiments—both optogenetic inhibition and irreversible viral/electrolytic lesions—that establish the critical role of STN circuits in generating active avoidance. Together, these revisions enhance clarity, tighten the narrative focus, and align the manuscript more closely with the reviewers’ recommendations.

      Major revisions include the addition of mixed-effects modeling to dissociate the contributions of movement from other STN-encoded signals related to caution and punishment. This modeling approach allowed us to reveal that these components are statistically separable, demonstrating that movement, cautious responding, and aversive input are encoded by neuronal subsets. To streamline the manuscript and address reviewer concerns, we removed the optogenetic excitation experiments. As revised, the paper presents a more concise and cohesive narrative showing that STN neurons differentially encode movement, caution, and aversive stimuli, and that this circuitry is essential for generating active avoidance behavior.

      Many of the specific points raised by reviewers now fall outside the scope of the revised manuscript. This is primarily because the revised version omits data and analyses related to optogenetic excitation and associated control experiments. By removing these components, the paper now presents a streamlined and internally consistent dataset focused on how the STN encodes movement, cautious responding, and aversive outcomes during avoidance behavior, as well as on loss-of-function experiments demonstrating its necessity for generating active avoidance. Below, we address the points that remain relevant across reviews.

      Following extensive revisions, the current manuscript differs in several important ways from what the assessment describes:

      The description that the study “uses fiber photometry, implantable lenses, and optogenetics” is more accurately represented as using both fiber photometry and single-neuron calcium imaging with miniscopes, combined with optogenetic and irreversible lesion approaches.

      The phrase stating that “active but not passive avoidance depends in part on STN projections to substantia nigra” is better characterized as “STN projections to the midbrain,” since our data show that optogenetic inhibition of STN terminals in both the mesencephalic reticular tegmentum (MRT) and substantia nigra pars reticulata (SNr) produces equivalent effects, and thus these sites are combined in the study.

      Finally, the original concern that evidence for STN involvement in cautious responding or avoidance speed was incomplete no longer applies. The revised focus on encoding, through the inclusion of mixed-effects modeling, now dissociates movement-related, cautious, and aversive components of STN activity. By removing the optogenetic excitation data, we no longer claim that the STN controls caution but rather that it encodes cautious responding, alongside movement and punishment signals. Furthermore, loss-of-function experiments demonstrate that silencing STN output abolishes active avoidance entirely, supporting an essential role for the STN in generating goal-directed avoidance behavior—a behavioral domain that, unlike appetitive responding, is fundamentally defined by caution and the need to balance action timing under threat.

      Reviewer #2 (Recommendations for the authors):

      (1) Show individual data points on bar plots.

      Wherever feasible, we display individual data points (e.g., Figures 1 and 2) to convey variability directly. However, in cases where figures depict hundreds of paired (repeated-measures) data points, showing all points without connecting them would not be appropriate, while linking them would make the figures visually cluttered and uninterpretable. All plots and traces include measures of variability (SEM), and the raw data will be shared on Dryad. When error bars are not visible, they are smaller than the trace thickness or bar line; for example, in Figure 5B, the black circles and orange triangles include error bars, but they are smaller than the symbol size.

      Also, to minimize visual clutter, only a subset of relevant comparisons is highlighted with asterisks, whereas all relevant statistical results, comparisons, and mouse/session numbers are fully reported in the Results section, with statistical analyses accounting for the clustering of data within subjects and sessions.

      (2) The active avoidance experiments are confusing when they are introduced in the results section. More explanation of what paradigms were used and what each CS means at the time these are introduced would add clarity. For example, AA1, AA2, etc, are explained only with references to other papers, but a brief description of each protocol and a schematic figure would really help.

      The avoidance protocols (AA1–4) are now described briefly but clearly in the Results section (second paragraph of “STN neurons activate during goal-directed avoidance contingencies”) and in greater detail in the Methods section. As stated, these tasks were conducted sequentially, and mice underwent the same number of sessions per procedure, which are indicated. All relevant procedural information has been included in these sections. Mice underwent daily sessions and learnt these tasks within 1-2 sessions, progressing sequentially across tasks with an equal number of sessions per task (7 per task), and the resulting data were combined and clustered by mouse/session in the statistical models.

      (3) How do the Class 1, 2, 3 avoids relate to Class 1, 2, 3 neural types established in Figure 3? It seems like they are not related, and if that is the case, they should be named something different from each other to avoid confusion. (4) Similarly, having 3 different cell types (a,b,c) in the active avoidance seems unrelated to the original classification of cell types (1,2,3), and these are different for each class of avoid. This is very confusing, and it is unclear how any of these types relate to each other. Presumably, the same mouse has all three classes of avoids, so there are recordings from each cell during each type of avoid.

      The terms class, mode, and type are now clearly distinguished throughout the manuscript. Modes refer to distinct patterns of avoidance behavior that differ in the level of cautious responding (Mode 3 is most cautious). Within each mode, types denote subgroups of neurons identified based on their ΔF/F activity profiles. In contrast, classes categorize neurons according to their relationship to movement, determined by cross-correlation analyses between ΔF/F and head speed (Class 1-4; Fig. 7 is a new analysis) or head turns (Class A-C, renamed from 1-3). This updated terminology clarifies the analytic structure, highlighting distinct neuronal populations within each analysis. For example, during avoidance behaviors, these classifications distinguish neurons encoding movement-, caution-, and outcome-related signals. Comparisons are conducted within each analytical set, within classes (A-C or 1-4 separately), within avoidance modes, or within mode-specific neuronal types.
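
      As a rough illustration of the cross-correlation step mentioned above (a minimal sketch, not the authors' actual pipeline), the code below computes a normalized cross-correlation between a ΔF/F trace and a head-speed trace over a range of lags and assigns a label from the peak coefficient. The function names, the lag window, the 0.3 threshold, and the three labels are all hypothetical choices made for this example; the manuscript's Class 1-4 and Class A-C criteria are defined by the authors and are not reproduced here.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def cross_correlation(dff, speed, max_lag):
    """Return {lag: r}. A positive lag means dff follows speed by `lag` samples."""
    n = len(dff)
    out = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = dff[lag:], speed[:n - lag]   # pair dff[t + lag] with speed[t]
        else:
            a, b = dff[:n + lag], speed[-lag:]  # pair dff[t] with speed[t + |lag|]
        out[lag] = pearson(a, b)
    return out

def classify(dff, speed, max_lag=5, threshold=0.3):
    """Label a neuron by its peak cross-correlation (the threshold is an assumed value)."""
    cc = cross_correlation(dff, speed, max_lag)
    lag, r = max(cc.items(), key=lambda kv: abs(kv[1]))
    if r >= threshold:
        label = "positively related"
    elif r <= -threshold:
        label = "negatively related"
    else:
        label = "unrelated"
    return label, lag, r

# Toy data: the "neuron" copies head speed with a two-sample delay.
speed = [0, 0, 1, 3, 1, 0, 0, 0, 2, 4, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0]
dff = [0, 0] + speed[:-2]
label, lag, r = classify(dff, speed)
print(label, lag, round(r, 3))
```

      In this toy case the peak falls at a lag of two samples, reflecting the delay built into the synthetic trace. Real recordings would additionally call for a significance test against shuffled data and for the clustering by mouse and session that the statistical models account for.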

      …So the authors could compare one cell during each avoid and determine whether it relates to movement or sound, or something else. It is interesting that types a,b, and c have the exact same proportions in each class of avoid, and makes it important to investigate if these are the exact same cells or not.

      That previous table with the a,b,c % in the three figure panels was a placeholder, which was not updated in the included figure. It has now been correctly updated. They do not have the same proportions as shown in Fig. 9, although they are similar.

      Also, these mice could be recorded during the open field, so the original neural classification (class 1, 2,3) could be applied to these same cells, and then the authors can see whether each cell type defined in the open field has a different response to the different avoid types. As it stands, the paper simply finds that during movement and during avoidance behaviors, different cells in the STN do different things.

      We included a new analysis in Fig. 7 that classifies neurons based on the cross-correlation with movement. The inclusion of the models now clearly assigns variance to movement versus the other factors, and this analysis leads to the classification based on avoid modes. 

      (5) The use of the same colors to mean two different things in Figure 9 is confusing. AA1 vs AA2 shouldn't be the same colors as light-naïve vs light signaling CS.

      Optogenetic excitation is no longer part of the study.

      (6) The exact timeline of the optogenetics experiments should be presented as a schematic for understanding. It is not clear which conditions each mouse experienced in which order. This is critical to the interpretation of Figure 9 and the reduction of passive avoids during STN stimulation. Did these mice have the CS1+STN stimulation pairing or the STN+US pairing prior to this experiment? If they did, the stimulation of the STN could be strongly associated with either punishment or with the CS1 that predicts punishment. If that is the case, stimulating the STN during CS2 could be like presenting CS1+CS2 at the same time and could be confusing. The authors should make it clear whether the mice were naïve during this passive avoid experiment or whether they had experienced STN stimulation paired with anything prior to this experiment.

      Optogenetic excitation is no longer part of the study.

      (20) Similarly, the duration of the STN stimulation should be made clear on the plots that show behavior over time (e.g., Figure 9E).

      Optogenetic excitation is no longer part of the study.

      (21) There is just so much data and so many conditions for each experiment here. The paper is dense and difficult to read. It would really benefit readability if the authors put only the key experiments and key figure panels in the main text and moved much of the repetitive figure panels to supplemental figures. The addition of schematic drawings for behavioral experiment timing and for the different AA1, AA2, and AA3 conditions would also really improve clarity.

      We believe that focusing the study has substantially improved its clarity and readability.

      Reviewer #3 (Recommendations for the authors):

      (1) Minor error in results 'Cre-AAV in the STN of Vglut2-Cre'

      Fixed.

      (2) In some Figure 2 panels, the peaks appear to be cut off, and blue traces are obscured by red.

      In Fig. 2, the peaks of movement (speed) traces are intentionally truncated to emphasize the rising phase of the turn, which would otherwise be obscured if the full y-axis range were displayed (peaks and other measures are statistically compared). This adjustment enhances clarity without omitting essential detail and is now noted in the legend.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Artiushin et al. establish a comprehensive 3D atlas of the brain of the orb-web building spider Uloborus diversus. First, they use immunohistochemistry detection of synapsin to mark and reconstruct the neuropils of the brain of six specimens and they generate a standard brain by averaging these brains. Onto this standard 3D brain, they plot immunohistochemical stainings of major transmitters to detect cholinergic, serotonergic, octopaminergic/tyraminergic and GABAergic neurons, respectively. Further, they add information on the expression of a number of neuropeptides (Proctolin, AllatostatinA, CCAP, and FMRFamide). Based on this data and 3D reconstructions, they extensively describe the morphology of the entire synganglion, the discernible neuropils, and their neurotransmitter/neuromodulator content.

      Strengths:

      While 3D reconstruction of spider brains and the detection of some neuroactive substances have been published before, this seems to be the most comprehensive analysis so far, both in terms of the number of substances tested and the ambition to analyze the entire synganglion. Interestingly, besides the previously described neuropils, they detect a novel brain structure, which they call the tonsillar neuropil.

      Immunohistochemistry, imaging, and 3D reconstruction are convincingly done, and the data are extensively visualized in figures, schemes, and very useful films, which allow the reader to work with the data. Due to its comprehensiveness, this dataset will be a valuable reference for researchers working on spider brains or on the evolution of arthropod brains.

      Weaknesses:

      As expected for such a descriptive groundwork, new insights or hypotheses are limited, apart from the first description of the tonsillar neuropil. A more comprehensive labeling in the panels of the mentioned structures would help to follow the descriptions. The reconstruction of the main tracts of the brain would be a very valuable complementary piece of data.

      Reviewer #2 (Public review):

      Summary

      Artiushin et al. created the first three-dimensional atlas of a synganglion in the hackled orb-weaver spider, which is becoming a popular model for web-building behavior. Immunohistochemical analysis with an impressive array of antisera reveals subcompartments of neuroanatomical structures described in other spider species as well as two previously undescribed arachnid structures, the protocerebral bridge, hagstone, and paired tonsillar neuropils. The authors describe the spider's neuroanatomy in detail and discuss similarities and differences from other spider species. The final section of the discussion examines the homology between onychophoran and chelicerate arcuate bodies and mandibulate central bodies.

      Strengths

      The authors set out to create a detailed 3D atlas and accomplished this goal.

      Exceptional tissue clearing and imaging of the nervous system reveal the three-dimensional relationships between neuropils and some connectivity that would not be apparent in sectioned brains.

      A detailed anatomical description makes it easy to reference structures described between the text and figures.

      The authors used a large palette of antisera which may be investigated in future studies for function in the spider nervous system and may be compared across species.

      Weaknesses

      It would be useful for non-specialists if the authors would introduce each neuropil with some orientation about its function or what kind of input/output it receives, if this is known for other species. Especially those structures that are not described in other arthropods, like the opisthosomal neuropil. Are there implications for neuroanatomical findings in this paper on the understanding of how web-building behaviors are mediated by the brain?

      Likewise, where possible, it would be helpful to have some discussion of the implications of certain neurotransmitters/neuropeptides being enriched in different areas. For example, GABA would signal areas of inhibitory connections, such as inhibitory input to mushroom bodies, as described in other arthropods. In the discussion section on relationships between spider and insect midline neuropils, are there similarities in expression patterns between those described here and in insects?

      Reviewer #3 (Public review):

      Summary:

      This is an impressive paper that offers a much-needed 3D standardized brain atlas for the hackled-orb weaving spider Uloborus diversus, an emerging organism of study in neuroethology. The authors used a detailed immunohistological whole-mount staining method that allowed them to localize a wide range of common neurotransmitters and neuropeptides and map them on a common brain atlas. Through this approach, they discovered groups of cells that may form parts of neuropils that had not previously been described, such as the 'tonsillar neuropil', which might be part of a larger insect-like central complex. Further, this work provides unique insights into the previously underappreciated complexity of higher-order neuropils in spiders, particularly the arcuate body, and hints at a potentially important role for the mushroom bodies in vibratory processing for web-building spiders.

      Strengths:

      To understand brain function, data from many experiments on brain structure must be compiled to serve as a reference and foundation for future work. As demonstrated by the overwhelming success in genetically tractable laboratory animals, 3D standardized brain atlases are invaluable tools - especially as increasing amounts of data are obtained at the gross morphological, synaptic, and genetic levels, and as functional data from electrophysiology and imaging are integrated. Among 'non-model' organisms, such approaches have included global silver staining and confocal microscopy, MRI, and, more recently, micro-computed tomography (X-ray) scans used to image multiple brains and average them into a composite reference. In this study, the authors used synapsin immunoreactivity to generate an averaged spider brain as a scaffold for mapping immunoreactivity to other neuromodulators. Using this framework, they describe many previously known spider brain structures and also identify some previously undescribed regions. They argue that the arcuate body - a midline neuropil thought to have diverged evolutionarily from the insect central complex - shows structural similarities that may support its role in path integration and navigation.

      Having diverged from insects such as the fruit fly Drosophila melanogaster over 400 million years ago, spiders are an important group for study, particularly due to their elegant web-building behavior, which is thought to have contributed to their remarkable evolutionary success. How such exquisitely complex behavior is supported by a relatively small brain remains unclear. A rich tradition of spider neuroanatomy emerged in the previous century through the work of comparative zoologists, who used reduced silver and Golgi stains to reveal remarkable detail about gross neuroanatomy. Yet, these techniques cannot uncover the brain's neurochemical landscape, highlighting the need for more modern approaches, such as those employed in the present study.

      A key insight from this study involves two prominent higher-order neuropils of the protocerebrum: the arcuate body and the mushroom bodies. The authors show that the arcuate body has a more complex structure and lamination than previously recognized, suggesting it is insect central complex-like and may support functions such as path integration and navigation, which are critical during web building. They also report strong synapsin immunoreactivity in the mushroom bodies and speculate that these structures contribute to vibratory processing during sensory feedback, particularly in the context of web building and prey localization. These findings align with prior work that noted the complex architecture of both neuropils in spiders and their resemblance (and in some cases greater complexity) compared to their insect counterparts. Additionally, the authors describe previously unrecognized neuropils, such as the 'tonsillar neuropil,' whose function remains unknown but may belong to a larger central complex. The diverse patterns of neuromodulator immunoreactivity further suggest that plasticity plays a substantial role in central circuits.

      Weaknesses:

      My major concern, however, is that some of the authors' neuroanatomical descriptions rely too heavily on inference rather than what is currently resolvable from their immunohistochemistry stains alone.

      We would like to thank the reviewers for their time and effort in carefully reading our manuscript and providing helpful feedback, and particularly for their appreciation and realistic understanding of the scope of this study and its context within the existing spider neuroanatomical literature.

      Regarding the limitations and potential additions to this study, we believe these to be well-reasoned and are in agreement. We plan to address some of these shortcomings in future publications.

      As multiple reviewers remarked, a mapping of the major tracts of the brain would be a welcome addition to understanding the neuroanatomy of U. diversus. This is something which we are actively working on and hope to provide in a forthcoming publication. Given the length of this paper as is, we considered that a treatment of the tracts would be better presented in an additional paper. Likewise, mapping of the immunoreactive somata of the currently investigated targets is a component which we would like to describe as part of a separate paper, keeping the focus of the current one on neuropils, in order to leverage our aligned volumes to describe co-expression patterns, which is not as useful for the more widely dispersed somata. Furthermore, while we often see somata through immunostaining, the presence and intensity of the signal is variable among immunoreactive populations. We are finding that these populations are more consistently and comprehensively revealed through fluorescent in situ hybridization.

      We appreciate the desire of the reviewers for further information regarding the connectivity and function of the described neuropils, and where possible we have added statements and references. That being said, where this context remains sparse, it is largely a reflection of the lack of information in the literature. This is particularly the case for functional roles for spider neuropils, especially higher order ones of the protocerebrum, which are essentially unexamined. As summarized in the quite recent update to Foelix’s Spider Neuroanatomy, a functional understanding for protocerebral neuropil is really only available for the visual pathway. Consequently, it is also difficult to speak of the implications of the presence or absence of particular signaling elements in these neuropils when no further information about the circuitry or behavioral correlates is available. Finally, multiple reviewers suggested that it might be worthwhile to explore a comparison of the arcuate body layer innervation to that of the central bodies of insects, of which there is a richer literature. This is an idea which we were also initially attracted to, and have now added some lines to the discussion section. Our position on this is a cautious one, as a series of more recent comparative studies spanning many insect species and using the same antibody reveals a considerable amount of variation in central body layering even within this clade, which has given us pause in interpreting how substantive similarities and differences to the far more distant spiders would be. Still, this is an interesting avenue which merits an eventual comprehensive analysis, one which would certainly benefit from having additional examples from more spider species, in order to not overstate conclusions based on the currently limited neuroanatomical representation.

      Given our framing for the impetus to advance neuroanatomical knowledge in orb-web builders, the question of whether the present findings inform the circuitry controlling web-building is one that naturally follows. With this dataset alone we are unable to define which brain areas mediate web-building, something which would likely be beyond any anatomical dataset lacking complementary functional data. However, the process of assembling the atlas has revealed structures and defined innervation patterns in previously ambiguous sectors of the spider brain, particularly in the protocerebrum. A simplistic proposal is that such regions, which are more conspicuous by our techniques and in this model species, would be good candidates for further inquiries into web-building circuitry, as their absence or oversight in past work could be attributable to the different behavioral styles of those model species. Regardless, the fact that such a hypothesis cannot be readily refuted by the existing neuroanatomical literature underscores the need for more finely refined models of the spider brain, to which we hope we have positively contributed. We are gratified by the reviewers’ enthusiasm for the strengths of this study.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Brenneis 2022 has done a very nice and comprehensive study focused on the visual system - this might be worth including.

      Thank you, we have included this reference on Line 34.

      (2) L 29: When talking about "connectivity maps", the emerging connectomes based on EM data could be mentioned.

      Additional references have been added, thank you. Line 35.

      (3) L 99: Please mention that you are going to describe the brain from ventral to dorsal.

      Thank you, we have added a comment to Line 99.

      (4) L 13: is found at the posterior.

      Thank you, revised.

      (5) L 168: How did you pick those two proctolin+ somata, given that there is a lot of additional punctate signal?

      Although not visible in this image, if you scroll through the stack there is a neurite which extends from these neurons directly to this area of pronounced immunoreactivity.

      (6) Figure 1: Please add the names of the neuropils you go through afterwards.

      We have added labels for neuropils which are recognizable externally.

      (7) Figure 1 and Figure 5: Please mark the esophagus.

      Label has now been added to Figure 1. In Figure 5, the esophagus should not really be visible because these planes are just ventral to its closure.

      (8) Figure 5A: I did not see any CCAP signal where the arrow points to; same for 5B (ChAT).

      In hindsight, the CCAP point is probably too minor to be worth mentioning, so we have removed it.

      The ChAT signal pattern in 5B has been reinforced by adding a dashed circle to show its location as well.

      (9) L 249: Could the circular spot also be a tract (many tracts lack synapsin - at least in insects)?

      Yes, thank you for pointing this out; the sentence is revised (L274). We are currently further analyzing anti-tubulin volumes and it seems that indeed there are tracts which occupy these synapsin-negative spaces, although interestingly they do not tend to account for the entire space.

      (10) L 302: Help me see the "conspicuous" thing.

      Brace added to Fig. 8B, note in caption.

      (11) L 315: Please first introduce the number of the eyes and how these relate to 1° and 2° pathway. Are these separate pathways from separate eyes or two relay stations of one visual pathway?

      We have expanded the introduction to this section (L336). Yes, these are considered as two separate visual pathways, with a typical segregation of which eyes contribute to which pathway – although there is evidence for species-specific differences in these contributions. In the context of this atlas, we are not currently able to follow which eyes are innervating which pathway.

      (12) L 343: It seems that the tonsillar neuropil could be midline spanning (at least this is how I interpret the signal across the midline). Would it make sense to re-formulate from a paired structure to midline-spanning? Would that make it another option for being a central complex homolog?

      In the spectrum from totally midline spanning and unpaired (e.g., arcuate body (at least in adults)) to almost fully distinct and paired (e.g., mushroom bodies (although even here there is a midline spanning ‘bridge’)), we view the tonsillar to be more paired due to the oval components, although it does have a midline spanning section, particularly unambiguous just posterior to the oval sections.

      Regarding central complex homology, if the suggestion is that the tonsillar with its midline spanning component could represent the entire central complex, then this is a possibility, but it would neglect the highly innervated and layered arcuate body, which we think represents a stronger contender, at least as a component of the central complex. For this reason, we would still be partial to the possibility that the tonsillar is a part of the central complex, but not the entire complex.

      (13) L 407: ...and dorsal (..) lobe...

      Added the word ‘lobe’ to this sentence (L429).

      (14) L 620ff: Maybe mention the role of MBs in learning and memory.

      A reference has been added at L661.

      (15) L 644: In the context of arcuate body homology with the central body, I was missing a discussion of the neurotransmitters expressed in the respective parts in insects. Would that provide additional arguments?

      This is an interesting comparison to explore, and is one that we initially considered making as well. There are certainly commonalities that one could point to, particularly in trying to build the case of whether particular lobes of the arcuate body are similar to the fan-shaped or ellipsoid bodies in insects. Nevertheless, something which has given us pause is studying the more recent comparative works between insect species (Timm et al., 2021, J Comp Neuro, Homberg et al., 2023, J Comp Neuro), which also reveal a fair degree of heterogeneity in expression patterns between species – and this is despite the fact that the neuropils are unambiguously homologous. When comparing to a much more evolutionarily distant organism such as the spider, it becomes less clear which extant species should serve as the best point of comparison, and therefore we fear making specious arguments by focusing on similarities when there are also many differences. We have added some of these comments to the discussion (L699-725).

      Throughout the text, I frequently had difficulty finding the structures mentioned in the text in the panels right away. It would help to number the panels (e.g., 6Ai, Aii, Aiii, etc.) and refer to those in the text. Further, all structures mentioned in the text should be labelled with arrows/arrowheads unless they are unequivocally identified in the panel.

      Thank you for the suggestion. We have adopted the additional numbering scheme for panels, and added additional markers where suggested.

      Reviewer #2 (Recommendations for the authors):

      (1) L 18: "neurotransmitter" should be pluralized.

      Thank you, revised (L18).

      (2) L 55: Missing the word "the" before "U. diversus".

      Thank you, revised (L57).

      (3) L 179: Change synaptic dense to "synapse-dense".

      Thank you, revised (L189).

      (4) L 570: "present in" would be clearer than "presented on in".

      Our intention here was to say that Loesel et al did not show slices from the subesophageal mass for CCAP, so it was ambiguous as to whether it had immunoreactivity there but they simply did not present it, or if it indeed doesn’t show signal in the subesophageal. But agreed, this is awkward phrasing which has been revised (L606-608), thank you.

      (5) L 641: It would be worth noting that the upper and lower central bodies are referred to as the fan-shaped and ellipsoid bodies in many insects.

      Thank you, this has been added in L694.

      (6) L 642: Although cited here regarding insect central body layers, Strausfeld et al. 2006 mainly describe the onychophoran brain and the evolutionary relationship between the onychophoran and chelicerate arcuate bodies. The phylogenetic relationships described here would strengthen the discussion in the section titled "A spider central complex?"

      The phylogenetic relationship of onychophorans and chelicerates remains controversial and therefore we find it tricky to use this point to advance the argument in that discussion section, as one could make opposing arguments. The homology of the arcuate body (between chelicerates, onychophorans, and mandibulates) has likewise been argued over, with this Strausfeld et al paper offering one perspective, while others are more permissive (good summary at end of Doeffinger et al., 2010). Our thought was simply to draw attention to grossly similar protocerebral neuropils in examples from distantly related arthropods, without taking a stance, as our data doesn’t really deeply advance one view over the other.

      (7) L 701- Noduli have been described in stomatopods (Thoen et al., Front. Behav. Neurosci., 2017).

      This is an important addition, thank you – it has been incorporated and cited (L766).

      (8) Antisera against DC0 (PKA-C alpha) may distinguish globuli cells from other soma surrounding the mushroom bodies, but this may be accomplished in future studies.

      Agreed, this is something we have been interested in, but have not yet acquired the antibody.

      Reviewer #3 (Recommendations for the authors):

      Overall, this paper is both timely and important. However, it may face some resistance from classically trained arthropod neuroanatomists due to the authors' reliance on immunohistochemistry alone. A method to visualize fiber tracts and neuropil morphology would have been a valuable and grounding complement to the dataset and can be added in future publications. Tract-tracing methods (e.g., dextran injections) would strengthen certain claims about connectivity - particularly those concerning the mushroom bodies. For delineating putative cell populations across regions, fluorescence in situ hybridization for key transcripts would offer convincing evidence, especially in the context of the arcuate body, the tonsillar neuropil, and proposed homologies to the insect central complex.

      That said, the dataset remains rich and valuable. Outlined below are a number of issues the authors may wish to address. Most are relatively minor, but a few require further clarification.

      (1) Abstract

      (a) L 12-14: The authors should frame their work as a novel contribution to our understanding of the spider brain, rather than solely as a tool or stepping stone for future studies. The opening sentences currently undersell the significance of the study.

      Thank you for your encouragement! We have revised the abstract.

      (b) Rather than touting "first of its kind" in the abstract, state what was learned from this.

      Thank you, we have revised the abstract.

      (c) The abstract does not mention the major results of the study. It should state which brain regions were found. It should list all of the peptides and transmitters that were tested so that they can be discoverable in searches.

      Thank you, revised.

      (2) Introduction

      (a) L 38: There's a more updated reference for Long (2016): Long, S. M. (2021). Variations on a theme: Morphological variation in the secondary eye visual pathway across the order of Araneae. Journal of Comparative Neurology, 529(2), 259-280.

      Thank you, this has been updated (L41 and elsewhere).

      (b) L 47: While whole-mount imaging offers some benefits, a downside is the need for complete brain dissection from the cuticle, which in spiders likely damages superficial structures (such as the secondary eye pathways).

      True – we have added this caveat to the section (L48-51).

      (c) L 49-52: If making this claim, more explicit comparisons with non-web-building C. salei in terms of neuropil presence, volume, or density later in the paper would be useful.

      We do not have the data on hand to make measured comparisons of C. salei structures, and the neuropils identified in this study are not clearly identifiable in the slices provided in the literature, so would likely require new sample preparations. We’ve removed the reference to proportionality and softened this sentence slightly – we are not trying to make a strong claim, but simply state that this is a possibility.

      (3) Results

      (a) The authors should state how they accounted for autofluorescence.

      While we did not explicitly test for autofluorescence, the long process of establishing a working whole-mount immuno protocol and testing antibodies produced many examples of treated brains which did not show any substantial signal.  We have added a note to the methods section (L866).

      (b) L 69: There is some controversy in delineating the subesophageal and supraesophageal mass as the two major divisions despite its ubiquity in the literature. It might be safer to delineate the protocerebrum, deutocerebrum, and fused postoral ganglia (including the pedipalp ganglion) instead.

      Thank you for this insight. We have modified the section, section headings, and Figure 1 to account for this delineation as well. We have chosen to include both ways of describing the synganglion, in order to maintain a parallel with the past literature and to be more accessible to non-specialist readers (L73-77).

      (c) L 90: It might be useful to include a justification for the use of these particular neuropeptides.

      Thank you, revised (L97-99).

      (d) L 106 - 108: It is stated that the innervation pattern of the leg neuropils is generally consistent, but from Figure 2, it seems that there are differences. The density of 5HT, Proctolin, ChAT, and FMRFamide seems to be higher in the posterior legs. AstA seems to have a broader distribution in L1 and is absent in L4.

      We would still stand by the generalization that the innervation pattern is fairly similar for each leg. The L1 neuropils tend to be bigger than the posterior legs, which might explain the difference in density. Another important aspect to keep in mind is that not all of the leg neuropils appear at the exact same imaging plane as we move from ventral to dorsal. If you scroll through the synapsin stack (ventral to dorsal), you will see that L2 and L3 appear first, followed shortly by L1, and then L4, and at the dorsal end of the subesophageal they disappear in the opposite order. The observations listed here are true for the single z-plane in Figure 2, but the fact that they don’t appear at the same time seems to mainly account for these differences. For example, if you scroll further ventrally in the AstA volume, you will see a very similar innervation appear in L4 as well, even though it is absent in the Fig. 2 plane. We plan to have these individual volumes available from a repository so that they can be individually examined to better see the signal at all levels. At the moment, the entire repository can be accessed here: https://doi.org/10.35077/ace-moo-far.

      (e) Figure 1 and elsewhere: The axes for the posterior and lateral views show Lateral and Medial. It would be more accurate to label them Left and Right, because the current labels do not define the medial-to-lateral axis. The medial direction is correct for only one hemiganglion, and it's the opposite for the contralateral side.

      Thank you, revised.

      (f) In Figures that show particular sections, it might be helpful to include a plane in the standard brain to illustrate where that section is.

      Yes, we agree and it was our original intention. It is something we can attempt to do, but there is not much room in the corners of many of the synapsin panels, making it harder to make the 3D representation big enough to be clear.

      (g) Figure 2, 3: Presenting the z-section stack separately in B and C is awkward because it makes it seem that they are unrelated. I think it would be better to display the z160-190 directly above its corresponding z230-260 for each of the exemplars in B and C. Since there's no left-right asymmetry, a hemibrain could be shown for all examples as was done for TH in D. It's not clear why TH was presented differently.

      Thank you for this suggestion. We rearranged the figure as described, but ultimately still found the original layout to be preferable, in part because the labelling becomes too cramped. We hope that the potential confusion of the continuity of the B and C sections will be mitigated by focusing on the z plane labels and overall shape – which should suggest that the planes are not far from each other. We trust that the form of the leg neuropils is recognizable in both B and C synapsin images, and so readers will make the connection.

      Regarding TH, this panel is apart from the rest because we were unable to register the TH volume to the standard brain because the variant of the protocol which produced good anti-TH staining conflicted with synapsin, and we could not simultaneously have adequate penetration of the synapsin signal. We did not want to align the TH panel with the others to avoid potential confusion that this was a view from the same z-plane of a registered volume, as the others are. We have added a note to the figure caption.

      (h) The locations of the labels should be consistent. The antisera are below the images in Figure 2, above in Figure 3, and to the bottom left in Figure 5. The slices are shown above in Figure 2 and below in Figure 3.

      Thank you, this has been revised for better consistency.

      (i) It is surprising to me that there is no mention of the neuronal somata visible in Figure 2 and Figure 3. A typical mapping of the brain would map the locations of the neurons, not just the neuropils.

      Our first arrangement of this paper described each immunostain individually from ventral to dorsal, including locations of the immunoreactive somata which could be observed. To aid the flow of the paper and leverage the aligned volumes to emphasize co-expression in the functional divisions of the brain, we re-formulated to this current layout which is organized around neuropils. Somata locations are tricky to incorporate in this format of the paper which focuses on key z-planes or tight max projections, because the relevant immunoreactive somata are more dispersed throughout the synganglion, not always overlapping in neighboring z-planes. Further, since only a minority of the antisera we used can reveal traceable projections from the supplying somata in the whole-mount preparation, we would be quite limited in the degree to which we could integrate the specific somata mapping with expression patterns in the neuropil. Finally, compared to immuno, which can be variable in staining intensity between somata for the same target, we find that FISH reveals these locations more clearly and comprehensively – so while we agree that this mapping would also be useful for the atlas, we would like to better provide this information in a future publication using whole-mount FISH.

      (j) L 139: There is a reference to a "brace" in Figure 3B, which does not seem to exist. There's one in Figure 3C.

      There is a smaller brace near the bottom of the TDC2 panel in Fig. 3B.

      (k) L 151 should be "3D".

      Thank you, revised (L160).

      (l) Figure 4C: It is not mentioned in the legend that the bottom inset is Proctolin without synapsin.

      Thank you, revised (L1213).

      (m) L 199: Are the authors sure this subdivision is solely on the anterior-posterior axis? Could it also be dorsal ventral? (i.e., could this be an artifact of the protocerebrum and deutocerebrum?)

      Yes, this division can be appreciated to extend somewhat in the dorsal-ventral axis and it is possible that this is the protocerebrum emerging after the deutocerebrum, although this area is largely dorsal to the obvious part of the deutocerebrum. In the horizontal planes there appears to be a boundary line which we use for this subdivision in order to assist in better describing features within this generally ventral part of the protocerebrum – referred to as “stalk” because it is thinner before the protocerebrum expands in size, dorsally. Our intention was more organizational, and as stated in the text, this area is likely heterogenous and we are not suggesting that it has a unified function, so being a visual artifact would not be excluded.

      (n) L 249: Could it also indicate large tracts projecting elsewhere?

      Yes, definitely, we have evidence that part of the space is occupied by tracts. Revised, thank you (L262).

      (o) L 281: Several investigators, including Long (2021), noted very large and robust mushroom bodies of Nephila.

      Thank you – the point is well taken that there are examples of orb-web builders that do have appreciable mushroom bodies. We have added a note in this section (L295), giving the examples of Deinopis spinosa and Argiope trifasciata (Figure 4.20 and 4.22 in Long, 2016).

      It looks like these species make the point better than Nephila, as Long lists the mushroom body percentage of total protocerebral volume for D. spinosa as 4.18%, for A. trifasciata as 2.38%, but doesn’t give a percentage for Nephila clavipes (Figure 4.24) and only labels the mushroom body structures as “possible” in the figure.

      In Long (2021), Nephilidae is described as follows: “In Nephilidae, I found what could be greatly reduced medullae at the caudal end of the laminae, as well as a structure that has many physical hallmarks of reduced mushroom bodies”

      (p) L 324: If the authors were able to stain for histamine or supplement this work with a different dissection technique for the dorsal structures, the visual pathways might have been apparent, which seems like a very important set of neuropils to include in a complete brain atlas.

      Yes, for this reason histamine has been an interesting target which we have attempted to visualize, but unfortunately have not yet been able to successfully stain for in U. diversus. An additional complication is that the antibodies we have seen call for glutaraldehyde fixation, which may make them incompatible with our approach to producing robust synapsin staining throughout the brain. 

      We agree that the lack of the complete visual pathway is a substantial weakness of our preparation, and should be amended in future work, but this will likely require developing a modified approach in order to preserve these delicate structures in U. diversus.

      (q) L 331: Is this bulbous shape neuropil, or just the remains of neuropil that were not fully torn away during dissection?

      This certainly is a severed part of the primary pathway, although it seems more likely that the bulbous shape is indicative of a neuropil form, rather than just being a happenstance shape that occurred during the breakage. We have examples where the same bulbous shape appears on both sides, and in different brains. It is possible that this may be the principal eye lamina – although we did not see co-staining with expected markers in examples where it did appear, so cannot be sure.

      (r) L 354: Is tyraminergic co-staining with the protocerebral bridge enough evidence to speculate that inputs are being supplied?

      We agree that this is not compelling, and have removed the statement.

      (s) L 372: This whole structure appears to be a previously described structure in spiders, the 'protocerebral commissure'.

      We are reasonably sure that what we are calling the PCB is a distinct structure from the protocerebral commissure (PCC). In Babu and Barth’s (1984) horizontal slice (Fig. 11b), you can see the protocerebral commissure immediately adjacent to the mushroom body bridge. It is found similarly located in other species, as can be seen in the supplementary 3D files provided by Steinhoff et al. (2024).

      While not visible with synapsin in U. diversus, we likewise can make out a commissure in this area in close proximity to the mushroom body bridge using tubulin staining. What we are calling the protocerebral bridge is a structure which is much more dorsal to the protocerebral commissure, not appearing in the same planes as the MB bridge.

      (t) L 377: Do you have an intuition why the tonsillar neuropil and the protocerebral bridge would show limited immunoreactivity, while the arcuate body's is quite extensive?

      This is an interesting question. Given the degree of interconnection and the fact that multiple classes of neurons in insects will innervate the central body as well as the PCB or noduli, perhaps it would be expected that expression in the tonsillar and protocerebral bridge should be commensurate to the innervation by that particular neurotransmitter-expressing population in the arcuate body. Apart from the fact that the arcuate body is just bigger, perhaps this points to a greater role of the arcuate body for integration, whereas the tonsillar and PCB may engage in more particular processing, or be limited to certain sensory modalities.

      Interestingly, it seems that this pattern of more limited immunoreactivity in the PCB and noduli compared with the central bodies (fan-shaped/ellipsoid) also appears in insects (Kahsai et al., 2010, J Comp Neuro, Timm et al., 2021, J Comp Neuro, Homberg et al., 2023, J Comp Neuro) – particularly, with almost every target having at least some layering in the fan-shaped body (Kahsai et al., 2010, J Comp Neuro).  For example, serotoninergic innervation is fairly consistently seen in the upper and lower central bodies across insects, but its presence in the PCB or noduli is more variable – appearing in one or the other in a species-dependent manner (Homberg et al., 2023, J Comp Neuro).

      (4) Discussion

      (a) L 556: But if confocal images from slices are aligned, is the 3D shape not preserved?

      Yes, fair enough – the point we wanted to make was that there is still a limitation in z resolution depending on the thickness of the slices used, which could obscure structures, but perhaps this is too minor of a comment.

      (b) L 597: This is a very interesting result. I agree it's likely to do with the processing of mechanosensory information relevant to web activities, and the mushroom body seems like the perfect candidate for this.

      (c) L 638: Worth noting that neuropil volume vs density of synapses might play a role in this, as the literature is currently a bit ambiguous with regards to the former.

      Thank you, noted (L689).

      (d) L 651: The latter seems far more plausible.

      Agreed, though the presence of mushroom bodies appears to be variable in spiders, so we didn’t want to take a strong stance here.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Review #1 (Public review):

      Figures 1 through 4 contain data that largely recapitulate published findings (Fulton et al., 2015; Lee et al., 2024; Swee et al., 2016; Dong et al., 2021); it is noted that there is value in confirming phenotypic differences between naive CD5lo and CD5hi CD8 T cells in the NOD background. It is important to contextualize the data while being wary of making parallels with results obtained from CD5lo and CD5hi CD4 T cells. There should also be additional attention paid to the wording in the text describing the data (e.g., the authors assert that, in Figure 4C, the “CD5hi group exhibited higher percentages of CD8+ T cells producing TNF-α, IFN-γ and IL-2” though there is no difference in IL-2 nor consistent differences in TNF-α between the CD5lo and CD5hi populations).

      CD5<sup>hi</sup>CD8<sup>+</sup> and CD5<sup>lo</sup>CD8<sup>+</sup> T cells have been previously characterized in other genetic backgrounds. In our study, we aimed to confirm and extend these observations specifically in the autoimmune-prone NOD background, which had not been systematically addressed. Additionally, we carefully reviewed the text describing Figure 4C and revised the wording to accurately reflect the observed data (lines 263–264). Specifically, we now state that the CD5<sup>hi</sup> group exhibited higher levels of IFN-γ and a trend toward increased TNF-α, while IL-2 production did not show a significant difference.

      The comparison of CD5 across thymocyte populations is cautioned due to variation in developmental stages, particularly in transgenic models. The reported differences may reflect maturation stages rather than self-reactivity.

      We appreciate the reviewer’s important point regarding the interpretation of CD5 levels across thymocyte subsets. In our revised manuscript (lines 455–471), we have added clarification that CD5 expression in DN and DP subsets reflects pre-TCR and TCR signaling events during thymic development. We also acknowledge that differences in maturation stages, especially in the NOD8.3 transgenic model, may influence CD5 expression. We now discuss this caveat and interpret our results with caution, emphasizing in particular that our data support, but do not conclusively establish, differential self-reactivity between these subsets.

      The conclusion that PTPN22 overexpression does not inhibit the diabetogenic potential of CD5<sup>hi</sup>CD8<sup>+</sup> T cells is potentially confounded by differences between polyclonal and TCR transgenic systems.

      We thank the reviewer for raising this concern. We acknowledge that this system introduces confounders due to differences in precursor frequencies and clonal expansion compared to polyclonal repertoires. These differences may affect the responsiveness to phosphatase-mediated attenuation of signaling. Therefore, while our results support that high-affinity autoreactive CD8<sup>+</sup> T cells may be less sensitive to PTPN22 overexpression, we do not claim that this finding generalizes to all autoreactive CD8<sup>+</sup> T cells. Rather, it highlights a potential limitation of peripheral tolerance in T cells with strong intrinsic self-reactivity.

      TCR sequencing data shows variability; is this representative of the overall repertoire?

      We appreciate the reviewer’s comment. We acknowledge that data from bulk TCR sequencing has potential limitations, including variability across experiments and limited resolution at the clonotype level. To improve representativeness and reduce sampling bias, we performed TCR repertoire analysis in two independent experiments. In each experiment, naïve CD5<sup>hi</sup> CD8<sup>+</sup> and CD5<sup>lo</sup>CD8<sup>+</sup> T cells were sorted from pooled peripheral lymph nodes of at least 20 individual NOD mice per group. This approach allowed us to capture a broader range of clonotypes and ensured that the resulting repertoire profiles reflect the characteristics of the overall CD5<sup>hi</sup> and CD5<sup>lo</sup> populations, rather than isolated outliers. Despite some variability, we observed consistent trends in key features, such as shorter CDR3β length, altered TRAV/TRBV usage and reduced diversity in the CD5<sup>hi</sup> subset across both experiments. To enhance resolution and directly assess clonotype-specific reactivity, we plan to perform single-cell RNA and TCR sequencing in future studies, as noted in the revised Discussion (lines 466–471).

      Clarifications are requested regarding naive gating, controls, gMFI reporting, and missing methods.

      We thank the reviewer for these specific suggestions. We have revised figure legends to better describe gating strategies and included appropriate controls in Figures or Supplementary Figures. Regarding gMFI reporting, we have now shown in the figure legends whether values are reported as gMFI. Additionally, we have added the missing methods for cytokine staining, EdU incorporation, overlapped count matrix construction and TCR repertoire diversity metrics.

      Review #2 (Public review):

      Summary Comment:

      The study is nicely performed, but the definition of naive T cells using only CD44 and CD62L may be oversimplified. CD5hi naive T cells express higher CD44 than CD5lo cells.

      We thank the reviewer for the critical evaluation and thoughtful comment. As noted, we defined naïve CD8<sup>+</sup> T cells using a well-established gating strategy based on CD44<sup>lo</sup> and CD62L<sup>hi</sup> expression, consistent with previous studies (Immunity. 2010; 32(2):214–26; Nat Immunol. 2015; 16(1):107–17). We acknowledge that CD44 is expressed along a continuum, and indeed, within the naïve gate, CD5<sup>hi</sup> CD8<sup>+</sup> T cells exhibited slightly higher CD44 levels compared to their CD5<sup>lo</sup> counterparts. However, both subsets remained well below the CD44 expression observed in conventional effector/memory CD8<sup>+</sup> T cells, supporting their classification as naïve. To further validate this, we assessed additional markers associated with activation and memory differentiation, including CD69, PD-1, KLRG1 and CD25. These analyses confirmed that the sorted CD5<sup>hi</sup> and CD5<sup>lo</sup> populations retained a phenotypically naïve profile while exhibiting meaningful differences in baseline activation readiness (Figure 1F).

      Review #3 (Public review):

      CD5 can be regulated by peripheral signals. Therefore, it cannot be concluded that predisposition to effector/memory differentiation is solely programmed in the thymus.

      We thank the reviewer for this important point. We agree that CD5 expression can be dynamically regulated in the periphery by tonic TCR signals and antigen encounter, as also reflected in our own data showing that cells with a high CD5 level display elevated activation potential upon encountering antigen (e.g., Figure 3L). To minimize the confounding effects of pre-existing peripheral activation, we performed an adoptive T cell transfer experiment (Figure 4). In this experiment, naïve CD5<sup>hi</sup>CD8<sup>+</sup> and CD5<sup>lo</sup>CD8<sup>+</sup> T cells were sorted from the peripheral lymph nodes of young (6–8-week-old) prediabetic NOD mice and transferred into NOD Rag1<sup>–/–</sup> recipients. After 4 weeks, we compared the disease phenotypes and functional profiles of CD8<sup>+</sup> T cells from these two groups. This approach allowed us to evaluate the stability and differentiation capacity of CD5<sup>hi</sup> versus CD5<sup>lo</sup> cells in a lymphopenic environment, while excluding the possibility that the observed differences were due to already activated CD8<sup>+</sup> T cells at the time of isolation. We have revised the Discussion (lines 440–450) to acknowledge these experimental limitations and clarify that, while our findings demonstrate functional differences between CD5<sup>hi</sup>CD8<sup>+</sup> and CD5<sup>lo</sup>CD8<sup>+</sup> T cells, we cannot fully exclude contributions from peripheral influences.

      Experiments do not explain why PTPN22 overexpression protects in polyclonal T cells but not in NOD8.3 mice.

      We appreciate this critical comment. Our findings suggest that autoreactive T cells with high-affinity TCRs, as in NOD8.3 mice, receive signaling so strong that even PTPN22 overexpression is insufficient to attenuate their activation and effector function. We acknowledge that further mechanistic studies are needed to fully elucidate the differential effects of PTPN22 in polyclonal versus TCR-transgenic settings.

      Evidence that PTPN22 does not regulate TCR signaling in NOD8.3 T cells is weak.

      We thank the reviewer for this critical comment. Our data show that NOD8.3 T cells with intrinsically high CD5-associated self-reactivity are more resistant to transgenic Pep-mediated changes in the phosphorylation status of the TCR signaling molecules CD3ζ and Erk and in CD5 expression (Figure 6, B-D). However, we agree that additional functional assays would strengthen this conclusion.

      TCR sequencing does not conclusively link CD5hi cells with autoreactivity; single-cell analysis is needed.

      We agree with this critical comment. Bulk TCR sequencing revealed repertoire features associated with autoreactivity, but cannot definitively link specific TCRs to function. We have acknowledged this in the discussion (lines 466–471) and highlighted plans to perform single-cell analysis.

      CD5hi cells in the PLNs may reflect antigen exposure rather than basal signaling.

      We thank the reviewer for this insightful comment. As also noted in Figure 3L, CD5 expression can be influenced by peripheral tonic TCR signals and recent antigen exposure. To minimize the contribution of peripheral activation, we specifically characterized naïve CD8<sup>+</sup> T cells isolated from the peripheral lymph nodes of young (6–8-week-old) prediabetic NOD mice before the onset of overt autoimmunity. Furthermore, we performed an adoptive transfer experiment (Figure 4) using sorted naïve CD5<sup>hi</sup>CD8<sup>+</sup> and CD5<sup>lo</sup>CD8<sup>+</sup> T cells from these mice. After 4 weeks in lymphopenic NOD Rag1<sup>–/–</sup> recipients, we characterized the disease phenotype and evaluated the effector function of CD8<sup>+</sup> T cells. This approach allowed us to compare the differentiation potential of these subsets in a controlled setting, independent of their activation status at the time of isolation. We have revised the Discussion (lines 440–450) to emphasize that, while our data support functional differences between CD5<sup>hi</sup>CD8<sup>+</sup> and CD5<sup>lo</sup>CD8<sup>+</sup> T cells, we cannot fully exclude the role of peripheral cues in shaping CD5 expression.

      Provide proper gating controls and representative flow plots.

      We thank the reviewer for this comment. We have revised figure legends to better describe gating strategies and included representative flow cytometry plots and appropriate gating controls in Figures or Supplementary Figures.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) The figure presentation is inconsistent and the labels/font are often too small to read easily.

      As the reviewer suggested, the figure presentation has been revised for consistency. Labels and fonts have been adjusted for improved readability. Specific figures that were difficult to read have been reformatted with larger fonts and clearer legends.

      (2) A careful review of the text to ensure clarity of the content is suggested (e.g., “gratitude” at line 91, “were generally lied” at line 123).

      We thank the reviewer for these comments. The text has been carefully reviewed for clarity and grammatical accuracy. Corrections have been made, including changing “gratitude” to “magnitude” (line 47) and “were generally lied” to “fell between” (line 79).

      Reviewer #2 (Recommendations For The Authors):

      (1) The definition of naïve T cells based solely on CD44low and CD62Lhigh staining may be oversimplistic. Indeed, even within this definition, naïve CD5high CD8 T cells express much higher levels of CD44 than CD5low CD8 T cells.

      We thank the reviewer for these comments. We used a literature-supported gating strategy (Immunity. 2010; 32(2):214–26; Nat Immunol. 2015; 16(1):107–17) to define naïve T cells based on CD44<sup>low</sup> and CD62L<sup>high</sup> expression. It is important to note that CD44 expression exists along a continuum. While we were initially surprised to observe that CD5<sup>hi</sup>CD8<sup>+</sup> T cells expressed relatively higher levels of CD44 than CD5<sup>lo</sup>CD8<sup>+</sup> T cells within the naïve gate, both populations still exhibited significantly lower CD44 expression compared to conventional effector/memory CD8<sup>+</sup> T cells. To further validate the distinction between the CD5<sup>hi</sup> and CD5<sup>lo</sup> subsets, we also examined additional markers such as CD69, PD1, KLRG1 and CD25, which supported their phenotypic differences within the naïve compartment (Figure 1F).

      (2) Figure 1G should show the proportion of IGRP-tetramer+ in the three groups of CD8 T cells. Additionally, it would be useful to assess reactivity against a pool of other islet autoantigens using a similar strategy.

      As suggested by the reviewer, the revised manuscript now includes additional data showing the proportion of IGRP-tetramer<sup>+</sup> cells (Supplementary Figure 1D), as well as reactivity against another islet autoantigen, insulin-1/insulin-2 (Insulin B15–23) (Supplementary Figure 1E). The description of these results, including the proportions of IGRP-tetramer<sup>+</sup> and Insulin B15–23<sup>+</sup> CD8<sup>+</sup> T cells, has been added to lines 126–129 of the revised manuscript.

      (3) The resolution of Figure 2 is suboptimal and at places poorly visible. Figure 2D is stated to show “two significant pathways stand out.” In fact, the data are barely significant, and the authors may want to correct their statement.

      The resolution of Figure 2 has been improved. As the reviewer suggested, the text has been revised to state “two potential pathways stand out” (line 187) instead of “two significant pathways stand out”.

      (4) Figure 3C-F and 3H, showing fold change over baseline values would be much easier for the reader to grasp the data.

      As the reviewer suggested, data in Figures 3C-F and 3H are now shown as fold change over baseline values for clarity. Baseline gMFI is the mean of each group (total CD8<sup>+</sup>, CD5<sup>hi</sup>CD8<sup>+</sup> and CD5<sup>lo</sup>CD8<sup>+</sup>) at 0 μg/ml anti-CD3, with fold changes calculated for stimulation conditions (0.625–10 μg/ml anti-CD3). The figure legend has been updated accordingly.
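
For readers who want to reproduce this kind of normalisation, a minimal sketch follows. It is not the analysis code used for the manuscript: the function name `fold_change_over_baseline`, the data layout, and all numeric values are hypothetical and chosen only to illustrate dividing each gMFI reading by the group's mean gMFI at 0 μg/ml anti-CD3.

```python
from statistics import mean


def fold_change_over_baseline(gmfi_by_dose, baseline_dose=0.0):
    """Divide each gMFI value by the group's mean gMFI at the baseline dose.

    gmfi_by_dose maps an anti-CD3 dose (in ug/ml) to a list of replicate
    gMFI values for ONE group (e.g. CD5hi CD8+). The layout is an assumption.
    """
    baseline = mean(gmfi_by_dose[baseline_dose])
    return {
        dose: [value / baseline for value in values]
        for dose, values in gmfi_by_dose.items()
        if dose != baseline_dose  # fold change is reported for stimulated conditions only
    }


# Hypothetical gMFI readings for a single group, invented for illustration.
example_group = {
    0.0: [100.0, 120.0, 80.0],   # unstimulated replicates; their mean is the baseline
    0.625: [150.0, 250.0],
    10.0: [400.0, 600.0],
}

fold = fold_change_over_baseline(example_group)
print(fold)
```

Because the baseline is computed per group, each of the three populations would be passed through the function separately, so that differences in resting gMFI between groups do not inflate or mask the response to stimulation.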

      (5) Figure 4A, it would be much more valuable to show the diabetes frequency upon transfer of CD25- CD4 T cells alone and upon transfer of CD5high CD8 T cells alone. The word “spontaneous” in the Figure 4A legend seems inappropriate.

      We thank the reviewer for this comment. We apologize for not including the data for the CD25<sup>–</sup>CD4<sup>+</sup> T cell transfer group in the original manuscript. While this group was part of our initial experimental design, we had considered it a control group and unintentionally omitted it from the figure. The revised manuscript now includes this group in Figure 4A. In addition, the term “spontaneous” has been replaced with “diabetes incidence” in the Figure 4A legend and manuscript (line 248). Regarding the suggestion to assess transfer of CD5<sup>hi</sup>CD8<sup>+</sup> T cells alone, we appreciate the reviewer’s point. However, previous studies have shown that CD8<sup>+</sup> T cells alone are not sufficient to induce diabetes in adoptive transfer models, and that effective β-cell destruction typically requires both CD4<sup>+</sup> and CD8<sup>+</sup> T cell subsets. For instance, Christianson et al. (1993) demonstrated that enriched CD8<sup>+</sup> T cells from NOD mice fail to transfer diabetes on their own, while CD4<sup>+</sup> T cells, particularly from diabetic donors, can induce disease only under specific conditions and are significantly potentiated by co-transfer of CD8<sup>+</sup> cells. These findings have contributed to the widely adopted standard of co-transferring both subsets when studying diabetogenic potential in NOD models (Diabetes. 1993;42(1):44–55).

      (6) Line 257-258, please remove “indicating superior in vivo proliferation by the CD5hi subset.” Indeed, several other possibilities may explain the phenotype, including survival, migration, etc.

      As the reviewer suggested, the phrase “indicating superior in vivo proliferation by the CD5<sup>hi</sup> subset” has been replaced with “implying increased expansion and activation/effector potential” (line 261).

      (7) Figure 5A, it is unclear to this referee what is the significance of CD5 and pCD3zeta expression on DN thymocytes. Do these cells express rearranged alpha/beta TCR? Is it signaling through pre-TCRalpha/TCRbeta pairs?

      We thank the reviewer for this important question. In the revised manuscript, we have expanded the discussion (lines 455–471) to address the developmental significance of CD5 and pCD3ζ expression on DN thymocytes. CD5 expression at this stage reflects pre-TCR signaling strength during early selection, which occurs following successful TCRβ rearrangement. The associated phosphorylation of CD3ζ indicates activation of downstream signaling through the pre-TCRα/TCRβ complex. As discussed in the revised text, these early signals play a critical role in determining lineage progression and self-reactivity tuning. We now acknowledge that signaling at the DN stage occurs through the pre-TCRα/TCRβ heterodimer, not a fully rearranged αβ TCR, and that CD5 expression serves as a marker of the strength of these initial pre-selection signals (Sci Signal. 2022;15(736):eabj9842). These developmental checkpoints are essential for calibrating TCR sensitivity and ensuring proper thymocyte maturation.

      (8) Figure 5F, could the DP TCRbeta- CD69- thymocytes from 8.3-TCR NOD mice already express low levels of the self-reactive TCR at this stage to explain their high expression of CD5? Addressing the question experimentally would be useful.

      We thank the reviewer for this useful comment. According to a review by Huseby et al. (2022), expression of a functional TCRβ chain begins at the DN3 stage, initiating progression through the β-selection checkpoint. This is followed by TRAV locus recombination, resulting in the generation of αβ TCR-expressing double-positive 1 (DP-1) thymocytes. At the DP-1 stage, the quality of TCR signaling driven by self-pMHC interactions governs both positive and negative selection, as well as the development of nonconventional T cell lineages. We hypothesize that in transgenic NOD8.3 mice, which express pre-rearranged Tcra and Tcrb transgenes derived from the islet-reactive CD8<sup>+</sup> T cell clone NY8.3, thymocytes undergo allelic exclusion and lack the clonal diversity seen in non-transgenic mice. As a result, NOD8.3 thymocytes may receive strong TCR signals from early developmental stages (DN3 and DP-1) even without undergoing normal selection checkpoints. While the elevated TCR signal observed in NOD8.3 is indeed artificial, this model provides a unique system to test our hypothesis, namely whether a strongly self-reactive TCR can generate high basal signaling during thymic development that overrides the negative regulatory effects of phosphatases like Pep. This possibility has been acknowledged in the revised Discussion section, along with a plan to validate the hypothesis experimentally (lines 455–471).

      (9) Figure 7, single-cell TCR-seq would be much more appropriate to tackle the question of self-reactivity of CD5hi vs. CD5low CD8 T cells.

      We thank the reviewer for this useful comment. The limitations of bulk TCR-seq are acknowledged, and single-cell TCR-seq is proposed as a future direction (lines 455–471).

      Note: for Reviewer #2 (Recommendations For The Authors) points (7), (8) and (9), discussion paragraphs have been added to address the reviewer’s questions (lines 455–471).

      Reviewer #3 (Recommendations For The Authors):

      (1) Positive controls (activated T cells from PLN or spleen), gating controls (whole naïve T cells), and representative flow-cytometry plots are needed for T-bet, EOMES, GzmB, and cytokine staining in Figure 1.

      As the reviewer suggested, we added representative gating controls for T-bet, EOMES, GzmB and cytokine staining in Supplementary Figure 1 of the revised manuscript.

      (2) For Figure 1F, MFI for activation markers for the CD44hiCD62Llo cells should be provided for the comparison of PLN data.

      As the reviewer suggested, MFI data for these markers have been included in Figure 1F of the revised manuscript.

      (3) In many places and figure legends, it is not mentioned from which organ cells were collected, i.e., spleen or PLN.

      As the reviewer suggested, the origin of cells for each experiment has been explicitly indicated in the figure legends or figure content to ensure clarity.

      (4) In the pancreatic lymph node, autoreactive T cells might be upregulating CD5 because they are encountering antigens. This should be addressed in the discussion.

      As the reviewer suggested, this issue has been addressed in the Discussion of the revised manuscript (lines 440–450).

      (5) It is not clear if T cells from the spleen and PLN were stimulated to detect the production of pro-inflammatory cytokines.

      We thank the reviewer for this critical comment. The stimulation protocol and cytokine staining method have been added to the “Cytokine staining” section of the Supplementary Methods in the revised manuscript.

      (6) Figure 4C-D: It is not clear if analysis was done on naïve T cells or if they were stimulated.

      We thank the reviewer for this comment. The stimulation and cytokine staining methods used in Figure 4C-D have been described in detail in the “Cytokine staining” section of the Supplementary Materials in the revised manuscript.

      (7) IGRP gating in Figure 4F should be revisited with negative controls.

      We thank the reviewer for this critical comment. Negative controls have been added and used to adjust IGRP gating, and this is now mentioned in the figure legend of the revised manuscript.

      (8) Interpretation that only CD5hi cells form a central memory T cell population (Figure 4F) could be misleading.

      We thank the reviewer for this valuable comment. We agree that in conventional CD8<sup>+</sup> T cell immune responses, both CD5<sup>hi</sup> and CD5<sup>lo</sup> subsets have the potential to differentiate into central memory T cells. In our experimental approach, we adoptively transferred sorted CD5<sup>hi</sup>CD8<sup>+</sup> or CD5<sup>lo</sup>CD8<sup>+</sup> cells into Rag1<sup>-/-</sup> recipients and specifically analyzed PLNs four weeks after transfer. Using CD44 and CD62L expression as conventional markers for central memory T cells, we barely observed a CD44<sup>hi</sup>CD62L<sup>hi</sup> population in the CD5<sup>lo</sup>CD8<sup>+</sup> transferred group. Based on these results, we stated: “This analysis underscores that the central memory T cell population and the frequency of islet autoantigen-specific CD8<sup>+</sup>T cells are higher in the CD5<sup>hi</sup> transferred subset within the PLNs, implying more robust immune responses initiated by the CD5<sup>hi</sup>cells” (lines 272–274). Importantly, we did not intend to imply that only CD5<sup>hi</sup> cells can form central memory T cells, but rather that they were more enriched for this phenotype under the specific conditions and time point analyzed.

      (9) IL-2 gating representative plot should be provided for Figure 5A.

      As the reviewer suggested, a representative IL-2 gating plot has been included in the revised Supplementary Figure 3B.

    1. Author response:

      (1) General Statements

      The goal of our study was to mechanistically connect microbiota to host longevity. We have done so using a combination of genetic and physiological experiments, which outline a role for a neuroendocrine relay mediated by the intestinal neuropeptide Tachykinin, and its receptor TkR99D in neurons. We also show a requirement for these genes in metabolic and healthspan effects of microbiota.

      The referees' comments suggest they find the data novel and technically sound. We have added data in response to numerous points, which we feel enhance the manuscript further, and we have clarified text as requested. Reviewer #3 identified an error in Figure 4, which we have rectified. We felt that some specific experiments suggested in review would not add significant further depth, as we articulate below.

      Altogether our reviewers appear to agree that our manuscript makes a significant contribution to both the microbiome and ageing fields, using a large number of experiments to mechanistically outline the role(s) of various pathways and tissues. We thank the reviewers for their positive contributions to the publication process.

      (2) Description of the planned revisions

      Reviewer #2:

      Not…essential for publication…is it possible to look at Tk protein levels?

      We have acquired a small amount of anti-TK antibody and we will attempt to immunostain guts associated with A. pomorum and L. brevis. We are also attempting the equivalent experiment in mouse colon reared with/without a defined microbiota. These experiments are ongoing, but we note that the referee feels that the manuscript is a publishable unit whether these stainings succeed or not.

      (3) Description of the revisions that have already been incorporated in the transferred manuscript

      Reviewer #1:

      Can the authors state in the figure legends the numbers of flies used for each lifespan and whether replicates have been done?

      We have incorporated the requested information into legends for lifespan experiments.

      Do the interventions shorten lifespan relative to the axenic cohort? Or do they prevent lifespan extension by axenic conditions? Both statements are valid, and the authors need to be consistent in which one they use to avoid confusing the reader.

      We read these statements differently. The only experiment in which a genetic intervention prevented lifespan extension by axenic conditions is neuronal TkR86C knockdown (Figure 6B-C). Otherwise, microbiota shortened lifespan relative to axenic conditions, and genetic knockdowns blocked this effect (e.g. see lines 131-133). We have ensured that the framing is consistent throughout, with text edited at lines 198-199, 298-299, 311-312, 345-347, 407-408, 424-425, 450, 497-503.

      TkRNAi consistently reduces lipid levels in axenic flies (Figs 2E, 3D), essentially phenocopying the loss of lipid stores seen in control conventionally reared (CR) flies relative to control axenic. This suggests that the previously reported role of Tk in lipid storage - demonstrated through increased lipid levels in TkRNAi flies (Song et al (2014) Cell Rep 9(1): 40) - is dependent on the microbiota. In the absence of the microbiota TkRNAi reduces lipid levels. The lack of acknowledgement of this in the text is confusing

      We have added text at lines 219-222 to address this point. We agree that this effect is hard to interpret biologically, since expressing RNAi in axenics has no additional effect on Tk expression (Figure S7). Consequently we can only interpret this unexpected effect as a possible off-target effect of RU feeding on TAG, specific to axenic flies. However, this possibility does not void our conclusion, because an off-target diminution of TAG cannot explain why CR flies accumulate TAG following Tk<sup>RNAi</sup> induction. We hope that our added text clarifies this.

      I have struggled to follow the authors logic in ablating the IPCs and feel a clear statement on what they expected the outcome to be would help the reader.

      We have added the requested statement at lines 423-424, explaining that we expected the IPC ablation to render flies constitutively long-lived and non-responsive to A. pomorum.

      Can the authors clarify their logic in concluding a role for insulin signalling, and qualify this conclusion with appropriate consideration of alternative hypotheses?

      We have added our logic at lines 449-454. In brief, we infer an involvement of insulin signalling because FoxO mutant lifespan does not respond to Tk<sup>RNAi</sup>, and the mutation diminishes the lifespan-shortening effect of A. pomorum. However, we cannot state that the effects are direct because we do not have data that mechanistically connect Tk/TkR99D signalling directly to insulin-producing cells. The current evidence is most consistent with insulin signalling priming responses to microbiota/Tk/TkR99D, as per the newly-added text.

      Typographical errors

      We have remedied the highlighted errors, at lines 128-140.

      Reviewer #2:

      it would be good to show that the bacterial levels are not impacted [by TkRNAi]

      We have quantified CFUs in CR flies upon ubiquitous TkRNAi (Figure S5), finding that the RNAi does not affect bacterial load. New text at lines 138-139 articulates this point.

      The effect of Tk RNAi on TAG is opposite in CR and Ax or CR and Ap flies, and the knockdown shows an effect in either case (Figure 2E, Figure 3D). Why is this?

      As per response to Reviewer #1, we have added text at lines 219-222 to address this point.

      Is it possible to perform at least one lifespan repeat with the other Tk RNAi line mentioned?

      We have added another experiment showing longevity upon knockdown in conventional flies, using an independent TkRNAi line (Figure S3).

      Reviewer #3:

      In Line243, the manuscript states that the reporter activity was not increased in the posterior midgut. However, based on the presented results in Fig4E, there is seemingly not apparent regional specificity. A more detailed explanation is necessary.

      We thank the reviewer sincerely for their keen eye, which has highlighted an error in the previous version of the figure. In revisiting this figure we have noticed, to our dismay, that the figures for GFP quantification were actually re-plots of the figures for (ac)K quantification. This error led to the discrepancy between statistics and graphics, which thankfully the reviewer noticed. We have revised the figure to remedy our error, and the statistics now match the boxplots and results text.

      Fig1C uses Adh for normalization. Given the high variability of the result, the authors should (1) check whether Adh expression levels changed via bacterial association

      We selected Adh on the basis of our RNAseq analysis, which showed it was not different between AX and CV guts, whereas many commonly-used “housekeeping” genes were. We have now added a plot to demonstrate (Figure S2).

      The statement in Line 82 that EEs express 14 peptide hormones should be supported with an appropriate reference

      We have added the requested reference (Hung et al, 2020) at line 86.

      (4) Description of analyses that authors prefer not to carry out

      Reviewer #1:

      I'd encourage the authors to provide lifespan plots that enable comparison between all conditions

      We have avoided this approach because the number of survival curves that would need to be presented on the same axis (e.g. 16 for Figure 5) is not legible. However we have ensured that axes on faceted plots are equivalent and with grid lines for comparison. Moreover, our approach using statistical coefficients (EMMs) enables direct quantitative comparison of the differences among conditions.

      Reviewer #2:

      Is it possible that this driver is simply not resulting in an efficient KD of the receptor? I would be inclined to check this

      This comment relates to Figure 7G. We do see an effect of the knockdown in this experiment, so we believe that the knockdown is effective. However the direction of response is not consistent with our hypothesis so the experiment is not informative about the role of these cells. We therefore feel there is little to be gained by testing efficacy of knockdown, which would also be technically challenging because the cells are a small population in a larger tissue which expresses the same transcripts elsewhere (i.e. necessitating FISH).

      Would it be possible to use antibodies for acetylated histones?

      The comment relates to Figure 4C-E. The proposed studies would be a significant amount of work because, to our knowledge, the specific histone marks which drive activation in TK+ cells remain unknown. On the other hand, we do not see how this information would enrich the present story, rather such experiments would appear to be the beginning of something new. We therefore agree with Reviewer #1 (in cross-commenting) that this additional work is not justified.

      Reviewer #3:

      Tk+ EEC activity should be assessed directly, rather than relying solely on transcript levels. Approaches such as CaLexA or GCaMP could be used.

      We agree with reviewers 1-2 (in cross-commenting) that this proposal is non-trivial and not justified by the additional insight that would be gained. As described above, we are attempting to immunostain Tk, which if successful will provide a third line of evidence for regulation of Tk+ cells. However we note that we already have the strongest possible evidence for a role of these cells via genetic analysis (Figure 5).

      While the difficulty of maintaining lifelong axenic conditions is understandable, it may still be feasible to assess the induction of Tk (ie. Tk transcription or EE activity upregulation) by the microbiome on males.

      As the reviewer recognises, maintaining axenic experiments for months on end is not trivial. Given the tendency for males either to simply mirror female responses to lifespan-extending interventions, or to not respond at all, we made the decision in our work to only study females. We have instead emphasised in the manuscript that results are from female flies.

      TkR86C, in addition to TkR99D, may be involved in the A. pomorum-lifespan interaction. Consider revising the title to refer more generally to the "tachykinin receptor" rather than only TkR99D.

      We disagree with this interpretation: the results do not show that TkR86C-RNAi recapitulates the effect of enteric Tk-RNAi. A potentially interesting interaction is apparent, but the data do not support a causal role for TkR86C. A causal role is supported only for TkR99D, knockdown of which recapitulates the longevity of axenic flies and Tk<sup>RNAi</sup> flies. We therefore feel that our current title is justified by the data, and a more generic version would misrepresent our findings.

      The difference between "aging" and "lifespan" should also be addressed.

      The smurf phenotype is a well-established metric of healthspan. Moreover, lifespan is the leading aggregate measure of ageing. We therefore feel that the use of “ageing” in the title is appropriate.

      If feasible, assessing foxo activation would add mechanistic depth. This could be done by monitoring foxo nuclear localization or measuring the expression levels of downstream target genes.

      Foxo nuclear localisation has already been shown in axenic flies (Shin et al, 2011). We have added text and citation at lines 401-402.

    1. Author response:

      We thank the reviewers for their thoughtful, constructive, and generous evaluations of our manuscript. We are encouraged by their overall assessment of the clarity, novelty, and significance of the work, and we appreciate the opportunity to further strengthen the manuscript.

      Both reviewers highlight the central contribution of this study: a developmental, circuit-level dissection of how heart–brain signaling emerges in a vertebrate. We are pleased that the evidence supporting the staggered assembly of vagal motor, sympathetic, and sensory pathways was found to be compelling, and that the computational and experimental framework was viewed as appropriate and informative.

      Below, we briefly outline how we plan to address the main points raised in the reviews.

      Heart rate variability and temporal structure

      Both reviewers note that heart rate variability (HRV) changes across development and suggest that HRV may provide additional insight into the function of autonomic circuits. We agree that HRV is an important physiological readout and that its developmental changes are consistent with the progressive emergence of autonomic control.

      In the revised manuscript, we plan to (i) discuss heart rate variability more explicitly in the context of circuit maturation and (ii) clarify the temporal scales captured by our experiments and modeling framework. In particular, we will emphasize that our analyses focus on relationships between neural activity and heart-rate trajectories at timescales accessible given imaging rate and indicator kinetics, rather than beat-to-beat variability. We will also consider adding a supplementary analysis of the variability that can be reliably measured within these constraints, and, where appropriate, how neural activity predicts that measurable variation.

      Scope and interpretation of the computational models

      Reviewer #2 raises thoughtful points regarding what the generalized linear models can and cannot disambiguate, particularly when multiple efferent pathways may contribute to heart-rate dynamics. We will revise the text to more clearly distinguish between functional encoding relationships inferred from the models and anatomical connectivity that is directly demonstrated.

      Our intent is to frame the kernels identified in the motor and sympathetic pathways as computational motifs that capture distinct dynamical contributions, rather than as exclusive or complete explanations of heart-rate control. We will clarify these limitations explicitly in the Results and Discussion.

      Circuit diagram and anatomical interpretation

      We appreciate the reviewer’s careful reading of the proposed circuit schematic. In the revised manuscript, we will revise the figure and accompanying text to clearly annotate which connections are directly observed, which are functionally inferred, and which remain hypothetical. We will also expand the Discussion to explicitly address open questions, including unresolved feedback pathways and the potential for additional nodes in the circuit.

      We believe these revisions will improve clarity without altering the core conclusions of the study. We thank the reviewers again for their insightful feedback and look forward to submitting a revised version of the manuscript that addresses these points in detail.

    1. Author response:

      We thank the editors and reviewers for their generally positive and thoughtful feedback on this work. Below are provisional responses to some of the concerns raised:

      Reviewer 1:

      At a total scan duration of 2 minutes, the ASL sequence utilized in this cohort is much shorter than that of a typical ASL sequence (closer to 5 minutes as mentioned by the authors). However, this implementation also included multiple (n=5) PLDs. As currently described, it is unclear how many repetitions were acquired at each PLD and whether these were acquired efficiently (i.e., with a Look-Locker readout) or whether individual repetitions within this acquisition were dedicated to a single PLD. If the latter, the number of repetitions per PLD (and consequently signal-to-noise-ratio, SNR) is likely to be very low. Have the authors performed any analyses to determine whether the signal in individual subjects generally lies above the noise threshold? This is particularly relevant for white matter, which is the focus of several findings discussed in the study.

      We agree that this was a short acquisition compared to most ASL protocols, necessitated by the strict time-keeping requirements for running such a large study. We apologise if this was not clear in the original manuscript, but due to this time constraint and the use of a segmented readout (which was not Look-Locker) there was only time available for a single average at each PLD. This does mean that the perfusion weighted images at each PLD are relatively noisy, although the image quality with this sequence was still reasonable, as demonstrated in Figure 1, with perfusion weighted images visibly above the noise floor. In addition, as has been demonstrated theoretically and experimentally in recent work (Woods et al., 2023, 2020), even though the SNR of each individual PLD image might be low in multi-PLD acquisitions, this is effectively recovered during the model fitting process, giving it comparable or greater accuracy than a protocol which collects many averages at a single (long) PLD. As also noted by the reviewers, this approach has the further benefit of allowing ATT estimation, which has proven to provide useful and complementary information to CBF. Finally, the fact that many of the findings in this study pass strict statistical thresholds for significance, despite the many multiple comparisons performed, and that the spatial patterns of these relationships are consistent with expectations, even in the white matter (e.g. Figure 6B), give us confidence that the perfusion estimation is robust. However, we will consider adding some additional metrics around SNR or fitting uncertainty in a revised manuscript, as well as clarifying details of the acquisition.

      Hematocrit is one of the variables regressed out in order to reduce the effect of potential confounding factors on the image-derived phenotypes. The effect of this, however, may be more complex than accounting for other factors (such as age and sex). The authors acknowledge that hematocrit influences ASL signal through its effect on longitudinal blood relaxation rates. However, it is unclear how the authors handled the fact that the longitudinal relaxation of blood (T1Blood) is explicitly needed in the kinetic model for deriving CBF from the ASL data. In addition, while it may reduce false positives related to the relationships between dietary factors and hematocrit, it could also mask the effects of anemia present in the cohort. The concern, therefore, is two-fold: (1) Were individual hematocrit values used to compute T1Blood values? (2) What effect would the deconfounding process have on this?

      We agree this is an important point to clarify. In this work we decided not to use the haematocrit to directly estimate the T1 of blood for each participant a) because this would result in slight differences in the model fitting for each subject, which could introduce bias (e.g. the kinetic model used assumes instantaneous exchange between blood water and tissue, so changing the T1 of blood for each subject could make us more sensitive to inaccuracies in this assumption); and b) because typically the haematocrit measures were quite some time (often years) prior to the imaging session, leading to an imperfect correction. We therefore took the pragmatic approach to simply regress each subject’s average haematocrit reading out of the IDP and voxelwise data to prevent it contributing to apparent correlations caused by indirect effects on blood T1. However, we agree with the reviewer that this certainly would mask the effects of anaemia in this cohort, so for researchers interested in this condition a different approach should be taken. We will update the revised manuscript to try to clarify these points.
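
To make the idea of "regressing out" a confound concrete, a minimal sketch is given below. It is not the study's deconfounding pipeline: it handles a single confound with ordinary least squares, and the function name `regress_out` and all numeric values are hypothetical. A real analysis would typically remove many confounds jointly.

```python
from statistics import mean


def regress_out(values, confound):
    """Return residuals of `values` after removing a least-squares linear fit on `confound`."""
    mean_x = mean(confound)
    mean_y = mean(values)
    sxx = sum((x - mean_x) ** 2 for x in confound)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(confound, values))
    slope = sxy / sxx
    # Remove the mean and the part explained by the confound, so residuals are centred on zero.
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(confound, values)]


# Hypothetical per-subject values, invented for illustration only.
haematocrit = [0.38, 0.40, 0.42, 0.44, 0.46]
cbf_idp = [62.0, 60.0, 59.0, 55.0, 54.0]

cbf_deconfounded = regress_out(cbf_idp, haematocrit)
print(cbf_deconfounded)
```

The residuals are, by construction, linearly uncorrelated with the confound. This is why any variance shared between haematocrit and perfusion, including a genuine effect of anaemia, is removed along with the unwanted blood T1 effect.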

      The authors leverage an observed inverse association between white matter hyperintensity volume and CBF as evidence that white matter perfusion can be sensitively measured using the imaging protocol utilized in this cohort. The relationship between white matter hyperintensities and perfusion, however, is not yet fully understood, and there is disagreement regarding whether this structural imaging marker necessarily represents impaired perfusion. Therefore, it may not be appropriate to use this finding as support for validation of the methodology.

      We appreciate the reviewer’s point that there is still debate about the relationship between white matter hyperintensities and perfusion. We therefore agree that this observed relationship does not validate the methodology in the sense that it is an expected finding, but it does demonstrate that the data quality is sufficient to show significant correlations between white matter hyperintensity volume and perfusion, even in white matter regions, which would not be the case if the signal there were dominated by noise. Similarly, the clear spatial pattern of perfusion changes in the white matter that correlate with DTI measures in the same regions also suggests there is sensitivity to white matter perfusion. However, we will update the wording in the revised manuscript to try to clarify this point.

      Reviewer 2:

      This study primarily serves to illustrate the efficacy and potential of ASL MRI as an imaging parameter in the UK Biobank study, but some of the preliminary observations will be hypothesis-generating for future analyses in larger sample sizes. However, a weakness of the manuscript is that some of the reported observations are difficult to follow. In particular, the associations between ASL and resting fMRI illustrated in Figure 7 and described in the accompanying Results text are difficult to understand. It could also be clearer whether the spatial maps showing ASL correlates of other image-derived phenotypes in Figure 6B are global correlations or confined to specific regions of interest. Finally, while addressing partial volume effects in gray matter regions by covarying for cortical thickness is a reasonable approach, the Methods section seems to imply that a global mean cortical thickness is used, which could be problematic given that cortical thickness changes may be localized.

      We apologise if any of the presented information was unclear and will try to improve this in our revised manuscript. To clarify, the spatial maps associated with other (non-ASL) IDPs were generated by calculating the correlation between the ASL CBF or ATT in every voxel in standard space with the non-ASL IDP of interest, not the values of the other imaging modality in the same voxel. No region-based masking was used for this comparison. This allowed us to examine whether the correlation with this non-ASL IDP was only within the same brain region or if the correlations extended to other regions too.
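
      To make the construction of these maps concrete, the following Python sketch computes, for every voxel, the across-subject Pearson correlation between that voxel's ASL value and a single non-ASL IDP. The array shapes, the function name `voxelwise_correlation`, and the simulated data are assumptions for illustration, not the analysis code used in the study.

```python
import numpy as np


def voxelwise_correlation(voxel_data, idp):
    """Across-subject Pearson r between each voxel (column) and one IDP.

    voxel_data: (n_subjects, n_voxels), e.g. CBF in standard space
    idp:        (n_subjects,), one non-ASL IDP value per subject
    Voxels with zero variance across subjects would yield NaN.
    """
    voxel_data = np.asarray(voxel_data, dtype=float)
    idp = np.asarray(idp, dtype=float)
    v = voxel_data - voxel_data.mean(axis=0)  # centre each voxel across subjects
    x = idp - idp.mean()                      # centre the IDP
    return (v.T @ x) / np.sqrt((v ** 2).sum(axis=0) * (x ** 2).sum())


# Simulated data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
n_subjects, n_voxels = 200, 6
idp = rng.normal(size=n_subjects)
cbf = rng.normal(size=(n_subjects, n_voxels))
cbf[:, :3] += 0.8 * idp[:, None]  # only the first three voxels track the IDP

r_map = voxelwise_correlation(cbf, idp)  # one r per voxel, no regional mask
```

      Because every voxel is tested against the same subject-level IDP, the resulting map can show correlations outside the region from which the IDP was derived.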

      We also agree that the associations between ASL and resting fMRI are not easy to interpret. We therefore tried to be clear in the manuscript that these were preliminary findings that may be of interest to others, but clearly further study is required to explore this complex relationship. However, we will try to clarify how the results are presented in the revised manuscript.

      In relation to partial volume effects, we did indeed use only a global measure of cortical thickness in the deconfounding and we acknowledged that this could be improved in the discussion: [Partial volume effects were] “mitigated here by the inclusion of cortical thickness in the deconfounding process, although a region-specific correction approach that is aware of the through-slice blurring (Boscolo Galazzo et al., 2014) is desirable in future iterations of the ASL analysis pipeline.” As suggested here, although this is a coarse correction, we did not feel that a more comprehensive partial volume correction approach could be used without properly accounting for the through-slice blurring effects from the 3D-GRASE acquisition (that will vary across different brain regions), which is not currently available, although this is an area we are actively working on for future versions of the image analysis pipeline. We again will try to clarify this point further in the revised manuscript.

      References

      Woods JG, Achten E, Asllani I, Bolar DS, Dai W, Detre J, Fan AP, Fernández-Seara M, Golay X, Günther M, Guo J, Hernandez-Garcia L, Ho M-L, Juttukonda MR, Lu H, MacIntosh BJ, Madhuranthakam AJ, Mutsaerts HJ, Okell TW, Parkes LM, Pinter N, Pinto J, Qin Q, Smits M, Suzuki Y, Thomas DL, Van Osch MJP, Wang DJ, Warnert EAH, Zaharchuk G, Zelaya F, Zhao M, Chappell MA. 2023. Recommendations for Quantitative Cerebral Perfusion MRI using Multi-Timepoint Arterial Spin Labeling: Acquisition, Quantification, and Clinical Applications (preprint). Open Science Framework. doi:10.31219/osf.io/4tskr

      Woods JG, Chappell MA, Okell TW. 2020. Designing and comparing optimized pseudo-continuous Arterial Spin Labeling protocols for measurement of cerebral blood flow. NeuroImage 223:117246. doi:10.1016/j.neuroimage.2020.117246

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews: 

      Reviewer #2 (Public review): 

      Summary: 

      This is an interesting study exploring methods for reconstructing visual stimuli from neural activity in the mouse visual cortex. Specifically, it uses a competition dataset (published in the Dynamic Sensorium benchmark study) and a recent winning model architecture (DNEM, dynamic neural encoding model) to recover visual information stored in ensembles of mouse visual cortex. 

      Strengths: 

      This is a great start for a project addressing visual reconstruction. It is based on physiological data obtained at a single-cell resolution, the stimulus movies were reasonably naturalistic and representative of the real world, the study did not ignore important correlates such as eye position and pupil diameter, and of course, the reconstruction quality exceeded anything achieved by previous studies. There appear to be no major technical flaws in the study, and some potential confounds were addressed upon revision. The study is an enjoyable read. 

      Weaknesses: 

      The study is technically competent and benchmark-focused, but without significant conceptual or theoretical advances. The inclusion of neuronal data broadens the study's appeal, but the work does not explore potential principles of neural coding, which limits its relevance for neuroscience and may create some disappointment to some neuroscientists. The authors are transparent that their goal was methodological rather than explanatory, but this raises the question of why neuronal data were necessary at all, as more significant reconstruction improvements might be achievable using noise-less artificial video encoders alone (network-to-network decoding approaches have been done well by teams such as Han, Poggio, and Cheung, 2023, ICML). Yet, even within the methodological domain, the study does not articulate clear principles or heuristics that could guide future progress. The finding that more neurons improve reconstruction aligns with well-established results in the literature that show that higher neuronal numbers improve decoding in general (for example, Hung, Kreiman, Poggio, and DiCarlo, 2005) and thus may not constitute a novel insight. 

      We thank the reviewer for this second round of comments and hope we were able to address the remaining points below. 

      Indeed, using surrogate noiseless data is interesting and useful when developing such methods, or to demonstrate that they work in principle. But in order to evaluate if they really work in practice, we need to use real neuronal data. While we did not try movie reconstruction from layers within artificial neural networks as surrogate data, in Supplementary Figure 3C we provide the performance of our method using simulated/predicted neuronal responses from the dynamic neural encoding model alongside real neuronal responses.

      Specific issues: 

      (1) The study showed that it could achieve high-quality video reconstructions from mouse visual cortex activity using a neural encoding model (DNEM), recovering 10-second video sequences and approaching a two-fold improvement in pixel-by-pixel correlation over previous attempts. As a reader, I was left with the question: okay, does this mean that we should all switch to DNEM for our investigations of mouse visual cortex? What makes this encoding model special? It is introduced as "a winning model of the Sensorium 2023 competition which achieved a score of 0.301...single trial correlation between predicted and ground truth neuronal activity," but as someone who does not follow this competition (most eLife readers are not likely to do so, either), I do not know how to gauge my response. Is this impressive? What is the best theoretical score, given noise and other limitations? Is the model inspired by the mouse brain in terms of mechanisms or architecture, or was it optimized to win the competition by overfitting it to the nuances of the data set? Of course, I know that as a reader, I am invited to read the references, but the study would stand better on its own, if it clarified how its findings depended on this model.

      The revision helpfully added context to the Methods about the range of scores achieved by other models, but this information remains absent from the Abstract and other important sections. For instance, the Abstract states, "We achieve a pixel-level correlation of 0.57 between the ground truth movie and the reconstructions from single-trial neural responses," yet this point estimate (presented without confidence intervals or comparisons to controls) lacks meaning for readers who are not told how it compares to prior work or what level of performance would be considered strong. Without such context, the manuscript undercuts potentially meaningful achievements. 

      We appreciate that the additional information about the performance of the SOTA DNEM to predict neural responses could be made more visible in the paper and will therefore move it from the methods to the results section instead: 

      Line 348 “This model achieved an average single-trial correlation between predicted and ground truth neural activity of 0.291 during the competition, this was later improved to 0.301. The competition benchmark models achieved 0.106, 0.164 and 0.197 single-trial correlation, while the third and second place models achieved 0.243 and 0.265. Across the models, a variety of architectural components were used, including 2D and 3D convolutional layers, recurrent layers, and transformers, to name just a few.” will be moved to the results.

      With regard to the lack of context for the performance of our reconstruction in the abstract, we may have overcorrected in the previous revision round and have tried to find a compromise which gives more context to the pixel-level correlation value: 

      Abstract: “We achieve a pixel-level correlation of 0.57 (95% CI [0.54, 0.60]) between ground-truth movies and single-trial reconstructions. Previous reconstructions based on awake mouse V1 neuronal responses to static images achieved a pixel-level correlation of 0.238 over a similar retinotopic area.”

      (2) Along those lines, the authors conclude that "the number of neurons in the dataset and the use of model ensembling are critical for high-quality reconstructions." If true, these principles should generalize across network architectures. I wondered whether the same dependencies would hold for other network types, as this could reveal more general insights. The authors replied that such extensions are expected (since prior work has shown similar effects for static images) but argued that testing this explicitly would require "substantial additional work," be "impractical," and likely not produce "surprising results." While practical difficulty alone is not a sufficient reason to leave an idea untested, I agree that the idea that "more neurons would help" would be unsurprising. The question then becomes: given that this is a conclusion already in the field, what new principle or understanding has been gained in this study? 

      As mentioned in our previous round of revisions, we chose not to pursue the comparison of reconstructions using different model architectures in this manuscript because we did not think it would add significant insights to the paper given the amount of work it would require, and we are glad the reviewer agrees. 

      While the fact that more neurons result in better reconstructions is unsurprising, how quickly performance drops off will depend on the robustness of the method, and on the dimensionality of the decoding/reconstruction task (decoding grating orientation likely requires fewer neurons than gray scale image reconstruction, which in turn likely requires fewer neurons than full color movie reconstruction). How dependent input optimization based image/movie reconstruction is on population size has not been shown, so we felt it was useful for readers to know how well movie reconstruction works with our method when recording from smaller numbers of neurons. 

      (3) One major claim was that the quality of the reconstructions depended on the number of neurons in the dataset. There were approximately 8000 neurons recorded per mouse. The correlation difference between the reconstruction achieved by 1000 neurons and 8000 neurons was ~0.2. Is that a lot or a little? One might hypothesize that 7000 additional neurons could contribute more information, but perhaps, those neurons were redundant if their receptive fields are too close together or if they had the same orientation or spatiotemporal tuning. How correlated were these neurons in response to a given movie? Why did so many neurons offer such a limited increase in correlation? Originally, this question was meant to prompt deeper analysis of the neural data, but the authors did not engage with it, suggesting a limited understanding of the neuronal aspects of the dataset. 

      We apologize that we did not engage with this comment enough in the previous round. We assumed that the question arose because there was a misunderstanding about Figure 5: 1000 neurons, not 1 neuron, are sufficient to reconstruct the movies to a pixel-level correlation of 0.344. Of course, the fact that increasing the number of neurons from 1000 to 8000 only increased the reconstruction performance from 0.344 to 0.569 (65% increase in correlation) is still worth discussing. To illustrate this drop in performance qualitatively, we show 3 example frames from movie reconstructions using 1000-8000 neurons in Author response image 1.

      Author response image 1.

      3 example frames from reconstructions using different numbers of neurons. 

      As the reviewer points out, the diminishing returns of additional neurons to reconstruction performance are at least partly because there is redundancy in how a population of neurons represents visual stimuli. In supplementary figure S2, we inferred the on-off receptive fields of the neurons and show that visual space is oversampled in terms of the receptive field positions in panel C. However, the exact slope/shape of the performance vs population size curve we show in Figure 5 will also depend on the maximum performance of our reconstruction method, which is limited in spatial resolution (Figure 4 & Supplementary Figure S5). It is possible that future reconstruction approaches will require fewer neurons than ours, so we interpret this curve rather as a description of the reconstruction method itself than a feature of the underlying neuronal code. For that reason, we chose caution and refrained from making any claims about neuronal coding principles based on this plot.

      (4) We appreciated the experiments testing the capacity of the reconstruction process, by using synthetic stimuli created under a Gaussian process in a noise-free way. But this originally further raised questions: what is the theoretical capability for reconstruction of this processing pipeline, as a whole? Is 0.563 the best that one could achieve given the noisiness and/or neuron count of the Sensorium project? What if the team applied the pipeline to reconstruct the activity of a given artificial neural network's layer (e.g., some ResNet convolutional layer), using hidden units as proxies for neuronal calcium activity? In the revision, this concern was addressed nicely in the review in Supplementary Figure 3C. Also, one appreciates that as a follow up, the team produced error maps (New Figure 6) that highlight where in the frames the reconstructions are likely to fail. But the maps went unanalyzed further, and I am not sure if there was a systematic trend in the errors.

      We are happy to hear that we were able to answer the reviewer’s question of what the maximum theoretical performance of our reconstruction process is in Supplementary Figure 3C. Regarding systematic trends in the error maps, we also did not observe any clear systematic trends. If anything, we noticed that some moving edges were shifted, but we do not think we can quantify this effect with this particular dataset.

      (5) I was encouraged by Figure 4, which shows how the reconstructions succeeded or failed across different spatial frequencies. The authors note that "the reconstruction process failed at high spatial frequencies," yet it also appears to struggle with low spatial frequencies, as the reconstructed images did not produce smooth surfaces (e.g., see the top rows of Figures 4A and 4B). In regions where one would expect a single continuous gradient, the reconstructions instead display specular, high-frequency noise. This issue is difficult to overlook and might deserve further discussion. 

      Thank you for pointing this out; this is indeed true. The reconstructions do have high-frequency noise. We mention this briefly in line 102: “Finally, we applied a 3D Gaussian filter with sigma 0.5 pixels to remove the remaining static noise (Figure S3) and applied the evaluation mask.” In revisiting this sentence, we think it is more appropriate to replace “remove” with “reduce”. This noise is more visible in the Gaussian noise stimuli (Figure 4) because we did not apply the 3D Gaussian filter to these reconstructions, in case it interfered with the estimates of the reconstruction resolution limits.
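
      For readers unfamiliar with this smoothing step, a minimal Python sketch using `scipy.ndimage.gaussian_filter` is shown here. The (time, height, width) array layout, the array size, and the simulated noise are assumptions for illustration; the sketch only shows that a 3D Gaussian filter with sigma 0.5 reduces, rather than removes, pixel-wise noise.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

# Hypothetical reconstruction: a smooth horizontal ramp plus pixel-wise static noise
n_frames, height, width = 30, 36, 64
ramp = np.tile(np.linspace(0.0, 1.0, width), (n_frames, height, 1))
noisy = ramp + rng.normal(scale=0.1, size=ramp.shape)

# A scalar sigma applies the same Gaussian width along time, height, and width
smoothed = gaussian_filter(noisy, sigma=0.5)

err_before = float(np.mean((noisy - ramp) ** 2))
err_after = float(np.mean((smoothed - ramp) ** 2))  # smaller, but not zero
```

      A larger sigma would suppress more noise at the cost of spatial and temporal resolution, which is the trade-off behind leaving the resolution-limit stimuli unfiltered.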

      Given that the Gaussian noise and drifting grating stimuli reconstructions were from predicted activity (“noise-free”), this high-frequency noise is not biological in origin and must therefore come from errors in our reconstruction process. This kind of high-frequency noise has previously been observed in feature visualization (optimizing input to maximize the activity of a specific node within a neural network to visualize what that node encodes; Olah, et al., "Feature Visualization", https://distill.pub/2017/feature-visualization/, 2017). It is caused by a kind of overfitting, whereby a solution to the optimization is found that is not “realistic”. Ways of combating this kind of noise include gradient smoothing, image smoothing, and image transformations during optimization, but these methods can restrict the resolution of the features that are recovered. Since we were more interested in determining the maximum resolution of stimuli that can be reconstructed in Figure 4 and Supplementary Figures 5-6, we chose not to apply these methods.

      Reviewer #3 (Public review): 

      Summary: 

      This paper presents a method for reconstructing input videos shown to a mouse from the simultaneously recorded visual cortex activity (two-photon calcium imaging data). The publicly available experimental dataset is taken from a recent brain-encoding challenge, and the (publicly available) neural network model that serves to reconstruct the videos is the winning model from that challenge (by distinct authors). The present study applies gradient-based input optimization by backpropagating the brain-encoding error through this selected model (a method that has been proposed in the past, with other datasets). The main contribution of the paper is, therefore, the choice of applying this existing method to this specific dataset with this specific neural network model. The quantitative results appear to go beyond previous attempts at video input reconstruction (although measured with distinct datasets). The conclusions have potential practical interest for the field of brain decoding, and theoretical interest for possible future uses in functional brain exploration. 

      Strengths: 

      The authors use a validated optimization method on a recent large-scale dataset, with a state-of-the-art brain encoding model. The use of an ensemble of 7 distinct model instances (trained on distinct subsets of the dataset, with distinct random initializations) significantly improves the reconstructions. The exploration of the relation between reconstruction quality and number of recorded neurons will be useful to those planning future experiments. 

      Weaknesses: 

      The main contribution is methodological, and the methodology combines pre-existing components without any new original component. 

      We thank the reviewer for their balanced assessment of our manuscript.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review): 

      Summary: 

      This paper presents a method for reconstructing videos from mouse visual cortex neuronal activity using a state-of-the-art dynamic neural encoding model. The authors achieve high-quality reconstructions of 10-second movies at 30 Hz from two-photon calcium imaging data, reporting a 2-fold increase in pixel-by-pixel correlation compared to previous methods. They identify key factors for successful reconstruction including the number of recorded neurons and model ensembling techniques. 

      Strengths: 

      (1) A comprehensive technical approach combining state-of-the-art neural encoding models with gradient-based optimization for video reconstruction. 

      (2) Thorough evaluation of reconstruction quality across different spatial and temporal frequencies using both natural videos and synthetic stimuli. 

      (3) Detailed analysis of factors affecting reconstruction quality, including population size and model ensembling effects. 

      (4) Clear methodology presentation with well-documented algorithms and reproducible code. 

      (5) Potential applications for investigating visual processing phenomena like predictive coding and perceptual learning. 

      We thank the reviewer for taking the time to provide this valuable feedback. We would like to add that in our eyes one additional main contribution is the step of going from reconstruction of static images to dynamic videos. We trust that in the revised manuscript, we have now made the point more explicit that static image reconstruction relies on temporally averaged responses, which negates the necessity of having to account for temporal dynamics altogether. 

      Weaknesses: 

      The main metric of success (pixel correlation) may not be the most meaningful measure of reconstruction quality: 

      High correlation may not capture perceptually relevant features.

      Different stimuli producing similar neural responses could have low pixel correlations.

      The paper doesn't fully justify why high pixel correlation is a valuable goal.

      This is a very relevant point. In retrospect, perhaps we did not justify this enough. Sensory reconstruction typically aims to reconstruct sensory input based on brain activity as faithfully as possible. A brain-to-image decoder might therefore be trained to produce images as close to the original input as possible. The loss function to train the decoder would therefore be image similarity on the pixel level. In that case, evaluating reconstruction performance based on pixel correlation is somewhat circular. 

      However, when reconstructing videos, we optimize the input video in terms of its perceptual similarity to the original video and only then evaluate pixel-level similarity. The perceptual similarity metric we optimize for is the estimate of how the neurons in mouse V1 respond to that video. We then evaluate the similarity of this perceptually optimized video to the original input video with pixel-level correlation. In other words, we optimize for perceptual similarity and then evaluate pixel similarity. If our method optimized pixel-level similarity, then we would agree that perceptual similarity is a more relevant evaluation metric. We do not think it was clear in our original submission that our optimization loss function is a perceptual loss function, and have now made this clearer in Figure 1C-D and have clarified this in the results section, line 70:

      “In effect, we optimized the input video to be perceptually similar with respect to the recorded neurons.”

      And in line 110: 

      “Because our optimization of the movies was based on a perceptual loss function, we were interested in how closely these movies matched the originals on the pixel level.”
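
      The optimize-then-evaluate logic can be sketched in Python with a toy stand-in for the encoding model. Everything specific here is an assumption for illustration: a random linear map `W` replaces the dynamic neural encoding model, a mean-squared-error loss on responses replaces the loss used in the paper, and the function name `reconstruct` is hypothetical. The point is only that the loss is computed in response space, while pixel-level correlation is used afterwards as a separate evaluation.

```python
import numpy as np


def reconstruct(encoder_weights, recorded, n_steps=500, lr=0.1):
    """Gradient descent on the input so that predicted responses match recorded ones."""
    n_neurons, n_pixels = encoder_weights.shape
    x = np.zeros(n_pixels)  # start from a blank input
    for _ in range(n_steps):
        predicted = encoder_weights @ x
        # Gradient of mean((predicted - recorded)**2) with respect to the input
        grad = (2.0 / n_neurons) * encoder_weights.T @ (predicted - recorded)
        x -= lr * grad
    return x


# Toy setup (hypothetical sizes and noise level)
rng = np.random.default_rng(0)
n_neurons, n_pixels = 256, 64
W = rng.normal(size=(n_neurons, n_pixels))   # stand-in encoding model
stimulus = rng.normal(size=n_pixels)         # ground-truth input
recorded = W @ stimulus + rng.normal(scale=1.0, size=n_neurons)  # noisy responses

recon = reconstruct(W, recorded)

# Evaluation happens in pixel space, which the optimization never saw directly
pixel_r = float(np.corrcoef(recon, stimulus)[0, 1])
```

      In this toy setting the pixel values are never part of the loss, so a high pixel-level correlation is an outcome of matching responses rather than something the optimization was trained to produce.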

      We chose to use pixel correlation to measure pixel-level similarity for several reasons: 1) it has been used in the past to evaluate reconstruction performance (Yoshida et al., 2020); 2) it is insensitive to contrast and luminance; and 3) correlation is a common metric, so most readers will have an intuitive understanding of how it relates to the data.
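
      The second point can be checked numerically. In this Python sketch (the frame size and the helper name `pixel_correlation` are assumptions for illustration), reducing contrast and shifting luminance leaves the pixel-level correlation at 1 even though the mean squared error is clearly non-zero.

```python
import numpy as np


def pixel_correlation(a, b):
    """Pearson correlation between two arrays of equal shape, over all pixels."""
    return float(np.corrcoef(np.ravel(a), np.ravel(b))[0, 1])


rng = np.random.default_rng(1)
frame = rng.random((36, 64))     # hypothetical grayscale frame in [0, 1)
rescaled = 0.5 * frame + 0.2     # halved contrast, raised luminance

r = pixel_correlation(frame, rescaled)         # equals 1 up to floating-point error
mse = float(np.mean((frame - rescaled) ** 2))  # non-zero despite r = 1
```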

      To further highlight why pixel similarity might be interesting to visualize, we have included additional analysis in Figure 6 illustrating pixel-level differences between reconstructions from experimentally recorded activity and predicted activity. 

      We expect that the type of perceptual similarity the reviewer is alluding to is pretrained neural network image embedding similarity (Zhang et al., 2018: https://doi.org/10.48550/arXiv.1801.03924). While these metrics seem to match human perceptual similarity, it is unclear if they reflect mouse vision. We did try to compare the embedding similarity from pretrained networks such as VGG16, but got results suggesting the reconstructed frames were no more similar to the ground truth than random frames, which is obviously not true. This might be because the ground truth videos were too different in resolution from the training data of these networks and because these metrics are typically very sensitive to decreases in resolution. 

      The best alternative approach to evaluate mouse perceptual similarity would be to show the reconstructed videos to the same animals while recording the same neurons and to compare these neural activation patterns to those evoked by the original ground truth videos. This has been done for static images in the past: Cobos et al., bioRxiv 2022, found that static image reconstructions generated using gradient descent evoked more similar trial-averaged (40 trials) responses to those evoked by ground truth images compared to other reconstruction methods. Unfortunately, we are currently not able to perform these in vivo experiments, which is why we used publicly available data for the current paper. We plan to use this method in the future. But this method is also not flawless as it assumes that the average response to an image is the best reflection of how that image is represented, which may not be the case for an individual trial.

      As far as we are aware, there is currently no method that, given a particular activity pattern in response to an image/video, can produce an image/video that induces a neural activity pattern that is closer to the original neural response than simply showing the same image/video again. Hypothetically, such a stimulus exists because of various visual processing phenomena we mention in our discussion (e.g., predictive coding and selective attention), which suggest that the image that is represented by a population of neurons likely differs from the original sensory input. In other words, what the brain represents is an interpretation of reality not a pure reflection. Experimentally verifying this is difficult, as these variations might be present on a single trial level. The first step towards establishing a method that captures the visual representation of a population of neurons is sensory reconstruction, where the aim is to get as close as possible to the original sensory input. We think pixel-level correlation is a stringent and interpretable metric for this purpose, particularly when optimizing for perceptual similarity rather than image similarity directly.

      Comparison to previous work (Yoshida et al.) has methodological concerns: Direct comparison of correlation values across different datasets may be misleading; Large differences in the number of recorded neurons (10x more in the current study); Different stimulus types (dynamic vs static) make comparison difficult; No implementation of previous methods on the current dataset or vice versa. 

      Yes, we absolutely agree that direct comparison to previous static image reconstruction methods is problematic. We primarily do so because we think it is standard practice to give related baselines. We agree that direct comparison of the performance of video reconstruction methods to image reconstruction methods is not really possible. It does not make sense to train and apply a dynamic model on a static image data set where neural activity is time-averaged, as the temporal kernels could not be learned. Conversely, for a static model, which expects a single image as input and predicts time-averaged responses, it does not make sense to feed it a series of temporally correlated movie frames and to simply concatenate the resulting activity prediction. The static model would need to be substantially augmented to incorporate temporal dynamics, which in turn would make it a new method. This puts us in the awkward position of being expected to compare our video reconstruction performance to previous image reconstruction methods without a fair way of doing so. We have now added these caveats in line 119:

      “However, we would like to stress that directly comparing static image reconstruction methods with movie reconstruction approaches is fundamentally problematic, as they rely on different data types both during training and evaluation (temporally averaged vs continuous neural activity, images flashed at fixed intervals vs continuous movies).”

      We have also toned down the language emphasising the comparison to previous image reconstruction performance in the abstract, results, and conclusion.

      Abstract: We removed “We achieve a ~2-fold increase in pixel-by-pixel correlation compared to previous state-of-the-art reconstructions of static images from mouse V1, while also capturing temporal dynamics.” and replaced with “We achieve a pixel-level correlation of 0.57 between the ground truth movie and the reconstructions from single-trial neural responses.”

      Discussion: we removed “In conclusion, we reconstruct videos presented to mice based on the activity of neurons in the mouse visual cortex, with a ~2-fold improvement in pixel-by-pixel correlation compared to previous static image reconstruction methods.” and replaced with “In conclusion, we reconstruct videos presented to mice based on single-trial activity of neurons in the mouse visual cortex.”

      We have also removed the performance table and have instead added supplementary figure 3 with in-depth comparison across different versions of our reconstruction method (variations of masking, ensembling, contrast & luminance matching, and Gaussian blurring). 

      Limited exploration of how the reconstruction method could provide insights into neural coding principles beyond demonstrating technical capability. 

      The aim of this paper was not to reveal principles of neural coding. Instead, we aimed to achieve the best possible performance of video reconstructions and to quantify the limitations. But to highlight its potential we have added two examples of how sensory reconstruction has been applied in human vision research in line 321: 

      “Although fMRI-based reconstruction techniques are starting to be used to investigate visual phenomena in humans (such as illusions [Cheng et al., 2023] and mental imagery [Shen et al., 2019; Koide-Majima et al., 2024; Kalantari et al., 2025]), visual processing phenomena are likely difficult to investigate using existing fMRI-based reconstruction approaches, due to the low spatial and temporal resolution of the data.”

      We have also added a demonstration of how this method could be used to investigate which parts of a reconstruction from a single-trial response differ from the model's prediction (Figure 6). We do this by calculating pixel-level differences between reconstructions from the recorded neural activity and reconstructions from the expected neural activity (the activity predicted by the neural encoding model). Although difficult to interpret, this pixel-by-pixel error map could represent trial-by-trial deviations of the neural code from a pure sensory representation. At this point, however, we cannot know whether these errors are anything more than errors in the reconstruction process. Deriving meaningful interpretations of these maps would require a substantial amount of additional work and in vivo experiments and so is outside the scope of this paper. We nevertheless include this additional analysis to a) highlight why pixel-level similarity might be interesting to quantify and visualize and b) demonstrate how video reconstruction could be used to provide insights into neural coding, namely as a tool to identify how sensory representations differ from a pure reflection of the visual input.
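
The error-map computation described above can be sketched as follows. This is a minimal illustration and not the code used for Figure 6: the function name `pixel_error_map`, the nested-list layout `[frame][row][column]`, the use of a signed (rather than absolute) difference, and the toy values are all assumptions made for this example.

```python
def pixel_error_map(recon_recorded, recon_predicted):
    """Signed per-pixel difference between two reconstructions of one movie.

    Both inputs are nested lists indexed as [frame][row][column]:
    - recon_recorded: reconstruction from recorded neural activity
    - recon_predicted: reconstruction from model-predicted activity
    """
    return [
        [
            [rec - pred for rec, pred in zip(row_rec, row_pred)]
            for row_rec, row_pred in zip(frame_rec, frame_pred)
        ]
        for frame_rec, frame_pred in zip(recon_recorded, recon_predicted)
    ]


# Toy example: a single 2x2 frame (illustrative values only).
from_recorded = [[[0.5, 0.25], [0.75, 1.0]]]
from_predicted = [[[0.5, 0.5], [0.25, 1.0]]]
error_map = pixel_error_map(from_recorded, from_predicted)
```

Non-zero entries mark pixels where the two reconstructions disagree; as noted above, such deviations cannot by themselves be attributed to the neural code rather than to the reconstruction process.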

      The claim that "stimulus reconstruction promises a more generalizable approach" (line 180) is not well supported with concrete examples or evidence. 

      What we mean by generalizable is the ability to apply reconstruction to novel stimuli, which is not possible for stimulus classification. We now explain this better in the paragraph in line 211: 

      “Stimulus identification, i.e. identifying the most likely stimulus from a constrained set, has been a popular approach for quantifying whether a population of neurons encodes the identity of a particular stimulus [Földiák, 1993, Kay et al., 2008]. This approach has, for instance, been used to decode frame identity within a movie [Deitch et al., 2021, Xia et al., 2021, Schneider et al., 2023, Chen et al., 2024]. Some of these approaches have also been used to reorder the frames of the ground truth movie [Schneider et al., 2023] based on the decoded frame identity. Importantly, stimulus identification methods are distinct from stimulus reconstruction where the aim is to recreate what the sensory content of a neuronal code is in a way that generalizes to new sensory stimuli [Rakhimberdina et al., 2021]. This is inherently a more demanding task because the range of possible solutions is much larger. Although stimulus identification is a valuable tool for understanding the information content of a population code, stimulus reconstruction could provide a more generalizable approach, because it can be applied to novel stimuli.”

      None of the stimuli we reconstructed were in the training set of the model, i.e., all of them were novel. We have also toned down the claim: we have replaced “promises” with “could provide”.

      The paper would benefit from addressing how the method handles cases where different stimuli produce similar neural responses, particularly for high-speed moving stimuli where phase differences might be lost in calcium imaging temporal resolution. 

      Thank you for this suggestion; we think this is a great question. Calcium dynamics are slow and some of the high temporal frequency information could indeed be lost, particularly phase information. In other words, when the stimulus has high temporal frequency information, it is harder to decode spatial information because of the slow calcium dynamics. Ideally, we would look at this effect using the drifting grating stimuli; however, this is problematic because we rely on predicted activity from the SOTA DNEM, and due to the dilation of the first convolution, the periodic grating stimulus causes aliasing. At 15 Hz, when the temporal frequency of the stimulus is half the movie frame rate, the model is actually being given two static images, and so the predicted activity is the interleaved activity evoked by two static images. We therefore do not think using the grating stimuli is a good idea. Instead, we used the Gaussian noise stimuli, which are not periodic and are therefore less affected by this problem.
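
The sampling part of this argument can be illustrated with a short sketch. It assumes a 30 Hz movie frame rate (inferred from the statement that 15 Hz is half the frame rate) and models the grating at a single pixel as a cosine modulation; the function name and parameters are hypothetical, and the sketch does not model the dilation of the DNEM's first convolution.

```python
import math

FRAME_RATE_HZ = 30.0  # assumption: inferred from "15 Hz is half the frame rate"


def sampled_modulation(temporal_freq_hz, n_frames, frame_rate_hz=FRAME_RATE_HZ):
    """Sample cos(2*pi*f*t) once per movie frame (t = k / frame_rate)."""
    return [
        round(math.cos(2 * math.pi * temporal_freq_hz * k / frame_rate_hz), 6)
        for k in range(n_frames)
    ]


# A 15 Hz modulation sampled at 30 Hz visits only two distinct values.
samples = sampled_modulation(15.0, 6)
distinct_values = sorted(set(samples))
```

With only two distinct sampled values, the movie presented to the model alternates between two static images, which is the situation described above.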

      We have now also reconstructed phase-inverted Gaussian noise stimuli and plotted the video correlation between the reconstructions from activity evoked by phase-inverted stimuli. On the one hand, we find that even for the fastest changing stimuli, the correlation between the reconstructions from phase-inverted stimuli is negative, meaning phase information is not lost at high temporal frequencies. On the other hand, for the highest spatial frequency stimuli, the correlation is positive. So, the predicted neural activity (and therefore the reconstructions) is phase-insensitive when the spatial frequency is higher than the reconstruction resolution limit we identified (spatial length constant of 1 pixel, or 3.38 degrees). Beyond this limit, the DNEM predicts activity in response to phase-inverted stimuli which, when used for reconstruction, results in movies that are more similar to each other than to the stimuli that actually evoke them.
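
The logic of this test can be sketched in a few lines. The sketch assumes that video correlation is the Pearson correlation over all pixels and frames of two movies; the helper names and the toy reconstructions are hypothetical and are not data from the paper.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def video_correlation(video_a, video_b):
    """Correlate two movies ([frame][row][column]) over all pixels and frames."""
    flat_a = [p for frame in video_a for row in frame for p in row]
    flat_b = [p for frame in video_b for row in frame for p in row]
    return pearson(flat_a, flat_b)


# Reconstruction from activity evoked by the original stimulus (toy values).
recon_original = [[[0.9, 0.1], [0.2, 0.8]], [[0.1, 0.9], [0.8, 0.2]]]
# Case 1: phase is preserved, so the inverted stimulus yields an inverted movie.
recon_inverted_sensitive = [[[0.1, 0.9], [0.8, 0.2]], [[0.9, 0.1], [0.2, 0.8]]]
# Case 2: phase is lost, so both stimuli yield nearly the same movie.
recon_inverted_insensitive = [[[0.8, 0.2], [0.3, 0.7]], [[0.2, 0.8], [0.7, 0.3]]]

corr_sensitive = video_correlation(recon_original, recon_inverted_sensitive)
corr_insensitive = video_correlation(recon_original, recon_inverted_insensitive)
```

A negative value indicates that phase information survives in the reconstructions, whereas a positive value indicates that the two reconstructions resemble each other more than they resemble their own inverted stimuli.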

      However, not all information is lost at these high spatial frequencies. If we plot the Shannon entropy in the spatial domain or the motion energy in the temporal domain, we find that even when the reconstructions fail to capture the stimulus at a pixel-specific level (spatial length constant of 1 pixel, or 3.38 degrees), they do capture the general spatial and temporal qualities of the videos. 

      We have added these additional analyses to Figure 4 and Supplementary Figure 5.

      Reviewer #2 (Public review): 

      This is an interesting study exploring methods for reconstructing visual stimuli from neural activity in the mouse visual cortex. Specifically, it uses a competition dataset (published in the Dynamic Sensorium benchmark study) and a recent winning model architecture (DNEM, dynamic neural encoding model) to recover visual information stored in ensembles of the mouse visual cortex. 

      This is a great project - the physiological data were measured at a single-cell resolution, the movies were reasonably naturalistic and representative of the real world, the study did not ignore important correlates such as eye position and pupil diameter, and of course, the reconstruction quality exceeded anything achieved by previous studies. Overall, it is great that teams are working towards exploring image reconstruction. Arguably, reconstruction may serve as an endgame method for examining the information content within neuronal ensembles - an alternative to training interminable numbers of supervised classifiers, as has been done in other studies. Put differently, if a reconstruction recovers a lot of visual features (maybe most of them), then it tells us a lot about what the visual brain is trying to do: to keep as much information as possible about the natural world in which its internal motor circuits may act consequently. 

      While we enjoyed reading the manuscript, we admit that the overall advance was in the range of those that one finds in a great machine learning conference proceedings paper. More specifically, we found no major technical flaws in the study, only a few potential major confounds (which should be addressable with new analyses), and the manuscript did not make claims that were not supported by its findings, yet the specific conceptual advance and significance seemed modest. Below, we will go through some of the claims, and ask about their potential significance. 

      We thank the reviewer for the positive feedback on our paper.

      (1) The study showed that it could achieve high-quality video reconstructions from mouse visual cortex activity using a neural encoding model (DNEM), recovering 10-second video sequences and approaching a two-fold improvement in pixel-by-pixel correlation over previous attempts. As a reader, I am left with the question: okay, does this mean that we should all switch to DNEM for our investigations of the mouse visual cortex? What makes this encoding model special? It is introduced as "a winning model of the Sensorium 2023 competition which achieved a score of 0.301... single-trial correlation between predicted and ground truth neuronal activity," but as someone who does not follow this competition (most eLife readers are not likely to do so, either), I do not know how to gauge my response. Is this impressive? What is the best achievable score, in theory, given data noise? Is the model inspired by the mouse brain in terms of mechanisms or architecture, or was it optimized to win the competition by overfitting it to the nuances of the data set? Of course, I know that as a reader, I am invited to read the references, but the study would stand better on its own if it clarified how its findings depended on this model. 

      This is a very good point. We do not think that everyone should switch to using this particular DNEM to investigate the mouse visual cortex, but we think DNEMs and stimulus reconstruction in general have a lot of potential. We think static neural encoding models have already been demonstrated to be an extremely valuable tool to investigate visual coding (Walker et al., 2019; Yoshida et al., 2021; Willeke et al., bioRxiv 2023). DNEMs are less common, largely because they are very large and are technically more demanding to train and use. That makes static encoding models more practical for some applications, but they do not have temporal kernels and are therefore only used for static stimuli. They cannot, for instance, encode direction tuning, only orientation tuning. But both static and dynamic encoding models have advantages over stimulus classification methods, which we outline in our discussion. Here we provide the first demonstration that previous achievements in static image reconstruction are transferable to movies.

      It has been shown in the past for static neural encoding models that choosing a better-performing model produces reconstructed static images that are closer to the original image (Pierzchlewicz et al., 2023). We chose this particular DNEM because of its capacity to predict neural activity (benchmarked against other models), because it was open source, and because the data it was designed for was also available.

      To give more context to the model used in the paper, we have included the following, line 348:

      “This model achieved an average single-trial correlation between predicted and ground truth neural activity of 0.291 during the competition, this was later improved to 0.301. The competition benchmark models achieved 0.106, 0.164 and 0.197 single-trial correlation, while the third and second place models achieved 0.243 and 0.265. Across the models, a variety of architectural components were used, including 2D and 3D convolutional layers, recurrent layers, and transformers, to name just a few.” 

      Concerning biologically inspired model design: the winning model contained 3 fully connected layers comprising the “Cortex” just before the final readout of neural activity, but we would consider this level of biological inspiration as minor. We do not think that the exact architecture of the model is particularly important, as the crucial aspect of such neural encoders is their ability to predict neural activity irrespective of how they achieve it. There has been a move towards creating foundation models of the brain (Wang et al., 2025) and the priority so far has been on predictive performance over mechanistic interpretability or similarity to biological structures and processes.

      Finally, we would like to note that we do not know what the maximum theoretical score for single-trial responses might be, and don't think there is a good way of estimating it in this context. 

      (2) Along those lines, two major conclusions were that "critical for high-quality reconstructions are the number of neurons in the dataset and the use of model ensembling." If true, then these principles should be applicable to networks with different architectures. How well can they do with other network types? 

      This is a good question. Our method critically relies on the accurate prediction of neural activity in response to new videos. It is therefore expected that a model that better predicts neural responses to stimuli will also be better at reconstructing those stimuli given population activity. This was previously shown for static images (Pierzchlewicz et al., 2023). It is also expected that whenever the neural activity is accurately predicted, the corresponding reconstructed frames will also be more similar to the ground truth frames. We have now demonstrated this relationship between prediction accuracy and reconstruction accuracy in supplementary figure 4.

      Although it would be interesting to compare the movie reconstruction performance of many different models with different architectures and activity prediction performances, this would involve quite substantial additional work because movie reconstruction is very resource- and time-intensive. Finding optimal hyperparameters to make such a comparison fair and informative would therefore be impractical and likely not yield surprising results. 

      We also think it is unlikely that ensembling would not improve reconstruction performance in other models because ensembling across model predictions is a common way of improving single-model performance in machine learning. Likewise, we think it is unlikely that the relationship between neural population size and reconstruction performance would differ substantially when using different models, because using more neurons means that a larger population of noisy neurons is “voting” on what the stimulus is. However, we would expect that if the model were worse at predicting neural activity, then more neurons are needed for an equivalent reconstruction performance. In general, we would recommend choosing the best possible DNEM available, in terms of neural activity prediction performance, when reconstructing movies using input optimization through gradient descent. 

      (3) One major claim was that the quality of the reconstructions depended on the number of neurons in the dataset. There were approximately 8000 neurons recorded per mouse. The correlation difference between the reconstruction achieved by 1 neuron and 8000 neurons was ~0.2. Is that a lot or a little? One might hypothesize that ~7,999 additional neurons could contribute more information, but perhaps, those neurons were redundant if their receptive fields were too close together or if they had the same orientation or spatiotemporal tuning. How correlated were these neurons in response to a given movie? Why did so many neurons offer such a limited increase in correlation? 

      In the population ablation experiments, we compared the performance using ~1000, ~2000, ~4000, and ~8000 neurons, and found an attenuation of 39.5% in video correlation when dropping 87.5% of the neurons (~1000 neurons remaining). We did not try reconstruction using just 1 neuron.

      (4) On a related note, the authors address the confound of RF location and extent. The study resorted to the use of a mask on the image during reconstruction, applied during training and evaluation (Line 87). The mask depends on pixels that contribute to the accurate prediction of neuronal activity. The problem for me is that it reads as if the RF/mask estimate was obtained during the very same process of reconstruction optimization, which could be considered a form of double-dipping (see the "Dead salmon" article, https://doi.org/10.1016/S1053-8119(09)71202-9). This could inflate the reconstruction estimate. My concern would be ameliorated if the mask was obtained using a held-out set of movies or image presentations; further, the mask should shift with eye position, if it indeed corresponded to the "collective receptive field of the neural population." Ideally, the team would also provide the characteristics of these putative RFs, such as their weight and spatial distribution, and whether they matched the biological receptive fields of the neurons (if measured independently). 

      We can reassure the reviewer that there is no double-dipping. We would like to clarify that the mask was trained only on videos from the training set of the DNEM and not the videos which were reconstructed. We have added the sentence, line 91: 

      “None of the reconstructed movies were used in the optimization of this transparency mask.”

      Making the mask dependent on eye position would be difficult to implement with the current DNEM, where eye position is fed to the model as an additional channel. When using a model where the image is first transformed into retinotopic coordinates in an eye position-dependent manner (such as in Wang et al., 2025) the mask could be applied in retinotopic coordinates and therefore be dependent on eye position. 

      Effectively, the alpha mask defines the relative level of influence each pixel contributes to neural activity prediction. We agree it is useful to compare the shape of the alpha mask with the location of traditional on-off receptive fields (RFs) to clarify what the alpha mask represents and characterise the neural population available for our reconstructions. We therefore presented the DNEM with on-off patches to map the receptive fields of single neurons in an in silico experiment (the experimentally derived RFs are not available). As expected, there is a rough overlap between the alpha mask (Supplementary Figure 2D), the average population receptive field (Supplementary Figure 2B), and the location of receptive field peaks (Supplementary Figure 2C). In principle, all three could be used during training or evaluation for masking, but we think that defining a mask based on the general influence of images on neural activity, rather than just on-off patch responses, is a more elegant solution.

      One idea of how to go a step further would be to first set the alpha mask threshold during training based on the % loss of neural activity prediction performance that threshold induces (in our case alpha=0.5 corresponds to ~3% loss in correlation between predicted vs recorded neural responses, see Supplementary Figure 3D), and second base the evaluation mask on a pixel correlation threshold (see example pixel correlation map in Supplementary Figure 2E) instead to avoid evaluating areas of the image with low image reconstruction confidence. 

      We referred to this figure in the result section, line 83:

      “The transparency masks are aligned with but not identical to the On-Off receptive field distribution maps using sparse-noise (Figure S2).” 

      We have also done additional analysis on the effect of masking during training and evaluation with different thresholds in Supplementary Figure 3.
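
How a thresholded mask enters the evaluation can be sketched as follows. The threshold of 0.5 is the value quoted above; everything else (the helper names, the `>=` comparison, and the toy 3x3 frame) is an assumption made for illustration and does not reproduce the masks used in the paper.

```python
import math


def binarize_mask(alpha_mask, threshold=0.5):
    """True where a pixel's alpha value reaches the threshold."""
    return [[alpha >= threshold for alpha in row] for row in alpha_mask]


def masked_values(frame, mask):
    """Flatten the pixels of a frame that fall inside the mask."""
    return [p for row, m_row in zip(frame, mask) for p, keep in zip(row, m_row) if keep]


def pearson(a, b):
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Toy 3x3 frame: the reconstruction is good in the centre, poor in the corners.
alpha_mask = [[0.1, 0.6, 0.2], [0.7, 0.9, 0.8], [0.3, 0.5, 0.1]]
truth = [[0.0, 0.2, 0.9], [0.4, 0.6, 0.8], [0.1, 0.3, 0.5]]
recon = [[0.9, 0.25, 0.0], [0.35, 0.65, 0.7], [0.8, 0.3, 0.2]]

mask = binarize_mask(alpha_mask)
n_pixels_kept = sum(keep for row in mask for keep in row)
masked_corr = pearson(masked_values(truth, mask), masked_values(recon, mask))
full_corr = pearson(
    [p for row in truth for p in row], [p for row in recon for p in row]
)
```

In this toy case the masked correlation is much higher than the unmasked one, which illustrates why the choice of evaluation threshold needs to be reported alongside the performance values.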

      (5) We appreciated the experiments testing the capacity of the reconstruction process, by using synthetic stimuli created under a Gaussian process in a noise-free way. But this further raised questions: what is the theoretical capability for the reconstruction of this processing pipeline, as a whole? Is 0.563 the best that one could achieve given the noisiness and/or neuron count of the Sensorium project? What if the team applied the pipeline to reconstruct the activity of a given artificial neural network's layer (e.g., some ResNet convolutional layer), using hidden units as proxies for neuronal calcium activity? 

      That’s a very interesting point. It is very hard to know what the theoretical best reconstruction performance of the model would be. Reconstruction performance could be decreased due to neural variability, experimental noise, the temporal kernel of the calcium indicator and the imaging frame rate, information compression along the visual hierarchy, visual processing phenomena (such as predictive coding and selective attention), failure of the model to predict neural activity correctly, or failure of the reconstruction process to find the best possible image which explains the neural activity. We don't think we can disentangle the contribution of all these sources, but we can provide a theoretical maximum assuming that the model and the reconstruction process are optimal. To that end, we performed additional simulations and reconstructed the natural videos using the predicted activity of the neurons in response to the natural videos as the target (similar to the synthetic stimuli) and got a correlation of 0.766. So, the single trial performance of 0.569 is ~75% of this theoretical maximum. This difference can be interpreted as a combination of the losses due to neuronal variability, measurement noise, and actual deviations in the images represented by the brain compared to reality. 
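
The ratio quoted above follows directly from the two correlation values; the short calculation below only restates that arithmetic, with variable names chosen for this example.

```python
# Values quoted in the response above.
single_trial_corr = 0.569  # reconstruction from recorded single-trial activity
noise_free_corr = 0.766    # reconstruction from model-predicted activity

# Fraction of the noise-free ceiling reached by single-trial reconstructions.
fraction_of_ceiling = single_trial_corr / noise_free_corr
# Absolute gap attributed to variability, noise, and representational deviations.
shortfall = noise_free_corr - single_trial_corr

print(f"fraction of ceiling: {fraction_of_ceiling:.3f}")
print(f"absolute shortfall:  {shortfall:.3f}")
```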

      We thank the reviewer for this suggestion, as it gave us the idea of looking at error maps (Figure 6), where the pixel-level deviation of the reconstructions from recorded vs predicted activity is overlaid on the ground truth movie.

      (6) As the authors mentioned, this reconstruction method provided a more accurate way to investigate how neurons process visual information. However, this method consisted of two parts: one was the state-of-the-art (SOTA) dynamic neural encoding model (DNEM), which predicts neuronal activity from the input video, and the other part reconstructed the video to produce a response similar to the predicted neuronal activity. Therefore, the reconstructed video was related to neuronal activity through an intermediate model (i.e., SOTA DNEM). If one observes a failure in reconstructing certain visual features of the video (for example, high-spatial frequency details), the reader does not know whether this failure was due to a lack of information in the neural code itself or a failure of the neuronal model to capture this information from the neural code (assuming a perfect reconstruction process). Could the authors address this by outlining the limitations of the SOTA DNEM encoding model and disentangling failures in the reconstruction from failures in the encoding model? 

      To test if a better neural prediction by the DNEM would result in better reconstructions, we ran additional simulations and now show that neural activity prediction performance correlates with reconstruction performance (Supplementary Figure 4B). This is consistent with Pierzchlewicz et al. (2023), who showed that using better encoding models for static image reconstruction leads to better reconstruction performance. As also mentioned in the answer to the previous comment, untangling the relative contributions of reconstruction losses is hard, but we think that improvements to the DNEM performance are key. Suggestions for improving the DNEM we used would be to translate the input image into retinotopic coordinates and shift this image relative to eye position before passing it to the first convolutional layer (as is done in Wang et al. 2025), to use movies which are not spatially downsampled as heavily, to not use a dilation of 2 in the temporal convolution of the first layer, and to train on a larger dataset.

      (7) The authors mentioned that a key factor in achieving high-quality reconstructions was model ensembling. However, this averaging acts as a form of smoothing, which reduces the reconstruction's acuity and may limit the high-frequency content of the videos (as mentioned in the manuscript). This averaging constrains the tool's capacity to assess how visual neurons process the low-frequency content of visual input. Perhaps the authors could elaborate on potential approaches to address this limitation, given the critical importance of high-frequency visual features for our visual perception. 

      This is exactly what we also thought. To answer this point more specifically, we ran additional simulations in which we reconstructed the movies using gradient ensembling instead of reconstruction ensembling. Here, the gradients of the loss with respect to each pixel of the movie are calculated for each of the model instances and averaged at every iteration of the reconstruction optimization. In essence, this means that one reconstruction solution is found, and the averaging across reconstructions, which could degrade high-frequency content, is skipped. The reconstructions from both methods look very similar, and the video correlation with gradient ensembling is, if anything, slightly worse (Supplemental Figure 3A&C). This indicates that our original ensembling approach did not limit reconstruction performance, and that both approaches can be used, depending on what is more convenient given hardware restrictions.
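
The difference between the two ensembling schemes can be made concrete with a toy sketch. It replaces the DNEM with small linear encoders and uses a squared-error loss with plain gradient descent; these choices, along with all names and numbers, are assumptions for illustration and do not reflect the actual model, loss, or optimizer.

```python
def predict(weights, stimulus):
    """Linear toy encoder: one predicted response per row of weights."""
    return [sum(w * s for w, s in zip(row, stimulus)) for row in weights]


def loss_gradient(weights, stimulus, target):
    """Gradient of 0.5 * sum((W s - y)^2) with respect to the stimulus s."""
    residual = [p - t for p, t in zip(predict(weights, stimulus), target)]
    return [
        sum(weights[n][p] * residual[n] for n in range(len(weights)))
        for p in range(len(stimulus))
    ]


def reconstruction_ensembling(models, target, n_pixels, steps=500, lr=0.1):
    """Optimize one stimulus per model, then average the finished stimuli."""
    solutions = []
    for weights in models:
        stim = [0.0] * n_pixels
        for _ in range(steps):
            grad = loss_gradient(weights, stim, target)
            stim = [s - lr * g for s, g in zip(stim, grad)]
        solutions.append(stim)
    return [sum(col) / len(solutions) for col in zip(*solutions)]


def gradient_ensembling(models, target, n_pixels, steps=500, lr=0.1):
    """Optimize a single stimulus using the model-averaged gradient."""
    stim = [0.0] * n_pixels
    for _ in range(steps):
        grads = [loss_gradient(w, stim, target) for w in models]
        mean_grad = [sum(col) / len(grads) for col in zip(*grads)]
        stim = [s - lr * g for s, g in zip(stim, mean_grad)]
    return stim


# Toy setup: 2 "pixels", 3 "neurons", two slightly different model instances.
true_stimulus = [1.0, -0.5]
target_activity = [1.0, -0.5, 0.5]  # responses of an ideal encoder
model_a = [[1.1, 0.0], [0.0, 0.9], [1.0, 1.0]]
model_b = [[0.9, 0.0], [0.0, 1.1], [1.0, 1.0]]

from_recon_ens = reconstruction_ensembling([model_a, model_b], target_activity, 2)
from_grad_ens = gradient_ensembling([model_a, model_b], target_activity, 2)
```

In this toy setting both schemes land close to the true stimulus and to each other. Reconstruction ensembling runs one independent optimization per model instance, whereas gradient ensembling needs all instances available at every step, which is the hardware trade-off mentioned above.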

      Reviewer #3 (Public review): 

      Summary: 

      This paper presents a method for reconstructing input videos shown to a mouse from the simultaneously recorded visual cortex activity (two-photon calcium imaging data). The publicly available experimental dataset is taken from a recent brain-encoding challenge, and the (publicly available) neural network model that serves to reconstruct the videos is the winning model from that challenge (by distinct authors). The present study applies gradient-based input optimization by backpropagating the brain-encoding error through this selected model (a method that has been proposed in the past, with other datasets). The main contribution of the paper is, therefore, the choice of applying this existing method to this specific dataset with this specific neural network model. The quantitative results appear to go beyond previous attempts at video input reconstruction (although measured with distinct datasets). The conclusions have potential practical interest for the field of brain decoding, and theoretical interest for possible future uses in functional brain exploration. 

      Strengths: 

      The authors use a validated optimization method on a recent large-scale dataset, with a state-of-the-art brain encoding model. The use of an ensemble of 7 distinct model instances (trained on distinct subsets of the dataset, with distinct random initializations) significantly improves the reconstructions. The exploration of the relation between reconstruction quality and the number of recorded neurons will be useful to those planning future experiments. 

      Weaknesses: 

      The main contribution is methodological, and the methodology combines pre-existing components without any new original components. 

      We thank the reviewer for taking the time to review our paper and for their overall positive assessment. We would like to emphasise that combining pre-existing machine learning techniques to achieve top results in a new modality does require iteration and innovation. While gradient-based input optimization by backpropagating the brain-encoding error through a neural encoding model has been used in 2D static image optimization to generate maximally exciting images and reconstruct static images, we are the first to have applied it to movies which required accounting for the time domain. Previous methods used time averaged responses and were limited to the reconstruction of static images presented with fixed image intervals.

      The movie reconstructions include a learned "transparency mask" to concentrate on the most informative area of the frame; it is not clear how this choice impacts the comparison with prior experiments. Did they all employ this same strategy? If not, shouldn't the quantitative results also be reported without masking, for a fair comparison? 

      Yes, absolutely. All reconstruction approaches limit the field of view in some way, whether this is due to the size of the screen, the size of the image on the screen, or cropping of the presented/reconstructed images during analysis due to the retinotopic coverage of the recorded neurons. Note that we reconstruct a larger field of view than Yoshida et al. In Yoshida et al., the reconstructed field of view was 43 by 43 retinal degrees; we show the size of an example evaluation mask in comparison.

      To address the reviewer’s concern more specifically, we performed additional simulations and now also show the performance using a variety of different training and evaluation masks, including different alpha thresholds for training and evaluation masks as well as the effective retinotopic coverage at different alpha thresholds. Despite these comparisons, we would also like to highlight that the comparison to the benchmark is problematic itself. This is because image and movie reconstruction are not directly comparable. It does not make sense to train and apply a dynamic model on a static image dataset where neural activity is time averaged. Conversely, it does not make sense to train or apply a static model that expects time-averaged neural responses on continuous neural activity unless it is substantially augmented to incorporate temporal dynamics, which in turn would make it a new method. This puts us in the awkward position of being expected to compare our video reconstruction performance to previous image reconstruction methods without a fair way of doing so. We have therefore de-emphasised the phrasing comparing our method to previous publications in the abstract, results, and discussion. 

      Abstract: We replaced “We achieve a ~2-fold increase in pixel-by-pixel correlation compared to previous state-of-the-art reconstructions of static images from mouse V1, while also capturing temporal dynamics.” with “We achieve a pixel-level correlation of 0.57 between the ground truth movie and the reconstructions from single-trial neural responses.”

      Results: “This represents a ~2x higher pixel-level correlation over previous single-trial static image reconstructions from V1 in awake mice (image correlation 0.238 +/- 0.054 s.e.m for awake mice) [Yoshida et al., 2020] over a similar retinotopic area (~43° x 43°) while also capturing temporal dynamics. However, we would like to stress that directly comparing static image reconstruction methods with movie reconstruction approaches is fundamentally problematic, as they rely on different data types both during training and evaluation (temporally averaged vs continuous neural activity, images flashed at fixed intervals vs continuous movies).”

      Discussion: We replaced “In conclusion, we reconstruct videos presented to mice based on the activity of neurons in the mouse visual cortex, with a ~2-fold improvement in pixel-by-pixel correlation compared to previous static image reconstruction methods.” with “In conclusion, we reconstruct videos presented to mice based on single-trial activity of neurons in the mouse visual cortex.”

      We have also removed the performance table and have instead added supplementary figure 3 with in-depth comparison across different versions of our reconstruction method (variations of masking, ensembling, contrast & luminance matching, and Gaussian blurring). 

      We believe that we have given enough information in our paper now so that readers can make an informed decision whether our movie reconstruction method is appropriate for the questions they are interested in.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors): 

      (1) "Reconstructions have been luminance (mean pixel value across video) and contrast (standard deviation of pixel values across video) matched to ground truth." This was not clear: was it done by the investigating team? I imagine that one of the most easily captured visual features is luminance and contrast, why wouldn't the optimization titrate these well? 

      The contrast and luminance matching of the reconstructions to the ground truth videos was done by us, but this was only done to help readers assess the quality of the reconstructions by eye. Our performance metrics (frame and video correlation) are contrast and luminance insensitive. To clarify this, we have also added examples of non-adjusted frames in Supplementary Figure 3A, and added a sentence in the results, line 103: 

      “When presenting videos in this paper we normalize the mean and standard deviation of the reconstructions to the average and standard deviation of the corresponding ground truth movie before applying the evaluation masks, but this is not done for quantification except in Supplementary Figure 3D.”
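
      The matching described above can be sketched as follows. This is a minimal illustration, not the code used in the paper: the function names, the array shape (frames, height, width) and the synthetic data are assumptions, and the evaluation masks are left out. It shows why a correlation-based metric is unaffected by the adjustment: matching mean and standard deviation is a positive affine rescaling, which leaves Pearson correlation unchanged.

```python
import numpy as np

def match_luminance_contrast(recon, truth):
    """Rescale recon so its mean (luminance) and std (contrast),
    both taken over the whole video, equal those of truth."""
    recon_std = recon.std()
    if recon_std == 0:
        # A constant video has no contrast to rescale; return the target mean.
        return np.full_like(recon, truth.mean(), dtype=float)
    z = (recon - recon.mean()) / recon_std
    return z * truth.std() + truth.mean()

def pearson(a, b):
    """Pearson correlation over all pixels of two equally shaped arrays."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

# Synthetic stand-ins (assumption): a random "ground truth" video and a
# low-contrast, offset, noisy "reconstruction" of it.
rng = np.random.default_rng(0)
truth = rng.uniform(0, 255, size=(30, 36, 64))
recon = 0.2 * truth + 40 + rng.normal(0, 10, size=truth.shape)

matched = match_luminance_contrast(recon, truth)
r_before = pearson(recon, truth)    # correlation of the raw reconstruction
r_after = pearson(matched, truth)   # correlation after matching
```

      In this sketch `r_before` and `r_after` agree up to floating-point error, which is the sense in which the frame and video correlation metrics are contrast and luminance insensitive.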

      We were also initially surprised that contrast and luminance are not captured well by our reconstruction method, but this makes sense as V1 is largely luminance invariant (O’Shea et al., 2025 https://doi.org/10.1016/j.celrep.2024.115217 ) and contrast only has a gain effect on V1 activity (Tring et al., 2024 https://journals.physiology.org/doi/full/10.1152/jn.00336.2024). Decoding absolute contrast is likely unreliable because it is probably not the only factor modulating the overall gain of the neural population.

      To address the reviewer’s comment more fully, we ran additional experiments. More specifically, to test why contrast and luminance are not recovered in the reconstructions, we checked how the predicted activity between the reconstruction and the contrast/luminance corrected reconstructions differs. Contrast and luminance adjustment had little impact on predicted response similarity on average. This makes the reconstruction optimization loss function insensitive to overall contrast and luminance so it cannot be decoded. There is a small effect on activity correlation, however, so we cannot completely rule out that contrast and luminance could be reconstructed with a different loss function. 

      (2) The authors attempted to investigate the variability in reconstruction quality across different movies and 10-second snippets of a movie by correlating various visual features, such as video motion energy, contrast, luminance, and behavioral factors like running speed, pupil diameter, and eye movement, with reconstruction success. However, it would also be beneficial if the authors correlated the response loss (Poisson loss between neural responses) with reconstruction quality (video correlation) for individual videos, as these metrics are expected to be correlated if the reconstruction captures neural variance. 

      We thank the reviewer for this suggestion. We have now included this analysis and find that if the neural activity was better predicted by the DNEM then the reconstruction of the video was also more similar to the ground truth video. We further found that this effect is shift-dependent (in time), meaning the prediction of activity based on proximal video frames is more influential on reconstruction performance. 

      Reviewer #3 (Recommendations for the authors): 

      (1) I was confused about the choice of applying a transparency mask thresholded with alpha>0.5 during training and alpha>1 during evaluation. Why treat the two situations differently? Also, shouldn't we expect alpha to be in the [0,1] range, in which case, what is the meaning of alpha>1? (And finally, as already described in "Weaknesses", how does this choice impact the comparison with prior experiments? Did they also employ a similar masking strategy?) 

      We found that applying a mask during training increased performance regardless of the size of the evaluation mask. Using a less stringent mask during training than during evaluation increases performance slightly, but also allows inspection of the reconstruction in areas where the model will be less confident without sacrificing performance, if this is desired. The thresholds of 0.5 and 1 were chosen through trial and error, but the exact values do not hold intrinsic meaning. The alpha mask values can go above 1 during their optimization. We could have clipped alpha during the training procedure (algorithm 1), but we decided this was not worth redoing at this stage, as the alphas used for testing were not above 1. All reconstruction approaches in previous publications limit the field of view in some form, whether this is due to the size of the screen, the size of the image on the screen, or the cropping of the presented/reconstructed images during analysis. 
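
      A minimal sketch of thresholding a transparency mask at two different levels is given below. It is illustrative only: the Gaussian-shaped alpha map, its peak value of 1.2, the array shapes and the function names are assumptions, and the paper may apply the mask differently (for example as a soft weighting rather than a binary selection).

```python
import numpy as np

def binary_mask(alpha, threshold):
    """Boolean (height, width) mask of pixels whose alpha exceeds threshold."""
    return alpha > threshold

def masked_correlation(recon, truth, mask):
    """Pearson correlation restricted to the pixels selected by mask.
    recon, truth: (frames, height, width); mask: (height, width) boolean."""
    r = recon[:, mask].ravel()
    t = truth[:, mask].ravel()
    return float(np.corrcoef(r, t)[0, 1])

# Hypothetical alpha map: a smooth bump whose peak exceeds 1, mimicking
# alpha values that were not clipped during optimization.
yy, xx = np.mgrid[0:36, 0:64]
alpha = 1.2 * np.exp(-(((yy - 18) / 10.0) ** 2 + ((xx - 32) / 16.0) ** 2))

train_mask = binary_mask(alpha, 0.5)  # less stringent, used during training
eval_mask = binary_mask(alpha, 1.0)   # more stringent, used for evaluation

# Synthetic videos (assumption) to exercise the masked metric.
rng = np.random.default_rng(0)
truth = rng.uniform(0, 255, size=(30, 36, 64))
recon = truth + rng.normal(0, 50, size=truth.shape)
r_eval = masked_correlation(recon, truth, eval_mask)
r_train = masked_correlation(recon, truth, train_mask)
```

      Because the evaluation threshold is higher, every pixel in `eval_mask` is also in `train_mask`, so the evaluated region is a subset of the region constrained during training.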

      To address the reviewer’s comment in detail, we have added extensive additional analysis to evaluate the coverage of the reconstruction achieved in this paper and how different masking strategies affect performance, as well as how the mask relates to more traditional receptive field mapping.  

      (2) I would not use the word "imagery" in the first sentence of the abstract, because this might be interpreted by some readers as reconstruction of mental imagery, a very distinct question. 

      We changed imagery to images in the abstract.

      (3) Line 145-146: "<1 frame, or <30Hz" should be "<1 frame, or >30Hz". 

      We have corrected the error.

      (4) Algorithm 1, Line 5, a subscript variable 'g' should be changed to 'h'

      We have corrected the error.

      Additional Changes

      (1) Minor grammatical errors

      (2) Addition of citations: We were previously not aware of a bioRxiv preprint from 2022 (Cobos et al., 2022), which used gradient descent-based input optimization to reconstruct static images but without the addition of a diffusion model. Instead, we had cited for this method Pierzchlewicz et al., 2023 bioRxiv/NeurIPS. In Cobos et al., 2022, they compare static image reconstruction similarity to ground truth images and the similarity of the in vivo evoked activity across multiple reconstruction methods. Performance values are only given for reconstructions from trial-averaged responses across ~40 trials (in the absence of original data or code we are also not able to retrospectively calculate single-trial performance). The authors find that optimizing for evoked activity rather than image similarity produces image reconstructions that evoke more similar in vivo responses compared to reconstructions optimized for image similarity itself. We have now added and discussed the citation in the main text. 

      (3) Workaround for an error in the video hashing function of the open-source SOTA DNEM code (https://github.com/lRomul/sensorium): By checking the most correlated first frame for each reconstructed movie, we discovered a bug in the open-source code: 9/50 movies we originally used for reconstruction were not properly excluded from the training data between DNEM instances. The reason for this error was that some of the movies differ by only a few pixels, and the video hashing function used to split training and test set folds in the original DNEM code classified these movies as different and split them across folds. We have replaced these 9 movies and provide a figure below showing the next closest first frame for every movie clip we reconstruct. This does not affect our claims. Excluding these 9 movie clips did not affect the reconstruction performance (video correlation went from 0.563 to 0.568), so there was no overestimation of performance due to test set contamination. However, they should still be removed, so some of the values in the paper have changed slightly. The only statistical test that was affected was the correlation between video correlation and mean motion energy (Supplementary Figure 4A), which went from p = 0.043 to 0.071. 
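
      The failure mode and the check described above can be illustrated with a short sketch. It is not the code from the sensorium repository: the SHA-256 hash stands in for whatever hashing function that code uses, and the function names, frame size and synthetic frames are assumptions. The point is that an exact hash treats two frames differing by a few pixels as unrelated, whereas the most correlated first frame still identifies the near-duplicate.

```python
import hashlib
import numpy as np

def frame_hash(frame):
    """Exact content hash of a frame (stand-in for a video hashing function)."""
    return hashlib.sha256(np.ascontiguousarray(frame).tobytes()).hexdigest()

def most_correlated_first_frame(test_frame, train_first_frames):
    """Return (index, correlation) of the training first frame that is most
    correlated with test_frame. Assumes no frame is constant."""
    t = test_frame.ravel().astype(float)
    t = t - t.mean()
    X = train_first_frames.reshape(len(train_first_frames), -1).astype(float)
    X = X - X.mean(axis=1, keepdims=True)
    corr = (X @ t) / (np.linalg.norm(X, axis=1) * np.linalg.norm(t))
    idx = int(np.argmax(corr))
    return idx, float(corr[idx])

# Synthetic first frames (assumption): five random "training" frames and a
# "test" frame that copies training frame 3 except for three pixels.
rng = np.random.default_rng(1)
train = rng.integers(0, 256, size=(5, 36, 64), dtype=np.uint8)
test = train[3].copy()
test[0, :3] = 255 - test[0, :3]

same_hash = frame_hash(test) == frame_hash(train[3])  # exact hashes differ
idx, corr = most_correlated_first_frame(test, train)  # finds frame 3 anyway
```

      Any correlation cut-off used to flag a clip as a duplicate would be a further assumption; the response only states that the most correlated first frame was inspected for each reconstructed movie.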

      Author response image 2.

      Exclusion of movie clips with duplicates in the DNEM training data. A) Example frame of a reconstructed movie (ground truth) and the most correlated first frame from the training data. B) All movie clips and their corresponding most correlated clip from the training data. Red boxes indicate excluded duplicates. 

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review)

      Major:

      (1) In line 76, the authors make a very powerful statement: 'σRNN simulation achieves higher similarity with unseen recorded trials before perturbation, but lower than the bioRNN on perturbed trials.' I couldn't find a figure showing this. This might be buried somewhere and, in my opinion, deserves some spotlight - maybe a figure or even inclusion in the abstract.

      We agree with the reviewer that these results are important. The failure of σRNN on perturbed data could be inferred from the former Figures 1E, 2C-E, and 3D. Following the reviewers' comments, we have tried to make this the most prominent message of Figure 1, in particular with the addition of the new panel E. We also moved Table 1 from the  Supplementary to the main text to highlight this quantitatively. 

      (2) It's mentioned in the introduction (line 84) and elsewhere (e.g., line 259) that spiking has some advantage, but I don't see any figure supporting this claim. In fact, spiking seems not to matter (Figure 2C, E). Please clarify how spiking improves performance, and if it does not, acknowledge that. Relatedly, in line 246, the authors state that 'spiking is a better metric but not significant' when discussing simulations. Either remove this statement and assume spiking is not relevant, or increase the number of simulations.

      We could not find the exact quote from the reviewer, and we believe that they intended to quote “spiking is better on all metrics, but without significant margins”. Indeed, spiking did not improve the fit significantly on perturbed trials; this is particularly clear in comparison with the benefits of Dale’s law and local inhibition. As suggested by the reviewer, we rephrased the sentence containing this quote and, more generally, the corresponding paragraphs in the intro (lines 83-87) and in the results (lines 245-271). Our corrections in the results section are also intended to address the minor point (4) raised by the same reviewer.

      (3) The authors prefer the metric of predicting hits over MSE, especially when looking at real data (Figure 3). I would bring the supplementary results into the main figures, as both metrics are very nicely complementary. Relatedly, why not add Pearson correlation or R2, and not just focus on MSE Loss?

      For the in vivo data in Figure 3, the dataset does not include simultaneous electrophysiological recordings and optogenetic stimulation; the two were performed in different recording sessions. Therefore, we can only compare the effect of optogenetics on the behavior, and we cannot compute Pearson correlation or R2 of the perturbed network activity. To avoid ambiguity, we wrote “For the sessions of the in vivo dataset with optogenetic perturbation that we considered, only the behavior of an animal is recorded” on line 294. 

      (4) I really like the 'forward-looking' experiment in closed loop! But I felt that the relevance of micro perturbations is very unclear in the intro and results. This could be better motivated: why should an experimentalist care about this forward-looking experiment? Why exactly do we care about micro perturbation (e.g., in contrast to non-micro perturbation)? Relatedly, I would try to explain this in the intro without resorting to technical jargon like 'gradients'.

      As suggested, we updated the last paragraph of the introduction (lines 88 - 95) to give better motivation for why algorithmically targeted acute spatio-temporal perturbations can be important to dissect the function of neural circuits. We also added citations to recent studies with targeted in vivo optogenetic stimulation. As far as we know, previous work on targeted network stimulation mostly used linear models, whereas we used non-linear RNNs and their gradients.

      Minor:

      (1) In the intro, the authors refer to 'the field' twice. Personally, I find this term odd. I would opt for something like 'in neuroscience'.

      We implemented the suggested change: l.27 and l.30

      (2) Line 45: When referring to previous work using data-constrained RNN models, Valente et al. is missing (though it is well cited later when discussing regularization through low-rank constraints)

      We added the citation: l.45

      (3) Line 11: Method should be methods (missing an 's').

      We fixed the typo.

      (4) In line 250, starting with 'So far', is a strange choice of presentation order. After interpreting the results for other biological ingredients, the authors introduce a new one. I would first introduce all ingredients and then interpret. It's telling that the authors jump back to 2B after discussing 2C.

      We restructured the last two paragraphs of section 2.1, and we hope that the presentation order is now more logical.

      (5) The black dots in Figure 3E are not explained, or at least I couldn't find an explanation.

      We added an explanation in the caption of Figure 3E.

      Reviewer #2 (Public review):

      (1) Some aspects of the methods are unclear. For comparisons between recurrent networks trained from randomly initialized weights, I would expect that many initializations were made for each model variant to be compared, and that the performance characteristics are constructed by aggregating over networks trained from multiple random initializations. I could not tell from the methods whether this was done or how many models were aggregated.

      The expectation of the reviewer is correct: we trained multiple models with different random seeds (affecting both the weight initialization and the noise of our model) for each variant and aggregated the results. We have now clarified this in Methods 4.6, lines 658-662.

      (2) It is possible that including perturbation trials in the training sets would improve model performance across conditions, including held-out (untrained) perturbations (for instance, to units that had not been perturbed during training). It could be noted that if perturbations are available, their use may alleviate some of the design decisions that are evaluated here.

      In general, we agree with the reviewer that including perturbation trials in the training set would likely improve model performance across conditions. One practical limitation, which partly explains why we did not do so with our dataset, is the small number of perturbed trials for each targeted cortical area: there are too few trials with light perturbations to robustly train and test our models.

      More profoundly, to test hard generalizations to perturbations (aka perturbation testing), it will always be necessary that the perturbations are not trivially represented in the training data. Including perturbation trials during training would compromise our main finding: some biological model constraints improve the generalization to perturbation. To test this claim, it was necessary to keep the perturbations out of the training data.

      We agree that including all available data of perturbed and non-perturbed recordings would be useful to build the best generalist predictive system. It could help, for instance, for closed-loop circuit control as we studied in Figure 5. Yet, there too, it will be important for the scientific validation process to always keep some causal perturbations of interest out of the training set. This is necessary to fairly measure the real generalization capability of any model. Importantly, this is why we think out-of-distribution “perturbation testing” is likely to have a recurring impact in the years to come, even beyond the case of optogenetic inactivation studied in detail in our paper.

      Recommendation for the authors:

      Reviewer #1 (Recommendation for the authors):

      The code is not very easy to follow. I know this is a lot to ask, but maybe make clear where the code is to train the different models, which I think is a great contribution of this work? I predict that many readers will want to use the code and so this will improve the impact of this work.

      We updated the code to make it easier to train a model from scratch.

      Reviewer #2 (Recommendation for the authors):

      The figures are really tough to read. Some of that small font should be sized up, and it's tough to tell in the posted paper what's happening in Figure 2B.

      We updated Figures 1 and 2 significantly, in part to increase their readability. We also implemented the "Superficialities" suggestions.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review): 

      Summary: 

      The authors analyzed the expression of ATAD2 protein in post-meiotic stages and characterized the localization of various testis-specific proteins in the testis of the Atad2 knockout (KO). By cytological analysis as well as ATAC sequencing, the study showed increased levels of the HIRA histone chaperone, accumulation of histone H3.3 on post-meiotic nuclei, defective chromatin accessibility, and delayed deposition of protamines. Sperm from the Atad2 KO mice show reduced success in in vitro fertilization. The work was performed well, and most of the results are convincing. However, this manuscript does not suggest a molecular mechanism for how ATAD2 promotes the formation of testis-specific chromatin. 

      We would like to take this opportunity to highlight that the present study builds on our previously published work, which examined the function of ATAD2 in both yeast S. pombe and mouse embryonic stem (ES) cells (Wang et al., 2021). In yeast, using genetic analysis we showed that inactivation of HIRA rescues defective cell growth caused by the absence of ATAD2. This rescue could also be achieved by reducing histone dosage, indicating that the toxicity depends on histone over-dosage, and that HIRA toxicity, in the absence of ATAD2, is linked to this imbalance.

      Furthermore, HIRA ChIP-seq performed in mouse ES cells revealed increased nucleosome-bound HIRA, particularly around transcription start sites (TSS) of active genes, along with the appearance of HIRA-bound nucleosomes within normally nucleosome-free regions (NFRs). These findings pointed to ATAD2 as a major factor responsible for unloading HIRA from nucleosomes. This unloading function may also apply to other histone chaperones, such as FACT (see Wang et al., 2021, Fig. 4C).

      In the present study, our investigations converge on the same ATAD2 function in the context of a physiologically integrated mammalian system: spermatogenesis. Indeed, in the absence of ATAD2, we observed H3.3 accumulation and enhanced H3.3-mediated gene expression. Consistent with this functional model of ATAD2 (unloading chaperones from histone- and non-histone-bound chromatin), we also observed defects in histone-to-protamine replacement.

      Together, the results presented here and in Wang et al. (2021) reveal an underappreciated regulatory layer of histone chaperone activity. Previously, histone chaperones were primarily understood as factors that load histones. Our findings demonstrate that we must also consider a previously unrecognized regulatory mechanism that controls assembled histone-bound chaperones. This key point was clearly captured and emphasized by Reviewer #2 (see below).

      Strengths:

      The paper describes the role of ATAD2 AAA+ ATPase in the proper localization of sperm-specific chromatin proteins such as protamine, suggesting the importance of the DNA replication-independent histone exchanges with the HIRA-histone H3.3 axis. 

      Weaknesses: 

      (1) Some results lack quantification. 

      We will consider all the data and add appropriate quantifications where necessary.

      (2) The work was performed well, and most of the results are convincing. However, this manuscript does not suggest a molecular mechanism for how ATAD2 promotes the formation of testis-specific chromatin. 

      Please see our comments above.

      Reviewer #2 (Public review): 

      Summary:

      This manuscript by Liakopoulou et al. presents a comprehensive investigation into the role of ATAD2 in regulating chromatin dynamics during spermatogenesis. The authors elegantly demonstrate that ATAD2, via its control of histone chaperone HIRA turnover, ensures proper H3.3 localization, chromatin accessibility, and histone-to-protamine transition in post-meiotic male germ cells. Using a new well-characterized Atad2 KO mouse model, they show that ATAD2 deficiency disrupts HIRA dynamics, leading to aberrant H3.3 deposition, impaired transcriptional regulation, delayed protamine assembly, and defective sperm genome compaction. The study bridges ATAD2's conserved functions in embryonic stem cells and cancer to spermatogenesis, revealing a novel layer of epigenetic regulation critical for male fertility. 

      Strengths:

      The MS provides the first demonstration of ATAD2's essential role in spermatogenesis, linking its expression in haploid spermatids to histone chaperone regulation by connecting ATAD2-dependent chromatin dynamics to gene accessibility (ATAC-seq), H3.3-mediated transcription, and histone eviction. Interestingly and surprisingly, sperm chromatin defects in Atad2 KO mice impair only in vitro fertilization but not natural fertility, suggesting unknown compensatory mechanisms in vivo. 

      Weaknesses:

      The MS is robust and there are no major weaknesses. 

      Reviewer #3 (Public review): 

      Summary: 

      The authors generated knockout mice for Atad2, a conserved bromodomain-containing factor expressed during spermatogenesis. In Atad2 KO mice, HIRA, a chaperone for histone variant H3.3, was upregulated in round spermatids, accompanied by an apparent increase in H3.3 levels. Furthermore, the sequential incorporation and removal of TH2B and PRM1 during spermiogenesis were partially disrupted in the absence of ATAD2, possibly due to delayed histone removal. Despite these abnormalities, Atad2 KO male mice were able to produce offspring normally. 

      Strengths:

      The manuscript addresses the biological role of ATAD2 in spermatogenesis using a knockout mouse model, providing a valuable in vivo framework to study chromatin regulation during male germ cell development. The observed redistribution of H3.3 in round spermatids is clearly presented and suggests a previously unappreciated role of ATAD2 in histone variant dynamics. The authors also document defects in the sequential incorporation and removal of TH2B and PRM1 during spermiogenesis, providing phenotypic insight into chromatin transitions in late spermatogenic stages. Overall, the study presents a solid foundation for further mechanistic investigation into ATAD2 function. 

      Weaknesses:

      While the manuscript reports the gross phenotype of Atad2 KO mice, the findings remain largely superficial and do not convincingly demonstrate how ATAD2 deficiency affects chromatin dynamics. Moreover, the phenotype appears too mild to elucidate the functional significance of ATAD2 during spermatogenesis. 

      We respectfully disagree with the statement that our findings are largely superficial. Based on our investigations of this factor over the years, it has become evident that ATAD2 functions as an auxiliary factor that facilitates mechanisms controlling chromatin dynamics (see, for example, Morozumi et al., 2015). These mechanisms can still occur in the absence of ATAD2, but with reduced efficiency, which explains the mild phenotype we observed.

      This function, while not essential, is nonetheless an integral part of the cell’s molecular biology and should be studied and brought to the attention of the broader biological community, just as we study essential factors. Unfortunately, the field has tended to focus primarily on core functional actors, often overlooking auxiliary factors. As a result, our decade-long investigations into the subtle yet important roles of ATAD2 have repeatedly been met with skepticism regarding its functional significance, which has in turn influenced editorial decisions.

      We chose eLife as the venue for this work specifically to avoid such editorial barriers and to emphasize that facilitators of essential functions do exist. They deserve to be investigated, and the underlying molecular regulatory mechanisms must be understood.

      (1) Figures 4-5: The analyses of differential gene expression and chromatin organization should be more comprehensive. First, Venn diagrams comparing the sets of significantly differentially expressed genes between this study and previous work should be shown for each developmental stage. Second, given the established role of H3.3 in MSCI, the effect of Atad2 knockout on sex chromosome gene expression should be analyzed. Third, integrated analysis of RNA-seq and ATAC-seq data is needed to evaluate how ATAD2 loss affects gene expression. Finally, H3.3 ChIP-seq should be performed to directly assess changes in H3.3 distribution following Atad2 knockout.  

      (1) In the revised version, we will include Venn diagrams to illustrate the overlap in significantly differentially expressed genes between this study and previous work. However, we believe that the GSEAs presented here provide stronger evidence, as they indicate the statistical significance of this overlap (p-values). In our case, we observed p-value < 0.01 (**) and p < 0.001 (***).

      (2) Sex chromosome gene expression was analyzed and is presented in Fig. 5C.

      (3) The effect of ATAD2 loss on gene expression is shown in Fig. 4A, B, and C as histograms, with statistical significance indicated in the middle panels.

      (4) Although mapping H3.3 incorporation across the genome in wild-type and Atad2 KO cells would have been informative, the available anti-H3.3 antibody did not work for ChIP-seq, at least in our hands. The authors of Fontaine et al., 2022, who studied H3.3 during spermatogenesis in mice, must have encountered the same problem, since they tagged the endogenous H3.3 gene to perform their ChIP experiments.

      (2) Figure 3: The altered distribution of H3.3 is compelling. This raises the possibility that histone marks associated with H3.3 may also be affected, although this has not been investigated. It would therefore be important to examine the distribution of histone modifications typically associated with H3.3. If any alterations are observed, ChIP-seq analyses should be performed to explore them further.

      Based on our understanding of ATAD2’s function—specifically its role in releasing chromatin-bound HIRA—in the absence of ATAD2 the residence time of both HIRA and H3.3 on chromatin increases. This results in the detection of H3.3 not only on sex chromosomes but across the genome. Our data provide clear evidence of this phenomenon. The reviewer is correct in suggesting that the accumulated H3.3 would carry H3.3-associated histone PTMs; however, we are unsure what additional insights could be gained by further demonstrating this point.

      (3) Figure 7: While the authors suggest that pre-PRM2 processing is impaired in Atad2 KO, no direct evidence is provided. It is essential to conduct acid-urea polyacrylamide gel electrophoresis (AU-PAGE) followed by western blotting, or a comparable experiment, to substantiate this claim. 

      Figure 7 does not suggest that pre-PRM2 processing is affected in Atad2 KO; rather, this figure—particularly Fig. 7B—specifically demonstrates that pre-PRM2 processing is impaired, as shown using an antibody that recognizes the processed portion of pre-PRM2. ELISA was used to provide a more quantitative assessment; however, in the revised manuscript we will also include a western blot image.

      (4) HIRA and ATAD2: Does the upregulation of HIRA fully account for the phenotypes observed in Atad2 KO? If so, would overexpression of HIRA alone be sufficient to phenocopy the Atad2 KO phenotype? Alternatively, would partial reduction of HIRA (e.g., through heterozygous deletion) in the Atad2 KO background be sufficient to rescue the phenotype? 

      These are interesting experiments that require the creation of appropriate mouse models, which are not currently available.

      (5) The mechanism by which ATAD2 regulates HIRA turnover on chromatin and the deposition of H3.3 remains unclear from the manuscript and warrants further investigation. 

      The Reviewer is absolutely correct. In addition to the points addressed in response to Reviewer #1’s general comments (see above), it would indeed have been very interesting to test the segregase activity of ATAD2 (likely driven by its AAA ATPase activity) through in vitro experiments using the Xenopus egg extract system described by Tagami et al., 2004. This system can be applied both in the presence and absence (via immunodepletion) of ATAD2 and would also allow the use of ATAD2 mutants, particularly those with inactive AAA ATPase or bromodomains. However, such experiments go well beyond the scope of this study, which focuses on the role of ATAD2 in chromatin dynamics during spermatogenesis.

      References:

      (1) Wang T, Perazza D, Boussouar F, Cattaneo M, Bougdour A, Chuffart F, Barral S, Vargas A, Liakopoulou A, Puthier D, Bargier L, Morozumi Y, Jamshidikia M, Garcia-Saez I, Petosa C, Rousseaux S, Verdel A, Khochbin S. ATAD2 controls chromatin-bound HIRA turnover. Life Sci Alliance. 2021 Sep 27;4(12):e202101151. doi: 10.26508/lsa.202101151. PMID: 34580178; PMCID: PMC8500222.

      (2) Morozumi Y, Boussouar F, Tan M, Chaikuad A, Jamshidikia M, Colak G, He H, Nie L, Petosa C, de Dieuleveult M, Curtet S, Vitte AL, Rabatel C, Debernardi A, Cosset FL, Verhoeyen E, Emadali A, Schweifer N, Gianni D, Gut M, Guardiola P, Rousseaux S, Gérard M, Knapp S, Zhao Y, Khochbin S. Atad2 is a generalist facilitator of chromatin dynamics in embryonic stem cells. J Mol Cell Biol. 2016 Aug;8(4):349-62. doi: 10.1093/jmcb/mjv060. Epub 2015 Oct 12. PMID: 26459632; PMCID: PMC4991664.

      (3) Fontaine E, Papin C, Martinez G, Le Gras S, Nahed RA, Héry P, Buchou T, Ouararhni K, Favier B, Gautier T, Sabir JSM, Gerard M, Bednar J, Arnoult C, Dimitrov S, Hamiche A. Dual role of histone variant H3.3B in spermatogenesis: positive regulation of piRNA transcription and implication in X-chromosome inactivation. Nucleic Acids Res. 2022 Jul 22;50(13):7350-7366. doi: 10.1093/nar/gkac541. PMID: 35766398; PMCID: PMC9303386.

      (4) Tagami H, Ray-Gallet D, Almouzni G, Nakatani Y. Histone H3.1 and H3.3 complexes mediate nucleosome assembly pathways dependent or independent of DNA synthesis. Cell. 2004 Jan 9;116(1):51-61. doi: 10.1016/s0092-8674(03)01064-x. PMID: 14718166.

      Recommendations for the authors:

      Reviewing Editor Comments:

      I note that the reviewers had mixed opinions about the strength of the evidence in the manuscript. A revision that addresses these points would be welcome.

      Reviewer #1 (Recommendations for the authors):  

      Major points: 

      (1) No line numbers: It is hard to point out the issues.

      The revised version now includes line numbers.

      (2) Given the results shown in Figure 3 and Figure 4, it would be nice to show the chromosomal localization of histone H3.3 in spermatocytes or post-meiotic cells by chromatin immunoprecipitation sequencing (ChIP-seq).

      Although mapping H3.3 incorporation across the genome in wild-type and Atad2 KO cells would have been informative, the available anti-H3.3 antibody did not work for ChIP-seq in our hands. In fact, this antibody is not well regarded for ChIP-seq. For example, Fontaine et al. (2022), who investigated H3.3 during spermatogenesis in mice, circumvented this issue by tagging the endogenous H3.3 genes for their ChIP experiments.

      (3) Figure 7B and 8: Why the authors used ELISA for the protein quantification. At least, western blotting should be shown.

      ELISA is a more quantitative method than traditional immunoblotting. Nevertheless, as requested by the reviewer, we have now included a corresponding western blot in Fig. S3.

      (4) For readers, please add a schematic pathway of histone-protamine replacement in sperm formation in Fig.1 and it would be nice to have a model figure, which contains the authors' idea in the last figure.

      As requested by this reviewer, we have now included a schematic model in Figure 9 to summarize the main conclusions of our work.

      Minor points: 

      (1) Page 2, the second paragraph, "pre-PRM2: Please explain more about pre-PRM2 and/or PRM2 as well as PRM1 (Figure 6).

      More detailed descriptions of PRM2 processing are now given in this paragraph. 

      (2) Page 3, bottom paragraph, line 1: "KO" should be "knockout (KO)".

      Done.

      (3) Page 4, second paragraph bottom: Please explain more about the protein structure of germ-line-specific ATAD2S: how it is different from ATAD2L. Germ-line specific means it is also expressed in ovary?

      As Atad2 is predominantly expressed in embryonic stem cells and in spermatogenic cells, we have replaced “germ-line specific” with more appropriate terms throughout the text.

      (4) Figure 1C, western blotting: Wild-type testis extracts, both ATAD2L and -S are present. Does this mean that ATADS2L is expressed in both germ line as well as supporting cells. Please clarify this and, if possible, show the western blotting of spermatids well as spermatocytes.

      Figure 1D shows sections of seminiferous tubules from Atad2 KO mice, in which lacZ expression is driven by the endogenous Atad2 promoter. The results indicate that Atad2 is expressed mainly in post-meiotic cells. Most labeled cells are located near the lumen, whereas the supporting Sertoli cells remain unlabeled. Sertoli cells, which are anchored to the basal lamina, span the entire thickness of the germinal epithelium from the basal lamina to the lumen. Their nuclei, however, are usually positioned closer to the basal membrane. Thus, the observed lacZ expression pattern argues against substantial Atad2 expression in Sertoli cells. 

      (5) Figure 1C: Please explain a bit more about the reduction of ATAD2 proteins in heterozygous mice.

      Done

      (6) Figure 1C: Genotypes of the mice should be shown in the legend.

      Done 

      (7) Figure 1D: Please add a more magnified image of the sections to see the staining pattern in the seminiferous tubules.

      Higher magnification does not provide more information, because the structure of the cells within the tubules is lost owing to the treatment of the sections required for X-gal staining. Please see our response to Reviewer #2's comment on Figure 1C.

      (8) Page 5, first paragraph, line 2, histone dosage: What do the authors meant by the histone dosage? Please explain more or use more appropriate word.

      "Histone dosage" refers to the amount or relative abundance of histone proteins in a cell.

      (9) Figure 2A: Figure 2A: Given the result in Figure 1C, it is interesting to check the amount of HIRA in Atad2 heterozygous mice.

      In Atad2 heterozygous mice, we would expect an increase in HIRA, but only to about half the level seen in the Atad2 homozygous knockout shown in Figure 2A, which is relatively modest. Therefore, we doubt that detecting such a small change—approximately half of that in Figure 2A—would yield clear or definitive results. 

      (10) Figure 2A, legend (n=5): What does this "n" mean? The extract of testes from "5" male mice like Figure 2B. Or 5 independent experiments. If the latter is true, it is important to share the other results in the Supplements.

      “n” refers to five WT and five Atad2 KO males. The legend has been clarified as suggested by the reviewer.

      (11) Figure 2A, legend, line 2, Atad2: This should be italicized.

      Done

      (12) Figure 2B: Please show the quantification of amounts of HIRA protein like Fig. 2A.

      As indicated in the legend, what is shown is a pool of testes from 3 individuals per genotype.

      (13) Figure 2B shows an increased level of HIRA in Atad2 KO testis. This suggests the role of ATAD2 in the protein degradation of HIRA. This possibility should be mentioned or tested since ATAD2 is an AAA+ ATPase. 

      The extensive literature on ATAD2 provides no indication that it is involved in protein degradation. In our early work on ATAD2 in the 2000s, we hypothesized that, as a member of the AAA ATPase family, ATAD2 might associate with the 19S proteasome subunit (through multimerization with the other AAA ATPase member of this regulatory subunit). However, both our published pilot studies (Caron et al., PMID: 20581866) and subsequent unpublished work ruled out this possibility. Instead, since the amount of nucleosome-bound HIRA increases in the absence of ATAD2, we propose that chromatin-bound HIRA is more stable than soluble HIRA once it has been released from chromatin by ATAD2.

      (14) Page 6, second paragraph, line 5, ko: KO should be capitalized.

      Done

      (15) Page 6, second paragraph, line 2 from the bottom, chromatin dynamics: Throughout the text, the authors used "chromatin dynamics". However, all the authors analyzed in the current study is the localization of chromatin protein.  So, it is much easier to explain the results by using "chromatin status," etc. In this context, "accessibility" is better. 

      Throughout the text, we have replaced “chromatin dynamics” with a more precise term suited to each context.

      (16) Figure 3: Please provide the quantification of signals of histone H3.3 in a nucleus or nuclear cytoplasm.

      This request is not clear to us since we do not observe any H3.3 signal in the cytoplasm.

      (17) Figure 3: As the control of specificity in post-meiotic cells, please show the image and quantification of the H3.3 signals in spermatocyte, for example.

      This request is not clear to us. What specificity is meant? 

      (18) Figure 3, bottom panels: Please show what the white lines indicate? 

      The white lines indicate the boundary of the cell nucleus, as estimated from Hoechst staining. This is now indicated in the legend of the figure.

      (19) Figure 4A: Please explain more about what kind of data is here. Is this wild-type and/or Atad2 KO? The label of the Y-axis should be "mean expression level". What is the standard deviation (SD) here on the X-axis. Moreover, there is only one red open circle, but the number of this class is 5611. All 5611 genes in this group show NO expression. Please explain more.

      The plot displays the mean expression levels (y-axis, labeled as "mean expression level") versus the corresponding standard deviations (x-axis), both calculated from three independent biological replicates of isolated round spermatids (Atad2 wild-type and Atad2 KO). The standard deviation reflects the variability of gene expression across biological replicates. Genes were grouped into four categories (grp1: blue, grp2: cyan, grp3: green, grp4: orange) according to the quartile of their mean expression. For grp4, all genes have no detectable expression, resulting in a mean expression of zero and a standard deviation of zero; consequently, the 5611 genes in this group are represented by a single overlapping point (red open circle) at the origin. 
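
      The calculation behind such a plot can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the gene names and values are invented toy data, the sample standard deviation is an assumption, and the rank-based split into four equal groups is only one possible reading of "quartile of their mean expression".

```python
from statistics import mean, stdev

def summarize(counts):
    """Return {gene: (mean, sample SD)} across biological replicates."""
    return {gene: (mean(vals), stdev(vals)) for gene, vals in counts.items()}

def quartile_groups(summary):
    """Rank genes by mean expression (highest first) and split into grp1..grp4."""
    ranked = sorted(summary, key=lambda g: summary[g][0], reverse=True)
    n = len(ranked)
    return {gene: "grp%d" % (4 * i // n + 1) for i, gene in enumerate(ranked)}

# Hypothetical toy data: 8 genes, 3 replicates each (not from the study).
counts = {
    "geneA": [120.0, 100.0, 110.0],
    "geneB": [90.0, 95.0, 85.0],
    "geneC": [40.0, 50.0, 45.0],
    "geneD": [30.0, 20.0, 25.0],
    "geneE": [5.0, 7.0, 6.0],
    "geneF": [2.0, 1.0, 3.0],
    "geneG": [0.0, 0.0, 0.0],  # undetected in every replicate
    "geneH": [0.0, 0.0, 0.0],  # undetected in every replicate
}

summary = summarize(counts)
groups = quartile_groups(summary)

# Undetected genes all map to the same (mean, SD) point at the origin,
# which is why they overlap as a single marker on the plot.
origin_points = {summary[g] for g in groups if groups[g] == "grp4"}
```

      In this toy example every gene in the lowest group has a mean of zero and a standard deviation of zero, so all of its points coincide at the origin, as described for grp4.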

      (20) Figure 4C: If possible, it would be better to have a statistical comparison between wild-type and the KO.  

      The mean profiles are displayed together with their variability (± 2 s.e.m.) across the four replicates for both ATAD2 WT (blue) and ATAD2 KO (red). For groups 1, 2, and 3, the envelopes of the curves remain clearly separated around the peak, indicating a consistent difference in signal between the two conditions. In contrast, group 4 does not present a strong signal and, accordingly, no marked difference is observed between WT and KO in this group.
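
      As an illustration of the ± 2 s.e.m. envelopes described above, the sketch below computes a mean profile with its envelope and checks whether two envelopes overlap at a given position. The profiles are invented toy values, the function names are hypothetical, and the overlap check is a descriptive comparison under these assumptions rather than the authors' analysis.

```python
from math import sqrt
from statistics import mean, stdev

def envelope(profiles, k=2.0):
    """Per-position (lower, mean, upper) with bounds at mean +/- k * s.e.m."""
    n = len(profiles)
    result = []
    for values in zip(*profiles):  # iterate over positions
        m = mean(values)
        sem = stdev(values) / sqrt(n)
        result.append((m - k * sem, m, m + k * sem))
    return result

def separated(env_a, env_b, pos):
    """True if the two envelopes do not overlap at position `pos`."""
    lo_a, _, hi_a = env_a[pos]
    lo_b, _, hi_b = env_b[pos]
    return lo_a > hi_b or lo_b > hi_a

# Hypothetical signal profiles: 4 replicates x 3 positions (flank, peak, flank).
wt = [[1.0, 5.0, 1.0], [1.2, 5.4, 1.1], [0.9, 4.8, 0.9], [1.1, 5.2, 1.0]]
ko = [[1.0, 3.0, 1.0], [1.1, 3.2, 1.1], [0.9, 2.9, 0.9], [1.0, 3.1, 1.0]]

env_wt = envelope(wt)
env_ko = envelope(ko)
peak_separated = separated(env_wt, env_ko, 1)
flank_separated = separated(env_wt, env_ko, 0)
```

      With these toy values the envelopes are separated at the peak but overlap in the flank, which mirrors the kind of pattern described for groups 1 to 3.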

      (21) Figure 5, GSEA panels: Please explain more about what the GSEA is in the legend.

      The legend has been updated as follows:

      (A) Expression profiles of post-meiotic H3.3-activated genes. The heatmap (left panel) displays the normalized expression levels of genes identified by Fontaine and colleagues as upregulated in the absence of histone H3.3 (Fontaine et al. 2022) for Atad2 WT (WT) and Atad2 KO (KO) samples at days 20, 22, 24, and 26 PP (D20 to D26). The colour scale represents the z-score of log-transformed DESeq2-normalized counts. The box plots (middle panel) display the pooled normalized expression levels, aggregated across replicates and genes, for each condition (WT and KO) and each time point (D20 to D26). Statistical significance between WT and KO conditions was determined using a two-sided t-test, with p-values indicated as follows: * for p-value<0.05, ** for p-value<0.01 and *** for p-value<0.001. The right panel shows the results of gene set enrichment analysis (GSEA), which assesses whether predefined groups of genes show statistically significant differences between conditions. Here, the post-meiotic H3.3-activated gene set, identified by Fontaine et al. (2022), is significantly enriched in Atad2 KO compared with WT samples at day 26 (p < 0.05, FDR < 0.25). Coloured vertical bars indicate the “leading edge” genes (i.e., those contributing most to the enrichment signal), located before the point of maximum enrichment score. (B) As in (A), but for the "post-meiotic H3.3-repressed genes" gene set. (C) As in (A), but for the "sex chromosome-linked genes" gene set.

      (22) Figure 6. In the KO, the number of green cells is more than red and yellow cells, suggesting the delayed maturation of green (TH2B-positive) cells. It is essential to count the number of each cell and show the quantification.

      The green cells correspond to those expressing TH2B but lacking transition proteins (TP) and protamine 1 (Prm1), indicating that they are at earlier stages than elongating–condensing spermatids. Counting these green cells simply reflects the ratio of elongating/condensing spermatids to earlier-stage cells, which varies depending on the field examined. The key point in this experiment is that in wild-type mice, only red cells (elongating/condensing spermatids) and green cells (earlier stages) are observed. By contrast, in Atad2 KO testes, a significant proportion of yellow cells appears, which are never seen in wild-type tissue. The crucial metric here is the percentage of yellow cells relative to the total number of elongating/condensing spermatids (red cells). In wild-type testes, this value is consistently 0%, whereas in Atad2 KO testes it always ranges between 50% and 100% across all fields containing substantial numbers of elongating/condensing spermatids.
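
      The metric described above can be written out as a short calculation. This is a sketch under stated assumptions: the field counts are invented, the function name is hypothetical, and the denominator is assumed to include both red-only and yellow cells, since yellow cells are taken here to be elongating/condensing spermatids that still retain TH2B.

```python
def percent_yellow(yellow, red_only):
    """Percentage of yellow cells among all elongating/condensing spermatids.

    Assumption: total elongating/condensing spermatids = red-only + yellow.
    Returns None for fields that contain no such cells.
    """
    total = yellow + red_only
    if total == 0:
        return None
    return 100.0 * yellow / total

# Hypothetical per-field counts (not from the study).
fields = [
    {"genotype": "WT", "yellow": 0, "red_only": 40},
    {"genotype": "KO", "yellow": 30, "red_only": 10},
    {"genotype": "KO", "yellow": 12, "red_only": 12},
]

percentages = [
    (f["genotype"], percent_yellow(f["yellow"], f["red_only"])) for f in fields
]
```

      Green-only cells do not enter the calculation, which is why the ratio is insensitive to how many earlier-stage cells happen to be present in a field.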

      (23) Figure 8A: Please show the images of sperm (heads) in the KO mice with or without decompaction.

      The requested image is now displayed in Figure S5.

      (24) Figure 8C: In the legend, it says n=5. However, there are more than 5 plots on the graph. Please explain the experiment more in detail.

      The experiment is now better explained in the legend of this Figure.

      Reviewer #2 (Recommendations for the authors): 

      While the study is rigorous and well performed, the following minor points could be addressed to strengthen the manuscript: 

      Figure 1C should indicate each of the different types of cells present in the sections. It would be of interest to show specifically the different post-meiotic germ cells.

      With this type of sample preparation, it is difficult to precisely distinguish the different cell types within the sections. Nevertheless, the staining pattern strongly indicates that most of the intensely stained cells are post-meiotic, situated near the tubule lumens and extending roughly halfway toward the basal membrane.

      In the absence of functional ATAD2, the accumulation of HIRA primarily occurs in round spermatids (Fig. 2B). If technically possible, it would be of great interest to show this by IHC of testis section. 

      Unfortunately, our antibody did not work satisfactorily in IHC.

      The increased of H3.3 signal in Atad2 KO spermatids (Fig. 3) is interpreted because of a reduced turnover. However, alternative explanations (e.g., H3.3 misincorporation or altered chaperone affinity) should not be ruled out. 

      The referee is correct that alternative explanations are possible. However, based on our previous work (Wang et al., 2021; PMID: 34580178), we demonstrated that in the absence of ATAD2, there is reduced turnover of HIRA-bound nucleosomes, as well as reduced nucleosome turnover, evidenced by the appearance of nucleosomes in regions that are normally nucleosome-free at active gene TSSs. We have no evidence supporting any other alternative hypothesis.

      In the MS the reduced accessibility at active genes (Fig. 4) is attributed to H3.3 overloading. However, global changes in histone acetylation (e.g., H4K5ac) or other remodelers in KO cells could be also consider.

      In fact, we meant that histone overloading could be responsible for the altered accessibility. This has been clearly demonstrated in S. cerevisiae in the absence of Yta7, the S. cerevisiae counterpart of ATAD2 (PMID: 25406467).

      In relation with the sperm compaction assay (Fig. 8A), the DTT/heparin/Triton protocol may not fully reflect physiological decompaction. This could be validated with alternative methods (e.g., MNase sensitivity). 

      The referee is right, but since this effect is subtle, as judged by the normal fertility of the mice, we doubt that milder approaches would reveal significant differences between wild-type and Atad2 KO sperm.

      It is surprising that despite the observed alterations in the genome organization of the sperm, the natural fertility of the KO mice is not affected (Fig. 8C). This warrants deeper discussion: Is functional compensation occurring (e.g., by p97/VCP)? Analysis of epididymal sperm maturation or uterine environment could provide insights.

      As detailed in the Discussion section, this work, together with our previous study (Wang et al., 2021; PMID: 34580178), highlights an overlooked level of regulation in histone chaperone activity: the release of chromatin-bound factors following their interaction with chromatin. This is an energy-dependent process, driven by ATP and the associated ATPase activity of these factors. Such activity could be mediated by various proteins, such as p97/VCP or DNAJC9–HSP70, as discussed in the manuscript, or by yet unidentified factors. However, most of these mechanisms are likely to occur during the extensive histone-to-histone variant exchanges of meiosis and post-meiotic stages. To the best of our knowledge, epididymal sperm maturation and the uterine environment do not involve substantial histone-to-histone or histone-to-protamine exchanges.

      The authors showed that MSCI genes present an enhancement of repression in the absence of ATAD2 by enhancing H3.3 function. It would be also of interest to analyze the behavior of the Sex body during its silencing (zygotene to pachytene) by looking at different markers (i.e., gamma-H2AX phosphorylation, Ubiquitylation etc). 

      The referee is correct that this is an interesting question. Accordingly, in our future work, we plan to examine the sex body in more detail during its silencing, using a variety of relevant markers, including those suggested by the reviewer. However, we believe that such investigations fall outside the scope of the present study, which focuses on the molecular relationship between ATAD2 and H3.3, rather than on the role of H3.3 in regulating sex body transcription. For a comprehensive analysis of this aspect, studies should primarily focus on the H3.3 mouse models reported by Fontaine and colleagues (PMID: 35766398).

      Fig. 6: Co-staining of TH2B/TP1/PRM1 is convincing but would benefit from quantification (% cells with overlapping signals).

      The green cells correspond to those expressing TH2B but lacking transition proteins (TP) and protamine 1 (Prm1), indicating that they are at earlier stages than elongating–condensing spermatids. Counting these green cells simply reflects the ratio of elongating/condensing spermatids to earlier-stage cells, which varies depending on the field examined. The key point is that in wild-type mice, only red cells (elongating/condensing spermatids) and green cells (earlier stages) are observed. By contrast, in Atad2 KO testes, a significant proportion of yellow cells appears, which are never seen in wild-type tissue. The crucial metric is the percentage of yellow cells relative to the total number of elongating/condensing spermatids (red cells). In wild-type testes, this value is consistently 0%, whereas in Atad2 KO testes it always ranges between 50% and 100% across all fields containing substantial numbers of elongating/condensing spermatids.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #2 (Public review):

      In the manuscript by Fu et al., the authors developed a chemo-immunological method for the reliable detection of Kacac, a novel post-translational modification, and demonstrated that acetoacetate and AACS serve as key regulators of cellular Kacac levels. Furthermore, the authors identified the enzymatic addition of the Kacac mark by acyltransferases GCN5, p300, and PCAF, as well as its removal by deacetylase HDAC3. These findings indicate that AACS utilizes acetoacetate to generate acetoacetyl-CoA in the cytosol, which is subsequently transferred into the nucleus for histone Kacac modification. A comprehensive proteomic analysis has identified 139 Kacac sites on 85 human proteins. Bioinformatics analysis of Kacac substrates and RNA-seq data reveal the broad impacts of Kacac on diverse cellular processes and various pathophysiological conditions. This study provides valuable additional insights into the investigation of Kacac and would serve as a helpful resource for future physiological or pathological research.

      The authors have made efforts to revise this manuscript and address my concerns. The revisions are appropriate and have improved the quality of the manuscript.

      We appreciate the constructive and thoughtful feedback, which has been invaluable in enhancing the quality of our manuscript.

      Reviewer #3 (Public review):

      Summary:

      This paper presents a timely and significant contribution to the study of lysine acetoacetylation (Kacac). The authors successfully demonstrate a novel and practical chemoimmunological method using the reducing reagent NaBH4 to transform Kacac into lysine β-hydroxybutyrylation (Kbhb).

      Thank you for the positive and insightful comments.

      Strengths:

      This innovative approach enables simultaneous investigation of Kacac and Kbhb, showcasing its potential in advancing our understanding of post-translational modifications and their roles in cellular metabolism and disease.

      We are grateful for the reviewer’s comments, which have contributed to enhancing the quality of our study.

      Weaknesses:

      The experimental evidence presented in the article is insufficient to fully support the authors' conclusions. In the in vitro assays, the proteins used appear to be highly inconsistent with their expected molecular weights, as shown by Coomassie Brilliant Blue staining (Figure S3A). For example, p300, which has a theoretical molecular weight of approximately 270 kDa, appeared at around 37 kDa; GCN5/PCAF, expected to be ~70 kDa, appeared below 20 kDa. Other proteins used in the in vitro experiments also exhibited similarly large discrepancies from their predicted sizes. These inconsistencies severely compromise the reliability of the in vitro findings. Furthermore, the study lacks supporting in vivo data, such as gene knockdown experiments, to validate the proposed conclusions at the cellular level.

      We appreciate the reviewer’s comments. In the biochemical assays, we used the expressed catalytic domains of the HATs, rather than the full-length proteins, for activity testing. Specifically, the following constructs were expressed and purified: p300 (1287-1666), GCN5 (499-663), PCAF (493-658), MOF (125-458), MOZ (497-780), MBP-MORF (361-716), Tip60 (221-512), HAT1 (20-341), and HBO1 (full length). This accounts for the discrepancies between the molecular weights observed in Figure S3A and those expected for the full-length proteins.
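
      A back-of-the-envelope calculation shows why truncated constructs migrate far below the full-length sizes. The sketch below assumes the common rule of thumb of roughly 110 Da per amino acid residue; this is an approximation, it ignores the actual sequence composition and any purification tags (including the MBP fusion on MORF), and the residue ranges are taken from the construct list above. HBO1 is omitted because its length is not stated.

```python
AVG_RESIDUE_MASS_DA = 110.0  # rule-of-thumb average; an assumption, not a measured value

# Residue ranges (inclusive) as listed for the catalytic-domain constructs.
constructs = {
    "p300": (1287, 1666),
    "GCN5": (499, 663),
    "PCAF": (493, 658),
    "MOF": (125, 458),
    "MOZ": (497, 780),
    "MORF": (361, 716),  # MBP tag not included in this estimate
    "Tip60": (221, 512),
    "HAT1": (20, 341),
}

def n_residues(start, end):
    """Number of residues in an inclusive range."""
    return end - start + 1

def approx_kda(start, end):
    """Approximate molecular weight in kDa from residue count alone."""
    return n_residues(start, end) * AVG_RESIDUE_MASS_DA / 1000.0

estimates = {name: approx_kda(*rng) for name, rng in constructs.items()}

for name, kda in estimates.items():
    print(f"{name}: ~{kda:.1f} kDa")
```

      Under this approximation the p300 fragment comes out at about 42 kDa and the GCN5 and PCAF fragments at about 18 kDa, the same order as the bands the reviewer describes and far below the full-length sizes.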

      Although a recent study (PMID: 37382194) reported the acetoacetyltransferase activities of p300 and GCN5 in cells, we recognize that additional knockdown experiments would be necessary to substantiate their contributions to in vivo Kacac generation and to explore the functional roles of Kacac in an enzyme-specific context. We plan to address these questions in our future work.

  2. Dec 2025
    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review)

      The Cx3cr1/EGFP line labels all myeloid cells, which makes it difficult to conclude that all observed behaviors are attributable to microglia rather than infiltrating macrophages. The authors refer to this and include it as a limitation. Nonetheless, complementary confirmation by additional microglia markers would strengthen their claims. 

      We appreciate the reviewer’s insightful comment regarding the cellular identity of the enveloping myeloid cells. As suggested, we performed triple co-immunostaining of SSLOW-infected Cx3cr1/EGFP mice using markers for neurons (NeuN), myeloid cells (IBA1), and resident microglia (TMEM119 or P2Y12). Because formic acid treatment used to deactivate prions abolishes the EGFP signal, we relied on IBA1 staining to identify the myeloid population. Our results confirmed that IBA1⁺ cells exhibiting the envelopment behavior are also TMEM119⁺ and P2Y12⁺, consistent with a resident microglial phenotype. These new data are presented in Figures S3 and S4 and described in the final section of the Results.

      Although the authors elegantly describe dynamic surveillance and envelopment hypothesis, it is unclear what the role of this phenotype is for disease progression, i.e., functional consequences. For example, are the neurons that undergo sustained envelopment more likely to degenerate? 

      We appreciate this important question regarding the functional implications of neuronal envelopment. At present, technical limitations prevent us from continuously tracking the fate of individual enveloped neurons in prion-infected mice. Nevertheless, our recent study demonstrated that P2Y12 knockout increases the prevalence of neuronal envelopment and accelerates disease progression (Makarava et al., 2025, J. Neuroinflammation). These findings suggest that while microglial envelopment may represent an adaptive response to increased neuronal surveillance demands, excessive envelopment, as observed in the absence of P2Y12, appears to be maladaptive. A new paragraph has been added to the Discussion to address this point.

      Moreover, although the increase in mobility is a relevant finding, it would be interesting for the authors to further comment on what the molecular trigger(s) is/are that might promote this increase. These adaptations, which are at least long-lasting, confer apparent mobility in the absence of external stimuli. 

      We thank the reviewer for this thoughtful suggestion. The molecular mechanisms underlying the increased mobility of microglia in prion-infected brains remain to be identified, and we plan to pursue this question in future studies. One possibility we briefly discuss in the revised manuscript is that proinflammatory signaling, mediated by secreted cytokines or interleukins, may drive this phenotype. Supporting this hypothesis, recent work has shown that IFNγ enhances microglial migration in the adult mouse cortex (doi:10.1073/pnas.2302892120). This work has been cited in the revised manuscript.

      The authors performed, as far as I could understand, the experiments in cortical brain regions. There is no clear rationale for this in the manuscript, nor is it clear whether the mobility is specific to a particular brain region. This is particularly important, as microglia reactivity varies greatly depending on the brain region. 

      We appreciate this insightful comment highlighting the importance of regional determinants of microglial reactivity, which indeed aligns with our ongoing research interests. In our previous studies, neuronal envelopment by microglia was observed consistently across all prion-affected brain regions exhibiting neuroinflammation. Assuming that envelopment requires microglial mobility, it is reasonable to speculate that microglia are mobile in all brain regions affected by prions and displaying neuroinflammatory responses. In the current study, we focused exclusively on the cortex because this region was used for quantifying the prevalence of neuronal envelopment as a function of disease progression in our prior work (DOI: 10.1172/JCI181169), which guided the present study design. Our ongoing investigations indicate that the prevalence of envelopment is region-dependent and correlates with microglial reactivity/the degree of neuroinflammation. In prion diseases, the degree of microglial reactivity is dictated by the tropism of specific prion strains to distinct brain regions. Notably, our prior studies have shown that strain-specific sialylation patterns of PrP<sup>Sc</sup> glycans play a key role in determining both regional strain tropism and the extent of neuroinflammatory activation (DOI: 10.3390/ijms21030828, DOI: 10.1172/JCI138677). In response to this comment, we have added a brief rationale for using the cortex in the Results section.

      It would be relevant information to have an analysis of the percentage of cells in normal, sub-clinical, early clinical, and advanced stages that became mobile. Without this information, the speed/distance alone can have different interpretations.

      We thank the reviewer for this valuable suggestion. The percentage of mobile cells across normal, sub-clinical, early clinical, and advanced disease stages is presented in Figure 3b and described in the final paragraph of the section “Enveloping behavior of reactive myeloid cells.”

      Reviewer #2 (Public review)

      The number of individual cells tracked has been provided, but not the number of individual mice. The sex of the mice is not provided. 

      We used N = 3 animals per group throughout the study; this information has now been added to the figure legends. Animals of both sexes were included in random proportions. The sex information is now listed for each experiment in the Animals subsection of the Methods.

      The statistical approach is not clear; was each cell treated as a single observation? 

      Yes, with the exception of the heat map in Figure 2d, all mobility parameters are analyzed and presented at the level of individual cells, with each cell treated as an independent observation. The primary aim of this study is to characterize behavioral patterns of single reactive myeloid cells. Analyzing data at the cell level allows us to capture the full distribution of cell behaviors and to preserve biologically meaningful heterogeneity within and across animals. By contrast, averaging values per animal would largely mask this variability. In the heat map in Figure 2d, data are averaged per animal, specifically to illustrate inter-animal variability within each group and to visualize changes across disease progression.

      The potential for heterogeneity among animals has not been addressed. 

      To address this concern, we now include a new Supplemental Figure (Figure S4) presenting the data using Superplots, in which individual cells are shown as dots, animal-level averages as circles, and group means calculated from the animal-level averages as black horizontal lines. These plots demonstrate that cell mobility measures are highly consistent across animals within each group, indicating limited inter-animal heterogeneity.
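
      The difference between cell-level and animal-level summaries can be sketched as follows. The animal labels and speed values are invented for illustration, and the function name is hypothetical; the point is only to show how a group mean computed from per-animal averages can differ from a mean pooled over all cells when animals contribute unequal numbers of cells.

```python
from statistics import mean

def superplot_summary(cells_by_animal):
    """Return (per-animal means, group mean over animals, pooled mean over cells)."""
    animal_means = {a: mean(v) for a, v in cells_by_animal.items()}
    group_mean = mean(list(animal_means.values()))
    pooled_mean = mean([x for v in cells_by_animal.values() for x in v])
    return animal_means, group_mean, pooled_mean

# Hypothetical cell speeds for three animals in one group (arbitrary units).
cells_by_animal = {
    "mouse1": [1.0, 2.0, 3.0],
    "mouse2": [2.0, 4.0],
    "mouse3": [4.0],
}

animal_means, group_mean, pooled_mean = superplot_summary(cells_by_animal)
```

      Here the animal-based group mean is 3.0 while the pooled cell-level mean is lower, because the animal with the most cells has the lowest values; plotting both levels together makes such effects visible.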

      Validation of prion accumulation at each clinical stage of the disease is not provided. 

      We now provide validation of PrP<sup>Sc</sup> accumulation across disease stages by Western blot, along with quantitative analysis, in a new Supplemental Figure (Figure S2). This confirms progressive PrP<sup>Sc</sup> accumulation with advancing disease.

      How were the numerous captures of cells handled to derive morphological quantitative values? Based on the videos, there is a lot of movement and shape-shifting.

      The following description has been added to Methods to clarify morphology analysis: For microglial morphology analysis, we quantified morphological parameters (radius, area, perimeter, and shape index) for individual EGFP⁺ cells in each time frame of the time-lapse recordings using the TrackMate 7.13.2 plugin in FIJI. Parameter values for each cell were then averaged across the entire three-hour imaging period to obtain a single mean value per cell.
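
      The per-cell averaging step can be illustrated with a short sketch. It operates on a hypothetical table of per-frame measurements (for example, rows exported from a tracking tool) and does not call any TrackMate or FIJI API; the column names `cell_id`, `frame` and `area`, and the values, are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean

def mean_per_cell(rows, key):
    """Average one morphological parameter over all frames for each cell."""
    by_cell = defaultdict(list)
    for row in rows:
        by_cell[row["cell_id"]].append(row[key])
    return {cell: mean(values) for cell, values in by_cell.items()}

# Hypothetical per-frame measurements for two cells over three frames.
rows = [
    {"cell_id": "c1", "frame": 0, "area": 100.0},
    {"cell_id": "c1", "frame": 1, "area": 120.0},
    {"cell_id": "c1", "frame": 2, "area": 110.0},
    {"cell_id": "c2", "frame": 0, "area": 80.0},
    {"cell_id": "c2", "frame": 1, "area": 60.0},
    {"cell_id": "c2", "frame": 2, "area": 70.0},
]

mean_area = mean_per_cell(rows, "area")
```

      Each cell then contributes a single mean value per parameter, regardless of how much its shape changes from frame to frame.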

      While it is recognized that there are limits to what can be measured simultaneously with live imaging, the authors appear to have fixed tissues from each time point too - it would be very interesting to know if the extent or prion accumulation influences the microglial surveillance - i.e., do the enveloped ones have greater pathology. 

      This is a very interesting question, but it is difficult to answer because of the technical challenges in monitoring the pathology or fate of individual neurons as a function of their envelopment in live prion-infected animals. Our previous work revealed that both the accumulation of total PrP<sup>Sc</sup> in the brain and the accumulation of PrP<sup>Sc</sup> specifically in lysosomal compartments of microglia due to phagocytosis precede the onset of neuronal envelopment (DOI: 10.1172/JCI181169). Moreover, the onset of neuronal envelopment occurred after a noticeable decline in neuronal levels of Grin1, a subunit of the NMDA receptor essential for synaptic plasticity. Reactive microglia were observed to envelop Grin1-deficient neurons, suggesting that microglia respond to neuronal dysfunction. However, considering that envelopment is very dynamic and, in most cases, reversible, correlating the degree of envelopment with dysfunction of individual neurons is technically challenging.

      Recommendations for the authors

      Reviewer #1 (Recommendations for the authors): 

      (1) I recommend performing additional immunostaining using microglial markers to address specificity. 

      These new data showing immunostaining for markers of resident microglia TMEM119 and P2Y12 are presented in Figures S6 and S7 and described in the final section of the Results.

      (2) The authors can at least further discuss the functional consequences of their findings in further detail. 

      A new paragraph has been added to the Discussion to address this point.

      (3) Quantify the % of cells that become mobile in the different conditions. 

      The percentage of mobile cells across normal, sub-clinical, early clinical, and advanced disease stages is presented in Figure 3b and described in the final paragraph of the section “Enveloping behavior of reactive myeloid cells.”

      (4) Improve method details on the brain regions used and further expand the statistical section. 

      We have expanded the Statistical Analysis section to indicate whether statistical comparisons and mean values were calculated at the single-cell level or the animal level for each analysis. The specific statistical tests used and the number of animals (N) are now reported in the corresponding figure legends. The sex of animals is provided in Table 1 (Methods). Only the cortical region was examined in this study; this information is stated in the Methods and is now also noted in the figure legends for clarity.

      Reviewer #2 (Recommendations for the authors): 

      (1) More details on members of the P2Y receptor family expressed in microglia would be helpful. The study highlights a previously published prion-induced decline in the expression of P2Y12, a microglial marker that is required for intracellular neuron-microglial contacts, and P2Y6, involved in calcium transients, which is required for hypermotility. How are members of this family of receptors regulated at the gene and/or protein level in microglia and, given their responsiveness to nucleotide ligands, are other members implicated in the properties being quantified here?

      We appreciate the reviewer’s insightful comment. To address this point, we examined the expression of multiple P2Y receptors and ATP-gated P2X channels known to contribute to microglial surveillance, activation, motility, and phagocytosis, alongside the activation markers Tlr2, Cd68, and Trem2. Bulk brain transcript analyses indicated that all examined genes were upregulated in SSLOW-infected mice relative to controls (new Figure S5a). However, because microglial proliferation substantially increases microglial numbers during prion disease progression, bulk tissue measurements do not necessarily reflect per-cell expression levels. Therefore, we normalized gene expression values to the microglia-specific marker Tmem119, whose per-cell expression remains stable across disease stages (Makarava et al., 2025, J. Neuroinflammation). After normalization, Tlr2, Cd68, and Trem2 were increased approximately 10-, 6-, and 4-fold, respectively. In contrast, P2 receptor genes showed more modest changes: P2ry6 increased ~3-fold, P2ry13 ~2-fold, and P2rx7 ~1.3-fold, while P2rx4 remained unchanged (Figure S5a). Within the scope of the present study, we focused on P2Y6 due to (i) its role in regulating calcium transients, (ii) the magnitude of its upregulation relative to other P2 receptors, and (iii) its highly microglia-specific expression in the CNS. We note that currently available commercial P2Y6 antibodies lack sufficient specificity, making reliable assessment of protein-level expression challenging.
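
      The normalization described above can be illustrated with a minimal sketch. It assumes that the correction is a simple ratio of bulk fold changes (gene of interest divided by the microglia marker); the response does not state the exact computation used, and the function name, gene labels, and numbers below are hypothetical placeholders rather than data from the study.

```python
def per_cell_fold_change(bulk_fold_gene, bulk_fold_marker):
    """Approximate per-cell fold change of a gene.

    Assumption: the marker's per-cell expression is stable, so its bulk
    fold change tracks the change in cell number. Dividing by it removes
    the contribution of cell proliferation from the bulk measurement.
    """
    if bulk_fold_marker <= 0:
        raise ValueError("marker fold change must be positive")
    return bulk_fold_gene / bulk_fold_marker


# Hypothetical bulk fold changes (infected vs. control); NOT study values.
bulk = {"Marker": 2.0, "GeneA": 6.0, "GeneB": 2.0}

normalized = {
    gene: per_cell_fold_change(value, bulk["Marker"])
    for gene, value in bulk.items()
    if gene != "Marker"
}

for gene, value in normalized.items():
    print(f"{gene}: {value:.1f}-fold per cell")
```

      In this toy case, a gene whose bulk signal rises only as much as the marker (GeneB) shows no per-cell change, which is the situation the response describes for P2rx4.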

      (2) Is P2Y6 expressed in any other cell type that might account for the blunted mobility of the microglia? The authors mention P2Y12 also identifies the GFP cells; however, it would be beneficial to highlight the specificity of the target in the ex vivo treatment of the infected slices.

      In the brain, both P2Y12 and P2Y6 are considered highly specific to resident microglia under physiological and neuroinflammatory conditions. P2Y12 is, in fact, widely used as a canonical marker of homeostatic and resident microglia. While P2Y6 is also expressed in peripheral myeloid cells such as macrophages, our phenotypic characterization indicates that the cells exhibiting neuronal envelopment are TMEM119⁺ and P2Y12⁺, consistent with a resident microglial identity. These data, including new analyses added to the revised manuscript, support that the cells responding to P2Y6 signaling in our ex vivo slice experiments are resident microglia.

      (3) The fluorescent mouse lacks Cx3cr1 - have the authors investigated why there were no apparent consequences, at least in the context of prion infection? Are there functional redundancies that might be harnessed? Does this impact the generalizability of the findings here?

      The role of Cx3cr1 in prion disease has been directly examined in two independent studies (doi: 10.1099/jgv.0.000442; doi: 10.1186/1471-2202-15-44). One study reported no effect of Cx3cr1 deficiency on disease incubation time, whereas the other observed only a minor difference. Importantly, both studies found no detectable alterations in microglial activation patterns, cytokine expression, or PrP<sup>Sc</sup> deposition in Cx3cr1-deficient mice compared to wild-type controls. Our own data (Figure S1) are consistent with these findings: disease course and PrP<sup>Sc</sup> deposition were comparable between Cx3cr1/EGFP and wild-type mice. Moreover, we observed reactive microglial envelopment of neurons in both genotypes. Microglia isolated from SSLOW-infected Cx3cr1/EGFP mice also displayed similarly elevated mobility in vitro, in agreement with our previous observations of high mobility of microglia isolated from SSLOW-infected wild-type mice (Makarava et al., 2025, J. Neuroinflammation). Taken together, these results indicate that Cx3cr1 is not a key determinant of reactive microglial mobility or envelopment behavior in prion disease. Thus, the use of the Cx3cr1/EGFP reporter line does not compromise the generalizability of our conclusions.

      (4) The distinction between high mobility and low mobility microglia is interesting - is there any evidence to suggest that the slow-moving microglia are actually a separate class - do enveloping microglia exhibit both mobility states - can the authors comment on plasticity here? 

      We appreciate this insightful comment, which closely aligns with our ongoing interests. At present, we do not have evidence to support that high- versus low-mobility microglia represent distinct molecular phenotypes. Given that our time-lapse imaging spans only a three-hour window, it remains unclear whether these mobility states reflect stable cell-intrinsic properties or transient phases within a dynamic surveillance process. Notably, we observed that individual cells can transition between more stationary, neuron-associated states and highly mobile states within the same imaging session. In future work, we intend to investigate whether prolonged interactions with neuronal somas or other microenvironmental cues may drive diversification of reactive myeloid cell phenotypes.

      (5) In the discussion, the authors speculate about "collective coordinated decision making" - that seems a stretch unless greater context is provided. The fact that several microglia can be found in contact with an individual neuron and that each microglia can connect with multiple neurons simultaneously is certainly interesting; however, evidence for hive behavior is entirely lacking.

      We agree with the reviewer that our previous wording overstated the interpretation. The statement regarding collective decision-making has been removed.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment:

      This is an important study, supported by solid to convincing data, that suggests a model for diet selection in C. elegans. The significance is that while C. elegans has long been known to be attracted to bacterial volatiles, what specific bacterial volatiles may signify to C. elegans is largely unknown. This study also provides evidence for a possible odorant/GPCR pairing.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Siddiqui et al., investigate the question of how bacterial metabolism contributes to the attraction of C. elegans to specific bacteria. They show that C. elegans prefers three bacterial species when cultured in a leucine-enriched environment. These bacterial species release more isoamyl alcohol, a known C. elegans attractant, when cultured with leucine supplement than without leucine supplement. The study shows correlative evidence that isoamyl alcohol is produced from leucine by the Ehrlich pathway. In addition, they show that SRD-12 (SNIF-1) is likely a receptor for isoamyl alcohol because a null mutant of this receptor exhibits lower chemotaxis to isoamyl alcohol and lower preference for leucine-enriched bacteria.

      Strengths:

      (1) This study takes a creative approach to examine the question of what specific volatile chemicals released by bacteria may signify to C. elegans by examining both bacterial metabolism and C. elegans preference behavior. Although C. elegans has long been known to be attracted to bacterial metabolites, this study may be one of the first to examine the role of a specific bacterial metabolic pathway in mediating attraction.

      (2)  A strength of the paper is the identification of SRD-12 (SNIF-1) as a likely receptor for isoamyl alcohol. The ligands for very few olfactory receptors have been identified in C. elegans and so this is a significant addition to the field. The srd-12 (snif-1) null mutant strain will likely be a useful reagent for many labs examining olfactory and foraging behaviors.

      Weaknesses:

      (1) The authors write that the leucine metabolism via the Ehrlich pathway is required for the production of isoamyl alcohol by three bacteria (CEent1, JUb66, BIGb0170), but their evidence for this is correlation and not causation. They write that the gene ilvE is a bacterial homolog of the first gene in the yeast Ehrlich pathway (it would be good to include a citation for this) and that the gene is present in these three bacterial strains. In addition, they show that this gene, ilvE, is upregulated in CEent1 bacteria upon exposure to leucine. To show causation, they need to knockout ilvE from one of these strains, show that the bacteria does not have increased isoamyl alcohol production when cultured on leucine, and that the bacteria is no longer attractive to C. elegans.

      Thank you for the comment. We have added the appropriate citation [1,2]. We agree that testing worms’ diet preference for the preferred strains upon ilvE knockout would further strengthen the claim that IAA is used as a proxy for a leucine-enriched diet. Currently, protocols and tools for genetic manipulation of CeMbio strains are not available, making this experiment infeasible at this time.

      (2) The authors examine three bacterial strains that C. elegans showed increased preference when grown with leucine supplementation vs. without leucine supplementation. However, there also appears to be a strong preference for another strain, JUb0393, when grown on plus leucine (Figure 1B). It would be good to include statistics and criteria for selecting the three strains.

      Thanks for your comment. We agree that for Pantoea nemavictus, JUb393, worms seem to prefer the leucine supplemented (+ LEU) bacteria over unsupplemented (-LEU). However, when given a choice between the individual CeMbio bacteria and E. coli OP50, worms showed preference for only CEent1, JUb66, and BIGb0170 (Figure 1F). Consequently, CEent1, JUb66, and BIGb0170 were selected for further analyses. We have included statistics for Figure 1B-C and Figure S1A-G with details mentioned in the figure legend. 

      (3) Although the behavioral evidence that srd-12 (snif-1) gene encodes a receptor for isoamyl alcohol is compelling, it does not meet the standard for showing that it is an olfactory receptor in C. elegans. To show it is indeed a likely receptor one or more of the following should be done:

      (a) Calcium imaging of AWC neurons in response to isoamyl alcohol in the receptor mutant with the expectation that the response would be reduced or abolished in the mutant compared to wildtype.

      (b) "A receptor swap" experiment where the SRD-12 (SNIF-1) receptor is expressed in the AWB repulsive neuron in the SRD-12 (SNIF-1) receptor mutant background, with the expectation that with the receptor swap C. elegans will now be repulsed from isoamyl alcohol in chemotaxis assays (experiment from Sengupta et al., 1996 odr-10 paper).

      Thanks for all your comments and suggestions. While the lab currently does not have the necessary expertise to conduct calcium imaging of neurons, we have performed additional experiments to confirm the requirement of AWC neurons for SNIF-1 function. We generated transgenic worms with extrachromosomal arrays expressing snif-1 under (a) an AWC-specific promoter, odr-1, and (b) an AWB-specific promoter, str-1. As shown in new panel 6H in the revised manuscript and Author response image 1, we found that overexpression of snif-1 in AWC neurons completely rescues the chemotaxis defect of the snif-1 mutant (referred to as VSL2401), whereas upon the “receptor swap” in AWB neurons, IAA is sensed as a repellent.

      Author response image 1.

      (A) Chemotaxis index (CI) of WT, VSL2401, VSL2401 [AWCp::snif-1] and VSL2401 [AWBp::snif-1] worms to IAA at 1:1000 dilution. Significant differences are indicated as **** P ≤ 0.0001 determined by one-way ANOVA followed by post hoc Dunnett’s multiple comparison test. Error bars indicate SEM (n≥15).
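
      For readers unfamiliar with the metric, the sketch below shows how a chemotaxis index is commonly computed. It assumes the conventional definition (worms at the odor minus worms at the control, divided by the worms counted); the authors' exact formula and counting rules are not given in this response, and the function name and counts are hypothetical.

```python
def chemotaxis_index(n_odor, n_control, n_total=None):
    """Conventional chemotaxis index, ranging from -1 to +1.

    n_odor    -- worms counted at the odor spot
    n_control -- worms counted at the control spot
    n_total   -- worms scored in total; defaults to n_odor + n_control
                 (assumption: some protocols also count worms elsewhere
                 on the plate in the denominator)
    """
    if n_total is None:
        n_total = n_odor + n_control
    if n_total <= 0:
        raise ValueError("no worms were scored")
    if n_odor + n_control > n_total:
        raise ValueError("n_total is smaller than the worms counted at the two spots")
    return (n_odor - n_control) / n_total


# Hypothetical counts, for illustration only.
attracted = chemotaxis_index(80, 20)   # positive value: attraction
repelled = chemotaxis_index(10, 40)    # negative value: repulsion
print(f"attraction CI = {attracted:+.2f}, repulsion CI = {repelled:+.2f}")
```

      Under this convention, a positive index indicates attraction and a negative index indicates repulsion, which is how a receptor-swap result of the kind described above would register.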

      (4) The authors conclude that C. elegans cannot detect leucine in chemotaxis assays. It is important to add the method for how leucine chemotaxis assay was done in order to interpret these results. Because leucine is not volatile if leucine is put on the plates immediately before the worms are added (as in a traditional odor chemotaxis assay), there is no leucine gradient for the worm to detect. It would be good to put leucine on the plate several hours before worms are introduced so worms have the possibility to be able to detect the gradient of leucine (for example, see Wakabayashi et al., 2009).

      Previously, the chemotaxis assays with leucine were performed like traditional odor chemotaxis assays. We have now also performed the chemotaxis assay as detailed in Shingai et al. 2005 [3]: leucine was spotted on the assay plates 5 hours prior to the introduction of worms. As shown in new panel S1H in the revised manuscript, wild-type worms do not respond to leucine in the modified chemotaxis assay.

      We have included the experimental details for leucine chemotaxis assays in the revised manuscript.  

      (5) The bacterial preference assay entitled "odor-only assay" is a misleading name. In the assay, C. elegans is exposed to both volatile chemicals (odors) and non-volatile chemicals because the bacteria are grown on the assay plate for 12 hours before the worms are introduced to the assay plate. In that time, the bacteria is likely releasing non-volatile metabolites into the plate which may affect the worm's preference. A true odor-only assay would have the bacteria on the lid and the worms on the plate.

      The ‘odor-only’ diet preference assay does not allow for non-volatile chemicals to reach worms. We achieved this by using tripartite dishes where the compartments containing worms and bacterial odors are separated by polystyrene barriers. At the time of the assay, worms were spotted in a separate compartment from that of bacteria (as shown in schematic 1A). The soluble metabolites released by the bacteria during their growth will accumulate in the agar within the bacterial compartment alone such that worms only encounter the volatile metabolites produced by bacteria wafting past the polystyrene barrier.

      (6) The findings of the study should be discussed more in the context of prior literature. For example, AWC neurons have been previously shown to be involved in bacterial preference (Harris et al., 2014; Worthy et al., 2018). In addition, CeMbio bacterial strains (the strains examined in this study) have been previously shown to release isoamyl alcohol (Chai et al. 2024).

      Thanks for the suggestion. We have modified the Discussion section to discuss the study in the light of relevant prior literature.  

      Reviewer #2 (Public review):

      Summary:

      Siddiqui et al. show that C. elegans prefers certain bacterial strains that have been supplemented with the essential amino acid (EAA) leucine. They convincingly show that leucine supplementation stimulates the production of isoamyl alcohol (IAA) in some bacteria. IAA is an attractive odorant that is sensed by the AWC. The authors identify a receptor, SRD-12 (SNIF-1), that is expressed in the AWC chemosensory neurons and is required for chemotaxis to IAA. The authors propose that IAA is a predominant olfactory cue that determines diet preference in C. elegans. Since leucine is an EAA, the authors propose that IAA sensing provides the worm with a proxy mechanism to identify EAA-rich diets.

      Strengths:

      The authors propose IAA as a predominant olfactory cue that determines diet preference in C. elegans providing molecular mechanism underlying diet selection. They show that wild isolates of C. elegans have a strong chemotactic response to IAA indicating that IAA is an ecologically relevant odor for the worm. The paper is well written, and the presented data are convincing and well organized. This is an interesting paper that connects chemotactic response with bacterially produced odors and thus provides an understanding of how animals adapt their foraging behavior through the perception of molecules that may indicate the nutritional value.

      Weaknesses:

      Major:

      While I do like the way the authors frame C. elegans IAA sensing as mechanisms to identify leucine (EAA) rich diets it is not fully clear whether bacterial IAA production is a proxy for bacterial leucine levels.

      (1) Can the authors measure leucine (or other EAA) content of the different CeMbio strains? This would substantiate the premise in the way they frame this in the introduction. While the authors convincingly show that leucine supplementation induces IAA production in some strains, it is not clear if there are lower leucine levels in the different non-preferred strains.

      Thanks for your suggestion. Estimating leucine levels in various bacteria will provide useful information, and we hope to do so in future studies.

      (2) It is not clear whether the non-preferred bacteria in Figure 1A and 1B have the ability to produce IAA. To substantiate the claim that C. elegans prefers CEent1, JUb66, and BIGb0170 due to their ability to generate IAA from leucine, it would help to measure IAA levels in non-preferred bacteria (+ and - leucine supplementation). If the authors have these data it would be good to include this.

      Thanks for the suggestion. We have included a table indicating the presence or absence of IAA production by all the bacteria under + LEU and – LEU conditions (Table S2). Some of the non-preferred bacteria indeed produce isoamyl alcohol. However, the abundance of IAA in these strains is significantly lower than in the preferred bacteria.

      Using the available genomic sequence data, we found that all CeMbio strains encode IlvE-like transaminase enzymes[4]. This suggests that presumably all the bacteria have the metabolic capacity to make alpha-ketoisocaproate (an intermediate in IAA biosynthetic pathway) from leucine. However, the regulation of metabolic flux is likely to be quite complex in various bacteria.  

      (3) The authors would strengthen their claim if they could show that deletion or silencing ilvE enzyme reduces IAA levels and eliminates the increased preference upon leucine supplementation.

      We agree that testing worms’ diet preference for the preferred strains upon ilvE knockout would further strengthen the claim that IAA is crucial for finding a leucine-enriched diet. Currently, the lab does not have the necessary expertise and standardized protocols to perform genetic manipulations in the CeMbio strains.

      (4) While the three preferred bacteria possess the ilvE gene, it is not clear whether this enzyme is present in the other non-preferred bacterial strains. As far as I know, the CeMbio strains have been sequenced so it should be easy to determine if the non-preferred bacteria possess the capacity to make IAA. Does the expression of ilvE in e.g. E. coli increase its preference index or are the other genes in the biosynthesis pathway missing?

      Thanks for the suggestion. Using the available genomic sequence data, we find that all the bacteria in the CeMbio collection possess an IlvE-like transaminase necessary for the synthesis of alpha-ketoisocaproate, a key metabolite in leucine turnover as well as a precursor of IAA [4]. E. coli has an IlvE-encoding gene in its genome [2]. However, we do not find IAA in the headspace of E. coli either with or without leucine supplementation. This indicates that either (i) E. coli lacks enzymes for subsequent steps in IAA biosynthesis or (ii) the leucine provided under the experimental regime is not sufficient to shift the metabolic flux to IAA production.

      Previous studies have suggested that in yeast, the final two steps leading to IAA production are catalyzed by decarboxylase and dehydrogenase enzymes [1]. The genomic and metabolic flux data available for CeMbio do not describe specific enzymes leading up to IAA synthesis [4].

      (5) It is strongly implied that leucine-rich diets are beneficial to the worm. Do the authors have data to show the effect of leucine supplementation on C. elegans healthspan, lifespan or brood size?

      Edwards et al. 2015 reported a 15% increase in the lifespan of worms upon 1 mM leucine supplementation [5]. Wang et al 2018 also showed lifespan extension upon 1 mM and 10 mM leucine supplementation. They also reported that while leucine supplementation did not have any effect on brood size, it did make worms more resistant to heat, paraquat, and UV-stress [6]. These studies have been included in the discussion section.

      Other comments:

      Page 6. Figure 2c. While the authors' conclusions are correct based on AWC expts., it would be good at this stage to include the possibility that odors that are enriched in the absence of leucine may be aversive.

      Thanks for the comment. We have tested the chemotaxis response of the worms for most of the odors produced by CeMbio strains without leucine supplementation. We did not find any odor that is aversive to worms. However, we cannot completely rule out the possibility that a low abundance of aversive odor in the headspace of the bacteria was missed.

      Interestingly, we did identify 2-nonanone, a known repellent, in the headspace of the preferred bacteria upon leucine supplementation. However, the abundance of 2-nonanone in the headspace of the bacteria is relatively low (less than 1% for CEent1 and JUb66, and ~10% for BIGb0170). This suggests that the relative abundance of odors in an odor bouquet may be a relevant factor in determining worms’ preference.

      Page 6. IAA increases 1.2- to 4-fold upon leucine supplementation. If the authors perform a chemotaxis assay with just IAA with 1-2-4 fold differences do you get the shift in preference index as seen with the bacteria? i.e. is the difference in IAA concentration sufficient to explain the shift in bacterial PI upon leucine supplementation? Other attractants such as Acetoin and isobutanol go up in -Leu conditions.

      Thanks for the suggestion. As shown in Figure S2H and S2I, when given a choice between a concentration of IAA (1:1000 dilution) attractive to worms and a 4-fold higher amount of IAA, worms chose the latter. This result suggests that worms can distinguish between relatively small differences in IAA concentration.

      We agree that the relative abundance of Acetoin and Isobutanol is high in -LEU conditions. The presence of other attractants in - LEU conditions should skew the preference of worms for – LEU bacteria. However, we found that worms prefer + LEU bacteria (Figure 1B), suggesting that the abundance of IAA mainly influences the diet preference of the worms.  

      Page 14-15. The authors identify a putative IAA receptor based on expression studies. I compliment the authors for isolating two CRISPR deletion alleles. They show that the srd-12 (snif-1) mutants have obvious defects in IAA chemotaxis. Very few ligand-odorant receptor combinations have been identified so this is an important discovery. CenGen data indicate that srd-12 (snif-1) is expressed in a limited set of neurons. Did the authors generate a reporter to show the expression of srd-12 (snif-1)? This is a simple experiment that would add to the characterization of the SRD-12 (SNIF-1) receptor. Rescue experiments would be nice even though the authors have independent alleles. To truly claim that SRD-12 (SNIF-1) is the receptor for IAA and activates the AWC neurons would require GCaMP experiments in the AWC neuron or a heterologous expression system. I understand that GCaMP imaging might not be part of the regular arsenal of the lab but it would be a great addition (even in collaboration with one of the many labs that do this regularly). Comparing AWC activity using GCaMP in response to IAA-producing bacteria with high leucine levels in both wild-type and SRD-12 (SNIF-1) deficient backgrounds would further support their narrative. I leave that to the authors.

      Thanks for your comments and suggestions. To address this comment, we rescued the snif-1 mutant (referred to as VSL2401) with extrachromosomal arrays expressing snif-1 under an AWC-specific promoter as well as under its native promoter. As shown in Figure 6H and Author response image 2, we find that both transgenic lines show a complete rescue of the chemotaxis response to isoamyl alcohol. To find where snif-1 is expressed, we generated a transgenic line of worms expressing GFP under the snif-1 promoter and mCherry under the odr-1 promoter (to mark AWC neurons). As shown in Figure 6I, we found that snif-1 is expressed faintly in many neurons, with strong expression in one of the two AWC neurons marked by odr-1::mCherry. This result suggests that SNIF-1 is expressed in an AWC neuron.

      We hope to perform GCaMP assays and further characterization of SNIF-1 in the future.

      Author response image 2.

      Chemotaxis index (CI) of WT, VSL2401, VSL2401 [AWCp:: snif-1] and VSL2401 [snif-1p::snif-1] worms to IAA at 1:1000 dilution. Significant differences are indicated as **** P ≤ 0.0001 determined by one-way ANOVA followed by post hoc Dunnett’s multiple comparison test. Error bars indicate SEM (n≥15).

      Minor:

      Page 4 "These results suggested that worms can forage for diets enriched in specific EAA, leucine...." More precise at this stage would be to state " These results indicated that worms can forage for diets supplemented with specific EAA...".

      We have changed the statement in the revised manuscript.

      Page 5."these findings suggested that worms not only rely on odors to choose between two bacteria but also to find leucine enriched bacteria" This statement is not clear to me and doesn't follow the data in Fig. S2. Preferred diets in odorant assays are the IAA producing strains.

      Thanks for your comment. We have revised the manuscript to make this clear: “Altogether, these findings suggested that worms rely on odors to distinguish different bacteria and find leucine-enriched bacteria”. This statement summarizes all the data shown in Figure 1 and Figure S1.

      Page 5. Figure S2A provides nice and useful data that can be part of the main Figure 1.

      Thanks for the comment. We have incorporated the data from Figure S2A to main Figure 1.

      Reviewer #3 (Public review):

      Summary:

      The authors first tested whether EAA supplementation increases olfactory preference for bacterial food for a variety of bacterial strains. Of the EAAs, they found only leucine supplementation increased olfactory preference (within a bacterial strain), and only for 3 of the bacterial strains tested. Leucine itself was not found to be intrinsically attractive.

      They determined that leucine supplementation increases isoamyl alcohol (IAA) production in the 3 preferred bacterial strains. They identify the biochemical pathway that catabolizes leucine to IAA, showing that a required enzyme for this pathway is upregulated upon supplementation.

      Consistent with earlier studies, they find that AWC olfactory neuron is primarily responsible for increased preference for IAA-producing bacteria.

      They tested volatile compounds produced by bacteria and identified by GC/MS, and found several to be attractive, most of which require AWC for the full effect. Adaptation assays were used to show that odorant levels produced by bacterial lawns were sufficient to induce olfactory adaptation, and adaptation to IAA reduced chemotaxis to leucine-supplemented lawns. They then showed that IAA attractiveness is conserved across wild strains, while other compounds are more variable, suggesting IAA is a principal foraging cue.

      Finally, using the CeNGEN database, they developed a list of candidate IAA receptors. Using behavioral tests, they show that mutation of srd-12 (snif-1) greatly impairs IAA chemotaxis without affecting locomotion or attraction to another AWC-sensed odor, PEA.

      Comments

      This study will be of great interest in the field of C. elegans behavior, chemical senses and chemical ecology, and understanding of the sensory biology of foraging.

      Strengths:

      The identification of a receptor for IAA is an excellent finding. The combination of microbial metabolic chemistry and the use of natural bacteria and nematode strains makes an extremely compelling case for the ecological and adaptive relevance of the findings.

      Weaknesses:

      AWC receives synaptic input from other chemosensory neurons, and thus could potentially mediate navigation behaviors to compounds detected in whole or in part by those neurons. Language concluding detection by AWC should be moderated (e.g. p9 "worms sense an extensive repertoire...predominantly using AWC") unless it has been demonstrated.

      Thanks for your comment. We have modified the manuscript to incorporate the suggestion.

      srd-12 (snif-1) is not exclusively expressed in AWC. Normally, cell-specific rescue or knockdown would be used to demonstrate function in a specific cell. The authors should provide such a demonstration or explain why they are confident srd-12 (snif-1) acts in AWC.

      Thanks for the comment. We have performed AWC-specific rescue of snif-1 in mutant worms. As shown in Figure 6H, we found that AWC neuron-specific rescue completely recovered the chemotaxis defect of the snif-1 mutant (referred to as VSL2401) for IAA. In addition, snif-1 is expressed in one of the AWC neurons.

      A comparison of AWC's physiological responses between WT and srd-12 (snif-1), preferably in an unc-13 background, would be nice. Even further, the expression of srd-12 (snif-1) in a different neuron type and showing that it confers responsiveness to IAA (in this case, inhibition) would be very convincing.

      Thanks for the suggestion. We have performed a receptor swap experiment, where snif-1 is misexpressed in AWB neurons. We find that these worms show slight but significant repulsion to IAA compared to WT and snif-1 mutant worms (Author response image 1).

      Recommendations for the authors:

      Reviewing Editor:

      Please consider all of the reviewer comments. In particular, as noted in the individual reviews, the strength of the evidence would be bolstered by additional experiments to demonstrate that the ilvE enzyme affects IAA levels in the preferred bacteria. The reviewers note that the authors haven't shown that IAA production is a reflection of leucine content. Are the non-preferred bacteria low on leucine, or do they lack ilvE or IAA synthesis pathways? Further, more direct evidence that SRD-12 (SNIF-1) is in fact the primary IAA receptor would further strengthen the study. The authors should also be aware that geographic distance for wild isolate C. elegans may not directly correlate with phylogenetic distance. This should be assessed/discussed for the strains used.

      Thanks for the suggestions. Some of these have been addressed in response to reviewers. Thanks for your comments about possible disconnect between geographical and phylogenetic distances amongst natural isolates used here.

      By analyzing the phylogenetic tree generated with the neighbor-joining algorithm, available in the CaeNDR database, we found that QX1211 and JU3226 are phylogenetically close, whereas the remaining isolates fall into different clades separated by long phylogenetic distances [7,8].

      Reviewer #1 (Recommendations for the authors):

      (1) In the first sentence of the third paragraph of the introduction, C. elegans are described as "soil-dwelling." Although C. elegans has been described as soil-dwelling in the past, current research indicates they are most often found on rotten fruit, compost heaps and other bacterial-rich environments, not soil. "All Caenorhabditis species are colonizers of nutrient- and bacteria-rich substrates and none of them is a true soil nematode." from Kiontke, K. and Sudhaus, W. Ecology of Caenorhabditis species (WormBook).

      Your specific comment about C. elegans’ habitat is well received. However, in that sentence we are referring to the chemosensory system of soil-dwelling animals in general, and not particularly C. elegans.

      (2) Figure 3K, the model would be clearer if leucine-rich diet -> volatile chemicals ->AWC (instead of leucine-rich diet -> AWC <- volatile chemicals). The leucine-rich diet results in the production of volatile chemicals which are detected by AWC.

      We have modified the figure to make it clearer.

      (3) Figure 4 - it would help to include a table summarizing the volatile chemicals that each bacteria releases. Then the reader could more easily evaluate whether the adaptation to each specific odor is consistent with the change in preference for the specific bacteria based on what it releases in its headspace. In addition, Figure 4 would help to clarify whether bacteria in these experiments were cultured with or without leucine supplementation.

      Table S2 summarizes the odors released by all the bacteria under + LEU and – LEU conditions.

      In Figure 4, adaptation was performed with the odors of bacteria cultured under leucine-unsupplemented conditions.

      Reviewer #2 (Recommendations for the authors):

      Page 9. Previous studies, e.g. Bargmann, Hartwieg, and Horvitz, have shown that IAA is sensed by AWC. It would be good to cite them appropriately.

      Thanks for the comment. The reference has been cited at p9 and p16.

      References:

      (1) Yuan, J., Mishra, P., and Ching, C.B. (2017). Engineering the leucine biosynthetic pathway for isoamyl alcohol overproduction in Saccharomyces cerevisiae. Journal of Industrial Microbiology and Biotechnology 44, 107-117. 10.1007/s10295-016-1855-2.

      (2) Kanehisa, M., Furumichi, M., Sato, Y., Matsuura, Y., and Ishiguro-Watanabe, M. (2025). KEGG: biological systems database as a model of the real world. Nucleic Acids Res 53, D672-D677. 10.1093/nar/gkae909.

      (3) Shingai, R., Wakabayashi, T., Sakata, K., and Matsuura, T. (2005). Chemotaxis of Caenorhabditis elegans during simultaneous presentation of two water-soluble attractants, L-lysine and chloride ions. Comparative biochemistry and physiology. Part A, Molecular & integrative physiology 142, 308-317. 10.1016/j.cbpa.2005.07.010.

      (4) Dirksen, P., Assié, A., Zimmermann, J., Zhang, F., Tietje, A.M., Marsh, S.A., Félix, M.A., Shapira, M., Kaleta, C., Schulenburg, H., and Samuel, B.S. (2020). CeMbio - The Caenorhabditis elegans Microbiome Resource. G3 (Bethesda, Md.) 10, 3025-3039. 10.1534/g3.120.401309.

      (5) Edwards, C., Canfield, J., Copes, N., Brito, A., Rehan, M., Lipps, D., Brunquell, J., Westerheide, S.D., and Bradshaw, P.C. (2015). Mechanisms of amino acid-mediated lifespan extension in Caenorhabditis elegans. BMC genetics 16, 8. 10.1186/s12863-015-0167-2.

      (6) Wang, H., Wang, J., and Zhang, Z. (2018). Leucine Exerts Lifespan Extension and Improvement in Three Types of Stress Resistance (Thermotolerance, Anti-Oxidation and Anti-UV Irradiation) in C. elegans. Journal of Food and Nutrition Research 6, 665-673.

      (7) Crombie, T.A., McKeown, R., Moya, N.D., Evans, K.S., Widmayer, S.J., LaGrassa, V., Roman, N., Tursunova, O., Zhang, G., Gibson, S.B., et al. (2023). CaeNDR, the Caenorhabditis Natural Diversity Resource. Nucleic Acids Research 52, D850-D858. 10.1093/nar/gkad887.

      (8) Cook, D.E., Zdraljevic, S., Roberts, J.P., and Andersen, E.C. (2017). CeNDR, the Caenorhabditis elegans natural diversity resource. Nucleic Acids Res 45, D650-D657. 10.1093/nar/gkw893.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public review): 

      Summary: 

      This work by Al-Jezani et al. focused on characterizing clonally derived MSC populations from the synovium of normal and osteoarthritis (OA) patients. This included characterizing the cell surface marker expression in situ (at time of isolation), as well as after in vitro expansion. The group also tried to correlate marker expression with trilineage differentiation potential. They also tested the efficacy of the different subpopulations in repairing cartilage in a rat model of OA. The main finding of the study is that CD47hi MSCs may have a greater capacity to repair cartilage than CD47lo MSCs, suggesting that CD47 may be a novel marker of human MSCs that have enhanced chondrogenic potential.

      Strengths: 

      Studies on cell characterization of the different clonal populations isolated indicate that the MSCs are heterogeneous and that traditional cell surface markers for MSCs do not accurately predict the differentiation potential of MSCs. While this has been previously established in the field of MSC therapy, the authors did attempt to characterize clones derived from single cells, as well as evaluate the marker profile at the time of isolation. While the outcome of heterogeneity is not surprising, the methods used to isolate and characterize the cells were well developed. The interesting finding of the study is the identification of CD47 as a potential MSC marker that could be related to chondrogenic potential. The authors suggest that MSCs with high CD47 repaired cartilage more effectively than MSCs with low CD47 in a rat OA model.

      Weaknesses: 

      While the identification of CD47 as a novel MSC marker could be important to the field of cell therapy and cartilage regeneration, there was a lack of robust data to support the correlation of CD47 expression to chondrogenesis. The authors indicated that the proteomics suggested that the MSC subtype expressed significantly more CD47 than the non-MSC subtype. However, it was difficult to appreciate where this was shown. It would be helpful to clearly identify where in the figure this is shown, especially since it is the key result of the study. The authors were able to isolate CD47hi and CD47 low cells. While this is exciting, it was unclear how many cells could be isolated and whether they needed to be expanded before being used in vivo. Additional details for the CD47 studies would have strengthened the paper. Furthermore, the CD47hi cells were not thoroughly characterized in vitro, particularly for in vitro chondrogenesis. More importantly, the in vivo study where the CD47hi and CD47lo MSCs were injected into a rat model of OA lacked experimental details regarding how many cells were injected and how they were labeled. No representative histology was presented and there did not seem to be a statistically significant difference between the OARSI score of the saline injected and MSC injected groups. The repair tissue was stained for Sox9 expression, which is an important marker of chondrogenesis but does not show production of cartilage. Expression of Collagen Type II would be needed to more robustly claim that CD47 is a marker of MSCs with enhanced repair potential. 

      Reviewer #2 (Public review): 

      Summary: 

      This is a compelling study that systematically characterized and identified clonal MSC populations derived from normal and osteoarthritis human synovium. There is immense growth in the focus on synovial-derived progenitors in the context of both disease mechanisms and potential treatment approaches, and the authors sought to understand the regenerative potential of synovial-derived MSCs. 

      Strengths: 

      This study has multiple strengths. MSC cultures were established from an impressive number of human subjects, and rigorous cell surface protein analyses were conducted, at both pre-culture and post-culture timepoints. In vivo experiments using a rat DMM model showed beneficial therapeutic effects of MSCs vs non-MSCs, with compelling data demonstrating that only "real" MSC clones incorporate into cartilage repair tissue and express Prg4. Proteomics analysis was performed to characterize non-MSC vs MSC cultures, and high CD47 expression was identified as a marker for MSC. Injection of CD47-Hi vs CD47-Low cells in the same rat DMM model also demonstrated beneficial effects, albeit only based on histology. A major strength of these studies is the direct translational opportunity for novel MSC-based therapeutic interventions, with high potential for a "personalized medicine" approach. 

      Weaknesses: 

      Weaknesses of this study include the rather cursory assessment of the OA phenotype in the rat model, confined entirely to histology (i.e. no microCT, no pain/behavioral assessments, no molecular readouts). It is somewhat unclear how the authors converged on CD47 vs the other factors identified in the proteomics screen, and additional information is needed to understand whether true MSCs only engraft in articular cartilage or also in ectopic cartilage (in the context of osteophyte/chondrophyte formation). Some additional discussion and potential follow-up analyses focused on other cell surface markers recently described to identify synovial progenitors is also warranted. A conceptual weakness is the lack of discussion or consideration of the multiple recent studies demonstrating that DPP4+ PI16+ CD34+ stromal cells (i.e. the "universal fibroblasts") act as progenitors in all mesenchymal tissues, and their involvement in the joint is actively being investigated. Thus, it seems important to understand how the MSCs of the present study are related to these DPP4+ progenitors. Despite these areas for improvement, this is a strong paper with a high degree of rigor, and the results are compelling, timely, and important. 

      Overall, the authors achieved their aims, and the results support not just the therapeutic value of clonally-isolated synovial MSCs but also the immense heterogeneity in stromal cell populations (containing true MSCs and non-MSCs) that must be investigated further. Of note, the authors employed the ISCT criteria to characterize MSCs, with mixed results in pre-culture and post-culture assessments. This work is likely to have a long-term impact on methodologies used to culture and study MSCs, in addition to advancing the field's knowledge about how synovial-derived progenitors contribute to cartilage repair in vivo.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      In all figures, it would be beneficial to report the sample number used for the data reported. It is difficult to appreciate the statistical analysis without that information.

      Understood, the sample number and replicates have been added to each figure legend.

      Please check that Table S7 is part of the manuscript. It could not be found.

      It was added as an additional Excel file since it was too large to fit in the Word document.

      Lines 377-379 (Figure 2E): the authors write that rats receiving MSCs had a significantly lower OARSI and Krenn score vs. rats injected with non-MSCs. However, none of the bars indicating statistical significance run between these two groups. Please verify the text and figure.

      This has been corrected

      The details surrounding the labeling of the cells with tdTomato were not presented in the methods. 

      This has been added to the methods

      The fluorescent antibodies used should be listed and more details provided in the methods rather than a general statement that fluorescent antibodies were used.

      Our apologies, the clones and companies have been added.

      Additional information on the CD47 experiments (# cells, # animals) would have strengthened the study.

      This has been added to the methods and figure legend.

      Reviewer #2 (Recommendations for the authors): 

      My comments span minor corrections, requests for additional analyses, some suggestions for additional experiments, and requests for additional discussion of recent important studies. 

      Introduction: 

      The introduction is thorough and well-written. I recommend a brief discussion about the emerging evidence demonstrating that DPP4+ PI16+ CD34+ synovial cells, i.e. the "universal fibroblasts", act as stromal progenitors in development, homeostasis, and disease. Relevant osteoarthritis-related papers encompass human and mouse studies (PMIDs: 39375009, 38266107, 38477740, 36175067, 36414376).

      This has been added.

      Relatedly, as DPP4 is CD26 and therefore useful as a cell-surface antigen for flow cytometry, sorting, etc, it would be interesting to understand the relationship and similarities between the CD47-High cells identified in this study and the DPP4/PI16+ cells previously described. Do they overlap in phenotype/identity?

      We have added a new flow cytometry figure to address this question.

      Results: 

      Note typo on Line 311: "preformed" instead of "performed". Line 313: "prolife" instead of "profile".

      Thank you for catching these.

      The identified convergence of the cell surface marker profile of all normal and OA clones in culture is a highly intriguing result. Do the authors have stored aliquots of these cells to demonstrate whether this would also occur in soft substrate, i.e. low stiffness culture conditions? This could be done with standard dishes coated with bulk collagen or with commercially available low-stiffness dishes (1 kPa). This is relevant to multiple studies demonstrating the induction of a myofibroblast-like phenotype by stromal cells cultured on high-stiffness plastic or glass. This is also the experiment where assessment of DPP4/CD26 could be added, if possible.

      While we agree it would be interesting to determine the mechanism by which the cells' phenotypes converge, we would argue that it is outside the scope of the current manuscript. We have instead added a sentence to the discussion.

      Line 353 regarding the use of CD68 as a negative gate: can the authors please comment on why they employed CD68 here and not CD45? While monocytes/macs/myeloid cells are the most abundant immune cells in synovium, CD45 would more comprehensively exclude all immune cells.

      That is a fair point, and we really don’t have any reason to have picked CD68 over CD45. In our opinion either would be a fair negative marker to use based on the literature. 

      Fig 2, minor suggestion: consider adding "MSC vs non-MSC" on the experimental schematic to more comprehensively summarize the experiment. 

      This has been modified 

      Fig 2E should show all individual datapoints, not just bar graphs. 

      This has been modified

      Fig 2: Given the significant reduction in Krenn score in DMM-MSC injected knees compared to DMM-saline knees, Fig 2 should also show representative images of the synovial phenotype to demonstrate which aspects of synovial pathology were mitigated. Was the effect related to lining hyperplasia, subsynovial infiltrate, fibrosis, etc? Similarly, can the authors narrate which aspects of the OARSI score drove the treatment effect (proteoglycans vs structure vs osteophytes, etc). 

      We have added a new sup figure breaking down the Krenn score as well as higher magnification images of representative synovium.

      Fig 2: In the absence of microCT imaging, can the authors quantify subchondral bone morphometrics using multiple histological sections? The tibial subchondral bone in Fig 2D appears protected from sclerosis/thickening.

      Unfortunately, this is beyond what we are able to add to the manuscript.

      The Fig 3 results are highly compelling and interesting. Congratulations.

      Thank you very much.

      Fig 4A: the cell highlighted in the high-mag zoom box in Fig 4A appears to be localized within the joint capsule or patellar tendon (it is unclear which anatomic region this image represents). The highly aligned nature of the tissue and cells along a fibrillar geometry indicates that this is not synovium. The interface between synovium and the tissue in question can be clearly observed in this image. Please choose an image more representative of synovium.

      We completely agree with the reviewer's assessment. However, it is the synovium that overlays this tissue (Fig 4A arrow). We are simply showing that very few MSCs took up residence in the synovium or the adjacent tissues.

      Fig 4C and F: please show individual data points.

      This has been added

      Fig 5D: I see DPP4 and ITGA5 were also hits in the proteomics analysis, which is intriguing. Besides my comments/suggestions regarding DPP4 above, please note this recent paper identifying a ITGA5+ synovial fibroblast subset that orchestrates pathological crosstalk with lymphocytes in RA, PMID: 39486872

      Thank you for the information. We have added the reference in the results section. 

      Fig 5B-D: How did the authors converge on CD47 as the target for follow-up study? It does not appear to be a differentially-expressed protein based on the Volcano plot in Fig 5B, and it's unclear why it is a more important factor than any of the other proteins shown in the network diagram in Fig 5D, e.g. CTSL, ITGA5, DPP4. Can the authors add a quantitative plot supporting their statement "the MSC sub-type expressed significantly more CD47 than the non-MSCs" on Line 458? 

      We have re-written this line. It was incorrect to discuss amount of CD47. That was shown later with the flow analysis.  

      Fig 6D: Please show individual data points and also representative histology images to demonstrate the nature of the phenotypic effect.

      This has been added. 

      Fig 6E-F: In what anatomic region are these images? Please add anatomic markers to clarify the location and allow the reader to interpret whether this is articular cartilage or ectopic cartilage

      We have redone the figure to show the area as requested.

      Relevant to this, do the authors observe this type of cellular engraftment in ectopic cartilage/osteophytes or only in articular cartilage? Understanding the contribution of these cells to the formation/remodeling of various cartilage types in the context of OA is a critical aspect of this line of investigation.

      We didn’t see any contribution of these cells to ectopic cartilage formation and are actively working on a follow-up study addressing this point specifically.

      Discussion: 

      Besides my comments regarding DPP4 and ITGA5 above, the authors may also consider discussing PMID: 37681409 (JCI Insight 2023), which demonstrates that adult Prg4+ progenitors derived from synovium contribute to articular cartilage repair in vivo. 

      We agree that there are numerous markers we could look at in future studies and that other people in the field are actively studying.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Summary:

      The manuscript submitted by Langenbacher et al., entitled "Rtf1-dependent transcriptional pausing regulates cardiogenesis", describes very interesting and highly impactful observations about the function of Rtf1 in cardiac development. Over the last few years, the Chen lab has published novel insights into the genes involved in cardiac morphogenesis. Here, they used the mouse model, the zebrafish model, cellular assays, single cell transcription, chemical inhibition, and pathway analysis to provide a comprehensive view of Rtf1 in RNAPII (Pol2) transcription pausing during cardiac development. They also conducted knockdown-rescue experiments to dissect the functions of Rtf1 domains.

      Strengths:

      The most interesting discovery is the connection between Rtf1 and CDK9 in regulating Pol2 pausing as an essential step in normal heart development. The design and execution of these experiments also demonstrate a thorough approach to revealing a previously underappreciated role of Pol2 transcription pausing in cardiac development. This study also highlights the potential amelioration of related cardiac deficiencies using small molecule inhibitors against cyclin dependent kinases, many of which are already clinically approved, while many other specific inhibitors are at various preclinical stages of development for the treatment of other human diseases. Thus, this work is impactful and highly significant. 

      We thank the reviewer for appreciating our work.

      Reviewer #2 (Public Review): 

      Summary: 

      Langenbacher et al. examine the requirement of Rtf1, a component of the PAF1C, which regulates transcriptional pausing in cardiac development. The authors first confirm their previous morphant study with newly generated rtf1 mutant alleles, which recapitulate the defects in cardiac progenitor and differentiation gene expression observed previously in morphants. They then examine the conservation of Rtf1 in mouse embryos and embryonic stem cell-derived cardiomyocytes. Conditional loss of Rtf1 in mesodermal lineages and depletion in murine ESCs demonstrate a failure to turn on cardiac progenitor and differentiation marker genes, supporting conservation of Rtf1 in promoting cardiac development. The authors subsequently employ bulk RNA-seq on flow-sorted hand2:GFP+ cells and multiomic single-cell RNA-seq on whole Rtf1-depleted embryos at the 10-12 somite stage. These experiments corroborate that genes associated with cardiac and muscle development are lost. Furthermore, the differentiation trajectories suggest that the expression of genes associated with cardiac maturation is not initiated. Structure-function analysis supports that the Plus3 domain is necessary for its function in promoting cardiac progenitor formation. ChIP-seq for RNA Pol II on 10-12 somite stage embryos suggests that Rtf1 is required for proper promoter pausing. This defect can be partially rescued in rtf1 mutants with a pharmacological inhibitor of Cdk9, which inhibits elongation.

      Strengths: 

      Many aspects of the data are strong, which support the basic conclusions of the authors that Rtf1 is required for transcriptional pausing and has a conserved requirement in vertebrate cardiac development. Areas of strength include the genetic data supporting the conserved requirement for Rtf1 in promoting cardiac development, the complementary bulk and single-cell RNA-sequencing approaches providing some insight into the gene expression changes of the cardiac progenitors, the structure-function analysis supporting the requirement of the Plus3 domain, and the pharmacological epistasis combined with the RNA Pol II ChIP-seq, supporting the mechanism implicating Cdk9 in the Rtf1 dependent mechanism of RNA Pol II pausing. 

      We thank the reviewer for the summary and for recognizing many strengths of our work. 

      Weaknesses: 

      While most of the basic conclusions are supported by the data, there are a number of analyses for which it is unclear why the experiments were performed the way they were, and some places where the data presently do not support the interpretations. One of the conclusions is that the phenotype affects the maturation of the cardiomyocytes and they are arresting in an immature state. However, this seems to be mostly derived from picking a few candidates from the single cell data in Fig. 6. If that were the case, wouldn't the expectation be to observe relatively normal expression of earlier marker genes required for specification, such as Nkx2.5 and Gata5/6? The in situ expression analysis from fish and mice (Fig. 2 and Fig. 3) and bulk RNA-seq (Fig. 5) seem to suggest that there are pretty early specification and differentiation defects. While some genes associated with cardiac development are not changed, many of these are not specific to cardiomyocyte progenitors and are expressed broadly throughout the ALPM. Similarly, it is not clear why a consistent set of cardiac progenitor genes (for instance mef2ca, nkx2.5, and tbx20) was not analyzed for all the experiments, in particular with the single cell analysis.

      A major conclusion of our study is that Rtf1 deficiency impairs myocardial lineage differentiation from mesoderm, as suggested by the reviewer. Thus, the main goal of this study is to understand how Rtf1 drives cardiac differentiation from the LPM, rather than the maturation of cardiomyocytes.  Multiple lines of evidence support this conclusion:

      (a) In situ hybridization showed that Rtf1 mutant embryos do not have nkx2.5+ cardiac progenitor cells and subsequently fail to produce cardiomyocytes (Figs. 2, 3).

      (b) RT-PCR analysis showed that knockdown of Rtf1 in mouse embryonic stem cells causes a dramatic reduction of cardiac gene expression and production of significantly fewer beating patches (Fig.4).

      (c) Bulk RNA sequencing revealed significant downregulation of cardiac lineage genes, including nkx2.5 (Fig. 5).

      (d) Single cell RNA sequencing clearly showed that lateral plate mesoderm (LPM) cells are significantly more abundant in Rtf1 morphants, whereas cardiac progenitors are less abundant (Fig. 6 and Fig. 6 Supplement 1-5).

      When feasible, we used cardiac lineage restricted markers in our assays. Nkx2.5 and tbx5a are not highlighted in the single cell analysis because their expression in our sc-seq dataset was too low to examine in the clustering/trajectory analysis.  In this revised manuscript, we provide violin plots showing the low expression levels of these genes in single cells from Rtf1 deficient embryos (Figure 6 Supplement 5).

      The point of the multiomic analysis is confusing. RNA- and ATAC-seq were apparently done at the same time. Yet, the focus of the analysis that is presented is on a small part of the RNA-seq data. This data set could have been more thoroughly analyzed, particularly in light of how chromatin changes may be associated with the transcriptional pausing. This seems to be a lost opportunity. Additionally, how the single cell data is covered in Supplemental Fig. 2 and 3 is confusing. There is no indication of what the different clusters are in the Figure or the legend.

      In this study, we performed single cell multiome analysis and used both scRNAseq and scATACseq datasets to generate reliable clustering.  The scRNAseq analysis reveals how Rtf1 deficiency impacts cardiac differentiation from mesoderm, which inspired us to investigate the underlying mechanism and led to the discovery of defects in Rtf1-dependent transcriptional pause release.

      We agree with the reviewer that deep examination of Rtf1-dependent chromatin changes would provide additional insights into how Rtf1 influences early development and careful examination of the scATACseq dataset is certainly a good future direction.  

      In this revised manuscript, we have updated Fig. 6 Supplement 1 to include the predicted cell types and provide an additional Excel file showing the annotation of all 39 clusters (Supplementary Table 2).

      While the effect of Rtf1 loss on cardiomyocyte markers is certainly dramatic, it is not clear how well the mutant fish have been analyzed and how specific the effect is to this population. It is interpreted that the effects on cardiomyocytes are not due to "transfating" of other cell fates, yet supplemental Fig. 4 shows numerous effects on potentially adjacent cell populations. Minimally, additional data needs to be provided showing the live fish at these stages and marker analysis to support these statements. In some images, it is not clear the embryos are the same stage (one can see pigmentation in the eyes of controls that is not in the mutants/morphants), causing some concern about developmental delay in the mutants.

      Single cell RNA sequencing showed an increased abundance of LPM cells and a reduced abundance of cardiac progenitors in Rtf1 morphants (Fig. 6 and Fig. 6 Supplement 1-5). The reclustering of anterior lateral plate mesoderm (ALPM) cells and their derivatives further showed that cells representing undifferentiated ALPM were increased whereas cells representing all three ALPM derivatives were reduced. These findings indicate a defect in ALPM differentiation.

      The reviewer questioned whether we examined stage-matched embryos. In our assay, Rtf1 mutant embryos were collected from crosses of Rtf1 heterozygotes. Each clutch from these crosses consists of ¼ embryos showing rtf1 mutant phenotypes and ¾ embryos showing wild type phenotypes which were used as control. Mutants and their wild type siblings were fixed or analyzed at the same time.
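
      The ¼ : ¾ split described above is what Mendelian segregation predicts for a cross of two heterozygotes carrying a single recessive allele. As a minimal sketch of that arithmetic (the allele labels "+" and "m", the function name, and the assumption of full recessiveness are illustrative and not taken from the manuscript), the expected genotype fractions can be enumerated as a Punnett square:

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def offspring_genotype_fractions(parent_a, parent_b):
    """Expected fraction of each offspring genotype at a single locus.

    Each parent is a two-character string of alleles, e.g. "+m".
    """
    # Enumerate every pairing of one allele from each parent (Punnett square).
    counts = Counter("".join(sorted(pair)) for pair in product(parent_a, parent_b))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


# Hypothetical labels: "+" = wild-type allele, "m" = mutant allele.
fractions = offspring_genotype_fractions("+m", "+m")

# Assumption: the mutant allele is fully recessive, so only "mm" embryos
# show the mutant phenotype and all others serve as sibling controls.
mutant_phenotype = fractions["mm"]
wild_type_phenotype = 1 - mutant_phenotype

print(mutant_phenotype, wild_type_phenotype)
```

      Under these assumptions, one quarter of each clutch is homozygous mutant and the remaining three quarters (wild type plus heterozygous siblings) provide stage-matched controls from the same clutch.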

      The reviewer questioned the specificity of the Rtf1 deficient cardiac phenotype and pointed out that Rtf1 mutant embryos do not have pigment cells around the eye.  Rtf1 is a ubiquitously expressed transcriptional regulator.  Previous studies in zebrafish have shown that Rtf1 deficiency significantly impacts embryonic development. Rtf1 deficiency causes severe defects in cardiac lineage and neural crest cell development; consequently, Rtf1 deficient embryos do not have cardiomyocytes and pigmentation (Langenbacher et al., 2011, Akanuma et al., 2007, and Jurynec et al., 2019).  We now provide an image showing a 2-day-old Rtf1 mutant embryo and their wild type sibling to illustrate the cardiac, neural crest, and somitogenesis defects caused by loss of Rtf1 activity (Fig. 2 Supplement 1).

      With respect to the transcriptional pausing defects in the Rtf1 deficient embryos, it is not clear from the data how this effect relates to the expression of the cardiac markers. This could have been directly analyzed with some additional sequencing, such as PRO-seq, which would provide a direct analysis of transcriptional elongation.

      We showed that Rtf1 deficiency results in a nearly genome-wide decrease in promoter-proximal pausing and downregulation of cardiac markers. Attenuating transcriptional pause release could restore cardiomyocyte formation in Rtf1 deficient embryos. In this revised manuscript, we provide additional RNAseq data showing that the expression levels of critical cardiac development genes such as nkx2.5, tbx5a, tbx20, mef2ca, mef2cb, ttn.2, and ryr2b are significantly rescued. We agree with the reviewer that further analyses using the PRO-seq approach could provide additional insights, but this is beyond the scope of this manuscript.

      Some additional minor issues include the rationale that sequence conservation suggests an important requirement of a gene (line 137), for which there are many counterexamples; referencing figure panels out of order (in Figs. 4, 7, and 8) relative to the text; and using the morphants for some experiments, such as the rescue, that could have been done in a blinded manner with the mutants.

      We have clarified the rationale in this revised manuscript and made the effort to reference figures in order.

      The reviewer commented that rescue experiments “could have been done in a blinded manner with the mutants”. This was indeed how the flavopiridol rescue and cdk9 knockdown experiments were carried out. Embryos from crosses of Rtf1 heterozygotes were collected, fixed after treatment and subjected to in situ hybridization. Embryos were then scored for cardiac phenotype and genotyped (Fig.8 d-g). Morpholino knockdown was used in genomic experiments because our characterization of rtf1 morphants showed that they faithfully recapitulate the rtf1 mutant phenotype during the timeframe of interest (Fig. 2).

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      This reviewer has a few suggestions below, aimed at improving the clarity and impact of the current study. Once these items are addressed, the manuscript should be of interest to the eLife reader.

      Item 1. Strengthening the interaction between Rtf1 and CDK9 on Pol2 pausing.

      The authors have convincingly shown that the chemical inhibition of CDK9 by flavopiridol can partially rescue the expression of cardiac genes in the zebrafish model. Although flavopiridol is FDA approved and has been a classical inhibitor for the dissection of CDK9 function, it also inhibits related CDKs (flavopiridol (alvocidib) competes with ATP to inhibit CDKs including CDK1, CDK2, CDK4, CDK6, and CDK9, with IC50 values in the 20-100 nM range). Therefore, this study could be more impactful if the authors can provide evidence on which of these CDKs may be most relevant during Rtf1-dependent cardiogenesis. Whether the observed cardiac defect indicates a preferential role for CDK9, or whether other CDKs may also be able to provide partial rescue, may be clarified using additional, more selective small molecules (e.g., BAY1251152 and LDC000067, which are commercially available).

      The reviewer raised a reasonable concern about the specificity of flavopiridol, and we thank the reviewer for the insightful suggestion. To address this question, we performed an orthogonal test using morpholino knockdown to directly target CDK9 and observed the same level of rescue, supporting a critical role of transcription pausing in cardiogenesis.

      Item 2. Differences between CRISPR lines and morphants 

      Much of the work presented used Rtf1 morphants while the authors have already generated 2 CRISPR lines. What is the difference between morphants and mutants? The authors should comment on the similarities and/or differences between using morphants or mutants in their study and whether the same Rtf1-CDK9 connection also occurs in the CRISPR lines.

      The morphology of our mutants (rtf1<sup>LA2678</sup> and rtf1<sup>LA2679</sup>) resembles the morphants and the previously reported ENU-induced rtf1<sup>KT641</sup> allele. Extensive in situ hybridization analysis showed that the morphants faithfully recapitulate the mutant phenotypes (Fig.2). We have performed rescue experiments (flavopiridol and CDK9 morpholino) using Rtf1 mutant embryos and found that inhibiting Cdk9 restores cardiomyocyte formation (Fig.8). 

      Item 3. Discuss the therapeutic relevance of study 

      The authors have already generated a mouse model of Rtf1 Mesp1-Cre knockout where cardiac muscle development is severely derailed (Fig 3B). Thus, a demonstration of a conserved role for CDK9 inhibitor in rescuing cardiogenesis using mouse cells or the mouse model will provide important information on a conserved pathway function relevant to mammalian heart development. In the Discussion, how this underlying mechanistic role may be useful in the treatment of congenital heart disease should be provided.  

      Thank you for the insight. We have incorporated your comments in the discussion. 

      Item 4. Insights into the role of CDK9-Rtf1 in response to stress versus in cardiogenesis. 

      In the Discussion, the authors commented on the role of additional stress-related stimuli such as heat shock and inflammation that have been linked to CDK9 activity. However, the current ms provides the first, endogenous role of Pol2 pausing in a critical developmental step during normal cardiogenesis. The authors should emphasize the novelty and significance of their work by providing a paragraph on the state of knowledge on the molecular mechanisms governing cardiogenesis, then placing their discovery within this framework. This minor addition will also clarify the significance of this work to the broad readership of eLife. 

      Thank you for the suggestion. We have incorporated your comments and elaborated on the novelty and significance of our work in the discussion.

      Reviewer #2 (Recommendations For The Authors): 

      (1) It is difficult to assess what the overt defects are in the embryos at any stage. Images of live embryos were not included in the supplement. Do these have a small, malformed heart tube later or are the embryos just deteriorating due to broad defects?

      The Rtf1 deficient embryos do not produce nkx2.5+ cardiac progenitors. Consequently, we never observed a heart tube or detected cells expressing cardiomyocyte marker genes such as myl7. This finding is consistent with previous reports using rtf1 morphants and rtf1<sup>KT641</sup>, an ENU-induced point mutation allele (Langenbacher et al., 2011 and Akanuma et al., 2007). In this revised manuscript, we provide a live image of 2-day-old wild type and rtf1<sup>LA2679/LA2679</sup> embryos (Fig. 2 Supplement 1). After two days, rtf1 mutant embryos undergo broad cell death.

      (2) In Fig. 2, although the in situs are convincing, there is not a quantitative assessment of expression changes for these genes. This could have been done for the bulk or single cell RNA-seq experiments, but was not, and these genes were not included in the heat maps. A quantitative assessment of these genes would benefit the study.

      The top 40 most significantly differentially expressed genes are displayed in the heatmap presented in Fig.5d. The complete differential gene expression analysis results for our hand2 FACS-based comparison of rtf1 morphants and controls are presented in Supplementary Data File 1. In this revised manuscript, we provide a new supplemental figure with violin plots showing the expression levels of genes of interest in our single cell sequencing dataset (Fig.6 Supplement 5).

      (3) It does not appear that any statistical tests were used for the comparisons in Fig. 2.

      We now provide the statistical data in the legend and Fig.2 b, d, f, h and i.

      (4) It's not clear the magnifications and orientations of the embryos in Fig. 3b are the same. 

      Embryos shown in Fig.3b are at the same magnification. However, because Rtf1 mutant embryos display severe morphological defects, the orientation of mutant embryos was adjusted to examine the cardiac tissue.

      (5) The n's for analysis of MLC2v in WT Rtf1 CKO embryos in Fig. 3b are only 1. At least a few more embryos should be analyzed to confirm that the phenotype is consistent. 

      We have revised the figure and present the number of embryos analyzed and statistics in Fig.3c. 

      (6) A number of figure panels are referred to out of order in the text. Fig. 4E-G are before Fig. 4C, D; Fig. 7C before 7B; Fig. 8D-I before 8A, B. In general, it is easier for the reader if the figure panels are presented in the order they are referred to in the text.

      Revised as suggested.

      (7) While additional genes can be included, it is not clear why the same sets of genes are not examined in the bulk or single-cell RNA-seq as with the in situs or expression was analyzed in embryos. I suggest including the genes like nkx2.5, tbx20, myl7, in all the sequencing analysis. 

      We used the same set of genes in all analyses when possible. However, the low expression of genes such as nkx2.5 and myl7 in our sc-seq dataset precludes them from the clustering/trajectory analysis. In this revised manuscript, we present violin plots showing their expression in wild type and rtf1 morphants (Fig. 6 Supplement 5).

      (8) If a multiomic approach was used, why wasn't its analysis incorporated more into the manuscript? In general, a clearer presentation and deeper analysis of the single cell data would benefit the study. The integration of the RNA and ATAC would benefit the analysis.

      As addressed in our response to the reviewer’s public review, both datasets were used in clustering. Examining changes in chromatin accessibility is certainly interesting, but beyond the scope of this study. 

      (9) Many of the markers analyzed are not cardiac specific or it is not clear they are expressed in cardiac progenitors at the stage of the analysis. Hand2 has broader expression. Additional confirmation of some of the genes through in situ would help the interpretations. 

      Markers used for the in situ hybridization analysis (myl7, mef2ca, nkx2.5, tbx5a, and tbx20) are known for their critical role in heart development. For sc-seq trajectory analyses, most displayed genes (sema3e, bmp6, ttn.2, mef2cb, tnnt2a, ryr2b, and myh7bb) were identified based on their differential expression along the LPM-cardiac progenitor pseudotime trajectory. Rather than selecting genes based on their cardiac specificity, our goal was to examine the progressive gene expression changes associated with cardiac progenitor formation and compare gene expression of wild type and rtf1 deficient embryos.

      (10) Additional labels of the cell clusters are needed for Supplemental Figs. 2 and 3. 

      The cluster IDs were presented on Supplementary Figures 2 and 3. In this revised version, we added predicted cell types to the UMAP (revised Fig.6 Supplement 1) and provided an excel file with this information (revised Supplementary Table 2). 

      (11) On lines 101-102, the interpretation from the previous data is that differentiation of the LPM requires Rtf1. However, later from the single cell data the interpretation based on the markers is that Rtf1 loss affects maturation. However, it is not clear this interpretation is correct or what changed from the single cell data. If that were the case, one would expect to see maintenance of more early marks and subsequent loss of maturation markers, which does not appear to be the case from the presented data.

      Our data suggest that cardiac progenitor formation is not accomplished by simultaneously switching on all cardiac marker genes. Our pseudotime trajectory analysis highlights tnnt2a, ryr2b, and myh7bb as genes that increase in expression in a lagged manner compared to mef2cb (Fig. 6). Thus, the abnormal activation of mef2cb without subsequent upregulation of tnnt2a, ryr2b, and myh7bb in rtf1 morphants suggests a requirement for rtf1 in the progressive gene expression changes required for proper cardiac progenitor differentiation. Our single cell experiment focuses on the process of cardiac progenitor differentiation and does not provide insights into cardiomyocyte maturation. We have edited the text to clarify these interpretations.

      (12) The interpretation that there is not "transfating" is not supported by the shown data. Analysis of markers in other tissues, again with in situ, to show spatially would benefit the study. 

      As stated in our response to the reviewer’s public review, we observed a dramatic increase of ALPM cells, but a decrease of ALPM derivatives including the cardiac lineage. We did not observe the expansion of one ALPM-derived subpopulation at the expense of the others. These observations suggest a defect in ALPM differentiation and argue against the notion that the region of the ALPM that would normally give rise to cardiac progenitors is instead differentiating into another cell type.

      (13) The rationale that sequence conservation means a gene is important (lines 137-139) is not really true. There are a lot of examples of highly conserved genes whose mutants don't have defects.

      We have revised the text to avoid confusion. 

      (14) The data showing that the 8 bp mutations do not affect the RNA transcript is not shown or at least indicated in Fig. 7. It would seem that this experiment could have been done in the mutant embryos, in which case the experiment would have been semi-blinded as the genotyping would occur after imaging.

      The modified Rtf1 wt RNA (Rtf1 wt* in revised Fig. 7) robustly rescued nkx2.5 expression in rtf1 deficient embryos, demonstrating that the 8 bp modifications do not negatively impact the activity of the injected RNA. As stated previously, morpholino knockdown was used in some experiments because our characterization of rtf1 morphants showed that they faithfully recapitulate the rtf1 mutant phenotype during the timeframe of interest.

      (15) Using a technique like PRO-seq at the same stage as the ChIP-seq would complement the ChIP-seq and allow a more detailed analysis of the transcriptional pausing on specific genes observed in WT and mutant embryos. 

      As stated in our response to the reviewer’s public review, we appreciate the suggestion but PRO-seq is beyond the scope of this study.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Review:

      Reviewer #1 (Public review):

      The authors used fluorescence microscopy, image analysis, and mathematical modeling to study the effects of membrane affinity and diffusion rates of MinD monomer and dimer states on MinD gradient formation in B. subtilis. To test these effects, the authors experimentally examined MinD mutants that lock the protein in specific states, including Apo monomer (K16A), ATP-bound monomer (G12V) and ATP-bound dimer (D40A, hydrolysis defective), and compared to wild-type MinD. Overall, the experimental results support the conclusions that reversible membrane binding of MinD is critical for the formation of the MinD gradient, but the binding affinities between monomers and dimers are similar.

      The modeling part is a new attempt to use the Monte Carlo method to test the conditions for the formation of the MinD gradient in B. subtilis. The modeling results provide good support for the observations and find that the MinD gradient is sensitive to different diffusion rates between monomers and dimers. This simulation is based on several assumptions and predictions, which raises new questions that need to be addressed experimentally in the future.  

      Reviewer #3 (Public review):

      This important study by Bohorquez et al examines the determinants necessary for concentrating the spatial modulator of cell division, MinD, at the future site of division and the cell poles. Proper localization of MinD is necessary to bring the division inhibitor, MinC, in proximity to the cell membrane and cell poles where it prevents aberrant assembly of the division machinery. In contrast to E. coli, in which MinD oscillates from pole-to-pole courtesy of a third protein MinE, how MinD localization is achieved in B. subtilis, which does not encode a MinE analog, has remained largely a mystery. The authors present compelling data indicating that MinD dimerization is dispensable for membrane localization but required for concentration at the cell poles. Dimerization is also important for interactions between MinD and MinC, leading to the formation of large protein complexes. Computational modeling, specifically a Monte Carlo simulation, supports a model in which differences in diffusion rates between MinD monomers and dimers lead to concentration of MinD at cell poles. Once there, interaction with MinC increases the size of the complex, further reinforcing diffusion differences. Notably, interactions with MinJ, which has previously been implicated in MinCD localization, are dispensable for concentrating MinD at cell poles, although MinJ may help stabilize the MinCD complex at those locations.

      Comments on revisions:  

      I believe the authors put respectable effort into revisions and addressing reviewer comments, particularly those that focused on the strengths of the original conclusions. The language in the current version of the manuscript is more precise and the overall product is stronger.

      We are happy to learn that the reviewer considers our manuscript ready for publication.  

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):  

      The authors have adequately answered the questions that were raised in my previous comments. Only a few minor revisions are needed for improvement.

      Line 48−49: 'These proteins ensure that cell division occurs at midcell and not close to nascent division sites or cell poles'  

      delete 'nascent division site'  

      This has now been corrected as suggested.

      Line 64−65: 'MinC inhibits polymerization of FtsZ by direct protein-protein interactions and needs to bind to the Walker A-type ATPase MinD for its recruitment to septa or the polar regions of the cell'

      delete 'septa or', because MinD recruits MinC to the cell poles to block polar division, not septal formation.  

      This has now been corrected as suggested.

      Supplemental information:

      Some parameters in Table S1 are missing definitions. If these parameters relate to terms described in the "Methods" section, please add the corresponding parameter symbols after the terms.  

      We would like to thank the reviewer for pointing this out. We have improved Table S1 and corrected the related parameters in the Methods section (lines 605-619).

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review): 

      Summary: 

      This manuscript investigates the biological mechanism underlying the assembly and transport of the AcrAB-TolC efflux pump complex. By combining endogenous protein purification with cryo-EM analysis, the authors show that the AcrB trimer adopts three distinct conformations simultaneously and identify a previously uncharacterized lipoprotein, YbjP, as a potential additional component of the complex. The work aims to advance our understanding of the AcrAB-TolC efflux system in near-native conditions and may have broader implications for elucidating its physiological mechanism. 

      Strengths: 

      Overall, the manuscript is clearly presented, and several of the datasets are of high quality. The use of natively isolated complexes is a major strength, as it minimizes artifacts associated with reconstituted systems and enables the discovery of a novel subunit. The authors also distinguish two major assemblies (the TolC-YbjP sub-complex and the complete pump), which appear to correspond to the closed and open channel states, respectively. The conceptual advance is potentially meaningful, and the findings could be of broad interest to the field.

      Weaknesses: 

      (1) As the identification of YbjP is a key contribution of this work, a deeper comparison with functional "anchor" proteins in other efflux pumps is needed. Including an additional supplementary figure illustrating these structural comparisons would be valuable. 

      We appreciate this helpful suggestion. We will expand the comparative analysis between YbjP and established anchoring or accessory components in other efflux pumps, and we will add a new supplementary figure illustrating these structural relationships.

      (2) The observation of the LTO states in the presence of TolC represents an important extension of previous findings. A more detailed discussion comparing these LTO states to those reported in earlier structural and biochemical studies would improve the clarity and significance of this point. 

      We agree. In the revised manuscript we will expand our discussion of the LTO conformations, including a direct comparison with previously reported structural and biochemical observations, to better contextualize the significance of our findings.

      Reviewer #2 (Public review): 

      Summary: 

      This manuscript reports the high-resolution cryo-EM structures of the endogenous TolC-YbjP-AcrABZ complex and a TolC-YbjP subcomplex from E. coli, identifying a novel accessory subunit. This work is an impressive effort that provides valuable structural insights into this native complex. 

      Strengths: 

      (1) The study successfully determines the structure of the complete, endogenously purified complex, marking a significant achievement. 

      (2) The identification of a previously unknown accessory subunit is an important finding. 

      (3) The use of cryo-EM to resolve the complex, including potential post-translational modifications such as N-palmitoyl and S-diacylglycerol, is a notable highlight. 

      Weaknesses: 

      (1) Clarity and Interpretation: Several points need clarification. Additionally, the description of the sample preparation method, which is a key strength, is currently misplaced and should be introduced earlier. 

      Thank you for pointing this out. We will reorganize the text to introduce the sample preparation strategy earlier and clarify the points that may cause ambiguity.

      (2) Data Presentation: The manuscript would benefit significantly from improved figures. 

      We agree and will revise the figures to improve clarity, consistency, and readability. Additional schematic illustrations will also be included where appropriate.

      (3) Supporting Evidence: The inclusion of the protein purification profile as a supplementary figure is essential. Furthermore, a discussion comparing the endogenous AcrB structure to those obtained in other systems (e.g., liposomes) and commenting on observed lipid densities would strengthen the overall analysis. 

      We appreciate these suggestions. We will add the purification profile and expand the comparison between our endogenous AcrB structure and previously reported structures from reconstituted systems, including a more detailed discussion of lipid densities.

      Reviewer #3 (Public review): 

      Summary: 

      The manuscript "Structural mechanisms of pump assembly and drug transport in the AcrAB-TolC efflux system" by Ge et al. describes the identification of a previously uncharacterized lipoprotein, YbjP, as a novel partner of the well-studied Enterobacterial tripartite efflux pump AcrAB-TolC. The authors present cryo-electron microscopy structures of the TolC-YbjP subcomplex and the complete AcrABZ-TolC-YbjP assembly. While the identification and structural characterization of YbjP are potentially novel, the stated focus of the manuscript (mechanisms of pump assembly and drug transport) is not sufficiently addressed. The manuscript requires reframing to emphasize the principal novelty associated with YbjP and significant development of the other aspects, especially the claimed novelty of the AcrB drug-efflux cycle.

      Strengths: 

      The reported association of YbjP with AcrAB-TolC is novel; however, a recent deposition of a preceding and much more detailed manuscript to the bioRxiv server (Horne et al., https://doi.org/10.1101/2025.03.19.644130) removes much of the immediate novelty.

      Weaknesses: 

      While the identification of YbjP is novel, the authors do not appear to acknowledge the precedence of another work (Horne et al., 2025), and it is not cited within the correct context in the manuscript. 

      We thank the reviewer for raising this important point regarding the independent nature of our work.

      Our study indeed progressed independently. The process began with our purification of an endogenous protein sample containing the AcrAB-TolC efflux pump. During our cryo-EM analysis, we observed an unassigned density in the map, for which we built a preliminary main-chain model. A subsequent search of structural databases, including AlphaFold predictions, allowed us to identify this density as the protein YbjP. It was only after this identification that we became aware of the related preprint by Horne et al. on bioRxiv (posted March 19, 2025).

      Therefore, our structural determination of YbjP was conducted entirely independently. We fully acknowledge and respect the work by Horne et al. and have already cited their preprint in our manuscript. While their detailed structural data, maps, and coordinates are not yet publicly available, we have described their findings appropriately. We agree that our manuscript can better reflect this context and will carefully check for any missing citations to ensure that their contribution is properly and clearly acknowledged.

      We also believe that the two studies are mutually complementary and collectively reinforce the emerging understanding of YbjP.

      Several results presented in the TolC-YbjP section do not represent new findings regarding TolC structure itself.

      We agree that the TolC features we describe are consistent with previously reported structural characteristics. However, these observations could only be confirmed in the context of the newly determined TolC–YbjP subcomplex, which was not available prior to this study. We will clarify this point in the revision to avoid overstating novelty.

      The structure and gating behaviour of TolC should be more thoroughly introduced in the Introduction, including prior work describing channel opening and conformational transitions.

      We appreciate this suggestion and agree that a more comprehensive overview of TolC gating and conformational transitions will strengthen the Introduction. We will revise the text to incorporate relevant prior structural and functional studies.

      The current manuscript does not discuss the mechanistic role of helices H3/H4 and H7/H8 in channel dilation, despite implying that YbjP binding may influence these features.

      Thank you for this comment. The primary novel contributions of this manuscript are the identification of YbjP and the structural characterization of AcrB in three distinct states. The discussion of the dilation mechanism, while included because we observed the closed TolC-YbjP state, is a secondary point. In the revised manuscript, we will expand this discussion as suggested.

      Only the original closed TolC structure is cited, and the manuscript does not address prior mutational studies involving the D396 region, though this residue is specifically highlighted in the presented structures. 

      We appreciate the reviewer drawing attention to this oversight. We will add citations to the relevant mutational and mechanistic studies, including those involving the D396 region, and more clearly discuss these findings in relation to our structural observations.

      The manuscript provides only a general structural alignment between the closed TolC-YbjP subcomplex and the open TolC observed in the full pump assembly. However, multiple open, closed, and intermediate conformations of AcrAB-TolC have already been reported. Thus, YbjP alone cannot be assumed to account for TolC channel gating. A systematic comparison with existing structures is necessary to determine whether YbjP contributes any distinct allosteric modulation. 

      We agree with the reviewer’s assessment and appreciate the constructive suggestion. In our revised manuscript, we will expand the structural comparison to include previously reported open, closed, and intermediate AcrAB–TolC conformations. This expanded analysis will more clearly position our findings within the existing structural framework.

      The analysis of AcrB peristaltic action is superficial, poorly substantiated and, importantly, not novel. Several references to the ATP-synthase cycle have been provided, but this was widely established some 20 years ago (e.g., https://www.science.org/doi/10.1126/science.1131542).

      We thank the reviewer for this comment. We fully acknowledge the foundational studies that established the AcrB functional cycle and its analogy to the ATP-synthase mechanism. While previous work indeed defined the LTO (Loose, Tight, Open) cycle of AcrB, those structures were obtained using AcrB in isolation. In contrast, our endogenous sample, which includes the native constraints of AcrA from above and the presence of AcrZ, reveals conformational changes in the transmembrane and porter domains that differ from those previously reported. We interpret these differences as reflecting a more physiologically relevant mechanism. In our revision, we will provide a detailed discussion to contextualize these distinctions within the existing literature.

      The most significant limitation of the study is the absence of functional characterization of YbjP in vivo or in vitro. While the structural association between YbjP and TolC is interesting, the biological role of YbjP remains unclear.

      We agree that the lack of functional characterization is a limitation of the present work. Our study focuses on structural elucidation and structural analysis. Although the recent preprint you mentioned suggests that YbjP deletion may not produce a strong phenotype, we are still interested in conducting additional experiments to explore its potential roles in future work. We will revise the text to clearly acknowledge this limitation.

      Moreover, the manuscript does not examine structural differences between the presented complex and previously solved AcrAB-TolC or MexAB-OprM assemblies that might support a mechanistic model.

      We thank the reviewer for this suggestion. We will incorporate a more detailed comparative analysis with existing AcrAB–TolC and MexAB–OprM structures and highlight similarities and differences that may inform mechanistic interpretation.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The manuscript by Lu and colleagues demonstrates convincingly that PRRT2 interacts with brain voltage-gated sodium channels to enhance slow inactivation in vitro and in vivo. The work is interesting and rigorously conducted. The relevance to normal physiology and disease pathophysiology (e.g., PRRT2-related genetic neurodevelopmental disorders) seems high. Some simple additional experiments could elevate the impact and make the study more complete.

      Strengths:

      Experiments are conducted rigorously, including experimenter blinding and appropriate controls. Data presentation is excellent and logical. The paper is well written for a general scientific audience.

      Weaknesses:

      There are a few missing experiments and one place where data are over-interpreted.

      (1) An in vitro study of Nav1.6 is conspicuously absent. In addition to being a major brain Na channel, Nav1.6 is predominant in cerebellar Purkinje neurons, which the authors note lack PRRT2 expression. They speculate that the absence of PRRT2 in these neurons facilitates the high firing rate. This hypothesis would be strengthened if PRRT2 also enhanced slow inactivation of Nav1.6. If a stable Nav1.6 cell were not available, then simple transient co-transfection experiments would suffice.

      We thank the reviewer for this suggestion. In the revised manuscript, we will examine whether PRRT2 modulates slow inactivation of Nav1.6 channels using heterologous co-expression experiments.

      (2) To further demonstrate the physiological impact of enhanced slow inactivation, the authors should consider a simple experiment in the stable cell line experiments (Figure 1) to test pulse frequency dependence of peak Na current. One would predict that PRRT2 expression will potentiate 'run down' of the channels, and this finding would be complementary to the biophysical data.

      We agree that examining pulse frequency-dependent changes in peak sodium current would provide a functional readout linking PRRT2-mediated enhancement of slow inactivation to use-dependent channel availability. In the revision, we will include a pulse-train protocol to quantify use-dependent attenuation (“run-down”) of peak sodium current across stimulation trains and will compare this adaptation between control and PRRT2-expressing conditions.
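
      One way such a pulse-train readout could be quantified is sketched below. This is only an illustration, not the authors' analysis pipeline: the function name `use_dependent_attenuation` and all current values are hypothetical, chosen solely to show how each peak is normalized to the first pulse.

```python
def use_dependent_attenuation(peaks):
    """Normalize each pulse's peak current to the first pulse of the train."""
    if not peaks or peaks[0] == 0:
        raise ValueError("need a non-zero first-pulse peak current")
    first = peaks[0]
    return [p / first for p in peaks]


# Hypothetical peak currents in pA (inward current negative); NOT real data.
control_peaks = [-1000.0, -950.0, -920.0, -900.0]
prrt2_peaks = [-1000.0, -800.0, -700.0, -650.0]

control_norm = use_dependent_attenuation(control_peaks)
prrt2_norm = use_dependent_attenuation(prrt2_peaks)

# Fractional run-down at the end of the train: 1 - (last peak / first peak).
control_rundown = 1.0 - control_norm[-1]
prrt2_rundown = 1.0 - prrt2_norm[-1]
```

      In this made-up example, a larger run-down value in the PRRT2 condition would correspond to the potentiated use-dependent attenuation the reviewer predicts.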

      (3) The study of one K channel is limited, and the conclusion from these experiments represents an over-interpretation. I suggest removing these data unless many more K channels (ideally with measurable proxies for slow inactivation) were tested. These data do not contribute much to the story.

      We agree with the reviewer’s assessment. To avoid over-interpretation and to maintain focus on PRRT2-dependent regulation of Nav channel slow inactivation, we will remove the potassium channel dataset and the associated conclusions from the revised manuscript.

      (4) In Figure 2, the authors should confirm that protein is indeed expressed in cells expressing each truncated PRRT2 construct. Absent expression should be ruled out as an explanation for the enhancement of slow inactivation.

      We appreciate the reviewer’s concern regarding expression of the truncated PRRT2 constructs in the Nav1.2 stable cell line, particularly PRRT2(1-266), which shows little effect on slow inactivation of Nav1.2 channels. In the revision, we will include expression controls for each truncation construct in the Nav1.2-expressing cells to rule out lack of expression as an explanation for the observed functional differences.

      Reviewer #2 (Public review):

      Summary:

      As a member of the DspB subfamily, PRRT2 is primarily expressed in the nervous system and has been associated with various paroxysmal neurological disorders. Previous studies have shown that PRRT2 directly interacts with Nav1.2 and Nav1.6, modulating channel properties and neuronal excitability.

      In this study, Lu et al. report that PRRT2 is a physiological regulator of Nav channel slow inactivation, promoting the development of Nav slow inactivation and impeding recovery from it. This effect can be replicated by the C-terminal region (256-346) of PRRT2 and is highly conserved across species, from zebrafish to mouse to human PRRT2. TRARG1 and TMEM233, the other two DspB family members, showed similar effects on Nav1.2 slow inactivation. Co-IP data confirm the interaction between Nav channels and PRRT2. Prrt2-mutant mice, which lack PRRT2 expression, require lower stimulation thresholds for evoking after-discharges when compared to WT mice.

      Strengths:

      (1) This study is well designed, and data support the conclusion that PRRT2 is a potent regulator of slow inactivation of Nav channels.

      (2) This study reveals similar effects on Nav1.2 slow inactivation by PRRT2, TMEM233, and TRARG1, indicating a common regulation of Nav channels by DspB family members (Supplemental Figure 2). A recent study has shown that TMEM233 is essential for ExTxA (a plant toxin)-mediated inhibition on fast inactivation of Nav channels; and PRRT2 and TRARG1 could replicate this effect (Jami S, et al. Nat Commun 2023). It is possible that all three DspB members regulate Nav channel properties through the same mechanism, and exploring molecules that target PRRT2/TRARG1/TMEM233 might be a novel strategy for developing new treatments of DspB-related neurological diseases.

      Weaknesses:

      (1) Previously, the authors have reported that PRRT2 reduces Nav1.2 current density and alters biophysical properties of both Nav1.2 and Nav1.6 channels, including enhanced steady-state inactivation, slower recovery, and stronger use-dependent inhibition (Lu B, et al. Cell Rep 2021, Fig 3 & S5). All those changes are expected to alter neuronal excitability and should be discussed.

      We agree that PRRT2 has been reported to exert multiple effects on Nav channels which are all expected to influence neuronal excitability (Fruscione et al., 2018; Lu et al., 2021; Valente et al., 2023). In the revised manuscript, we will expand the Discussion to integrate these prior findings and to clarify how these PRRT2-dependent changes may interact with (and potentially converge on) modulation of slow inactivation to shape neuronal excitability.

      (2) In this study, the fast inactivation kinetics were examined with a single stimulus at 0 mV, which may not be sufficient to support the conclusion. Inactivation kinetics at additional test potentials should be added.

      We thank the reviewer for this suggestion. In the revision, we will extend our analysis of Nav1.2 fast-inactivation kinetics across a range of test potentials (e.g., -20, -10, 0, +10 and +20 mV) in the presence and absence of PRRT2.

      (3) It is a little surprising that there is no difference in Nav1.2 current density in axon-blebs between WT and Prrt2-mutant mice (Figure 7B). PRRT2 significantly shifts the steady-state slow inactivation curve in the hyperpolarizing direction: at -70 mV, nearly 70% of Nav1.2 channels are inactivated by slow inactivation in cells expressing PRRT2, compared with less than 10% in cells expressing GFP (Figure supplement 1B). With a holding potential of -70 mV, I would expect most Nav channels to be inactivated in axon-blebs from WT mice but not in axon-blebs from Prrt2-mutant mice, and therefore sodium current density should differ in Figure 7B, but it does not. Is there an explanation?

      We thank the reviewer for raising this point. In our axonal bleb recordings, although the holding potential was -70 mV, sodium current density was measured after a hyperpolarizing pre-pulse (-110 mV) to relieve inactivation immediately prior to the test depolarization (as described in the Methods). Thus, the current density measurement in Figure 7B reflects the maximal available current following this recovery step, rather than the steady-state availability at -70 mV. In the revision, we will state this explicitly in the Results and/or figure legend to avoid confusion.

      (4) Besides Nav channels, PRRT2 has been shown to act on Cav2.1 channels as well as molecules involved in neurotransmitter release, which may also contribute to abnormal neuronal activity in Prrt2-mutant mice. These should be mentioned when discussing PRRT2's role in neuronal resilience.

      We agree with the reviewer. In the revised manuscript, we will broaden the Discussion to acknowledge PRRT2 functions beyond Nav channels, including reported roles in Cav2.1 regulation and neurotransmitter release. We will frame the in vivo phenotypes in Prrt2-mutant mice as likely arising from convergent mechanisms—altered intrinsic excitability together with changes in synaptic transmission.

      Reviewer #3 (Public review):

      This paper reveals that the neuronal protein PRRT2, previously known for its association with paroxysmal dyskinesia and infantile seizures, modulates the slow inactivation of voltage-gated sodium ion (Nav) channels, a gating process that limits excitability during prolonged activity. Using electrophysiology, molecular biology, and mouse models, the authors show that PRRT2 accelerates entry of Nav channels into the slow-inactivated state and slows their recovery, effectively dampening excessive excitability. The effect seems evolutionarily conserved, requires the C-terminal region of PRRT2, and is recapitulated in cortical neurons, where PRRT2 deficiency leads to hyper-responsiveness and reduced cortical resilience in vivo. These findings extend the functional repertoire of PRRT2, identifying it as a physiological brake on neuronal excitability. The work provides a mechanistic link between PRRT2 mutations and episodic neurological phenotypes.

      Comments:

      (1) The precise structural interface and the molecular basis of gating modulation remain inferred rather than demonstrated.

      We thank the reviewer for this comment. In the revision, we will make it explicit that our structural modeling is predictive rather than based on experimental evidence. We will also expand the Limitations section to highlight that direct structural and biochemical mapping of the PRRT2-Nav interface (e.g., through targeted mutagenesis, crosslinking, and/or structural determination) will be required to define the binding interface and establish the molecular basis of gating modulation.

      (2) The in vivo phenotype reflects a complex circuit outcome and does not isolate slow-inactivation defects per se.

      We agree with the reviewer. In the revision, we will refine the Discussion to avoid over-attributing the in vivo phenotype to slow-inactivation defects alone and to explicitly state that impaired slow inactivation in Prrt2-mutant mice represents one plausible contributing mechanism to reduced cortical resilience, alongside other PRRT2-dependent processes.

      (3) Expression of PRRT2 in muscle or heart is low, so the cross-isoform claims are likely of limited physiological significance.

      We thank the reviewer for this comment about physiological relevance. In the revised manuscript, we will clarify that our Nav isoform panel was designed to assess mechanistic generality at the channel level rather than to imply broad in vivo relevance across tissues. We will also expand the Discussion to emphasize that any therapeutic strategy involving PRRT2 delivery should consider its consistent effect on slow inactivation across multiple Nav isoforms.

      (4) The mechanistic separation between the trafficking effect of PRRT2 and its gating effects is not clearly resolved.

      We thank the reviewer for raising this important point. In the revision, we will expand the Discussion to clarify why we interpret the effect of PRRT2 on slow inactivation as a gating modulation rather than a secondary consequence of altered channel abundance or localization. First, our slow inactivation measurements are expressed as the fraction of available channels after depolarizing conditioning relative to baseline availability within the same cell (post-/pre-conditioning), which minimizes confounding by differences in initial surface expression. Second, slow inactivation of Nav channels occurs on a rapid, activity-dependent timescale (seconds), whereas appreciable changes in trafficking and surface abundance generally develop over longer intervals (minutes to hours).
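
      The within-cell normalization described in the first point can be illustrated with a minimal sketch. The helper `fraction_available` and the current values are hypothetical and serve only to show that a post-/pre-conditioning ratio is unchanged when the overall current amplitude (a stand-in for surface expression) is scaled.

```python
def fraction_available(pre_peak, post_peak):
    """Fraction of channels still available after conditioning, within one cell."""
    if pre_peak == 0:
        raise ValueError("pre-conditioning peak current must be non-zero")
    return post_peak / pre_peak


# Two hypothetical cells with identical gating but a 4-fold difference in
# current amplitude (values in pA, inward current negative; NOT real data).
low_expression = fraction_available(pre_peak=-500.0, post_peak=-200.0)
high_expression = fraction_available(pre_peak=-2000.0, post_peak=-800.0)
```

      Both hypothetical cells yield the same fraction, which is the sense in which the ratio reduces confounding by differences in initial surface expression.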

      (5) Additional studies with Nav1.6 should be carried out.

      We thank the reviewer for the suggestion. We will include Nav1.6 slow inactivation experiments in the revised manuscript.

    1. Author response:

      eLife Assessment

      This important study fills a major geographic and temporal gap in understanding Paleocene mammal evolution in Asia and proposes an intriguing "brawn before bite" hypothesis grounded in diverse analytical approaches. However, the findings are incomplete because limitations in sampling design - such as the use of worn or damaged teeth, the pooling of different tooth positions, and the lack of independence among teeth from the same individuals - introduce uncertainties that weaken support for the reported disparity patterns. The taxonomic focus on predominantly herbivorous clades also narrows the ecological scope of the results. Clarifying methodological choices, expanding the ecological context, and tempering evolutionary interpretations would substantially strengthen the study.

      We thank Dr. Rasmann for the constructive evaluation of our manuscript. In light of the reviewers’ comments, we plan to revise our study by (1) expanding the fossil sample description, including a detailed account of how extremely worn or damaged teeth were excluded from all analyses; (2) expanding the reporting of analyses of individual tooth positions and tempering the interpretation of the pooled samples in light of the issues raised by the reviewers; and (3) providing a more comprehensive introduction that includes an overview of the Paleocene mammal faunas of south China, whose fossil record samples certain clades unevenly while others are extremely rare, and an explanation of why the currently available fossil samples do not permit an adequate whole-fauna analysis across the three Paleocene land mammal age time bins in China. We believe these revisions will substantially strengthen the study’s robustness and its impact on understanding the ecomorphological evolution of the earliest abundant placental mammals during the Paleocene in Asia.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This work provides valuable new insights into the Paleocene Asian mammal recovery and diversification dynamics during the first ten million years post-dinosaur extinction. Studies that have examined the mammalian recovery and diversification post-dinosaur extinction have primarily focused on the North American mammal fossil record, and it's unclear if patterns documented in North America are characteristic of global patterns. This study examines dietary metrics of Paleocene Asian mammals and found that there is a body size disparity increase before dietary niche expansion and that dietary metrics track climatic and paleobotanical trends of Asia during the first 10 million years after the dinosaur extinction.

      Strengths:

      The Asian Paleocene mammal fossil record is greatly understudied, and this work begins to fill important gaps. In particular, the use of interdisciplinary data (i.e., climatic and paleobotanical) is really interesting in conjunction with observed dietary metric trends.

      Weaknesses:

      While this work has the potential to be exciting and contribute greatly to our understanding of mammalian evolution during the first 10 million years post-dinosaur extinction, the major weakness is in the dental topographic analysis (DTA) dataset.

      There are several specimens in Figure 1 that have broken cusps, deep wear facets, and general abrasion. Thus, any values generated from DTA are not accurate and cannot be used to support their claims. Furthermore, the authors analyze all tooth positions at once, which makes this study seem comprehensive (200 individual teeth), but it's unclear what sort of noise this introduces to the study. Typically, DTA studies will analyze a singular tooth position (e.g., Pampush et al. 2018 Biol. J. Linn. Soc.), allowing for more meaningful comparisons and an understanding of what value differences mean. Even so, the dataset consists of only 48 specimens. This means that even if all the specimens were pristinely preserved and generated DTA values could be trusted, it's still only 48 specimens (representing 4 different clades) to capture patterns across 10 million years. For example, the authors note that their results show an increase in OPCR and DNE values from the middle to the late Paleocene in pantodonts. However, if a singular tooth position is analyzed, such as the lower second molar, the middle and late Paleocene partitions are only represented by a singular specimen each. With a sample size this small, it's unlikely that the authors are capturing real trends, which makes the claims of this study highly questionable.

      We thank Reviewer 1 for their careful review of our manuscript. A major external limitation of the application of DTA to fossil samples is the availability of specimens. Whereas a typical study design using extant or geologically younger/more abundant fossil species would preferably sample much larger quantities of teeth from each treatment group (time bins, in our case), the rarity of well-preserved Paleocene mammalian dentitions in Asia necessitates the analysis of small samples in order to make observations regarding major trends in a region and time period otherwise impossible to study (see Chow et al. 1977). That said, we plan to clarify methodological details in response to the reviewer’s comments, including a more comprehensive explanation of our criteria for excluding broken tooth crowns from the analyses. We also plan to expand our reporting of results from individual tooth position analyses, potentially including resampling and/or simulation analyses to assess the effect of small and uneven samples on our interpretation of results. Lastly, we plan to revise the discussion and conclusion accordingly, including a more explicit distinction between well-supported findings that emerge from the planned sensitivity analyses and those that are more speculative and tentative in nature.
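
      As a rough illustration of what a resampling analysis of this kind could look like, the sketch below computes a percentile bootstrap interval for the variance of a small sample (variance being one common disparity measure). It is an assumption-laden example, not the analysis planned by the authors: the function name, the choice of variance, and the default parameters are all hypothetical.

```python
import random
import statistics


def bootstrap_variance_interval(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the sample variance of `values`."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    estimates = []
    for _ in range(n_boot):
        # Resample with replacement, keeping the original sample size.
        resample = [rng.choice(values) for _ in range(n)]
        estimates.append(statistics.variance(resample))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

      With very small samples, such an interval is typically wide, which is one way to make the uncertainty in a per-time-bin disparity estimate explicit.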

      Chow, M., Zhang, Y., Wang, B., and Ding, S. (1977). Paleocene mammalian fauna from the Nanxiong Basin, Guangdong Province. Paleontol. Sin. New Ser. C 20, 1–100.

      Reviewer #2 (Public review):

      Summary:

      This study uses dental traits of a large sample of Chinese mammals to track evolutionary patterns through the Paleocene. It presents and argues for a 'brawn before bite' hypothesis - mammals increased in body size disparity before evolving more specialized or adapted dentitions. The study makes use of an impressive array of analyses, including dental topographic, finite element, and integration analyses, which help to provide a unique insight into mammalian evolutionary patterns.

      Strengths:

      This paper helps to fill in a major gap in our knowledge of Paleocene mammal patterns in Asia, which is especially important because of the diversification of placentals at that time. The total sample of teeth is impressive and required considerable effort for scanning and analyzing. And there is a wealth of results for DTA, FEA, and integration analyses. Further, some of the results are especially interesting, such as the novel 'brawn before bite' hypothesis and the possible link between shifts in dental traits and arid environments in the Late Paleocene. Overall, I enjoyed reading the paper, and I think the results will be of interest to a broad audience.

      Weaknesses:

      I have four major concerns with the study, especially related to the sampling of teeth and taxa, that I discuss in more detail below. Due to these issues, I believe that the study is incomplete in its support of the 'brawn before bite' hypothesis. Although my concerns are significant, many of them can be addressed with some simple updates/revisions to analyses or text, and I try to provide constructive advice throughout my review.

      (1) If I understand correctly, teeth of different tooth positions (e.g., premolars and molars), and those from the same specimen, are lumped into the same analyses. And unless I missed it, no justification is given for these methodological choices (besides testing for differences in proportions of tooth positions per time bin; L902). I think this creates some major statistical concerns. For example, DTA values for premolars and molars aren't directly comparable (I don't think?) because they have different functions (e.g., greater grinding function for molars). My recommendation is to perform different disparity-through-time analyses for each tooth position, assuming the sample sizes are big enough per time bin. Or, if the authors maintain their current methods/results, they should provide justification in the main text for that choice.

      We thank Reviewer 2 for raising several issues worthy of clarification. Separate analyses for individual tooth positions were performed but not emphasized in the first version of the study. In our revised version we plan to highlight the nuances of the results from premolar versus molar partition analyses.

      Also, I think lumping teeth from the same specimen into your analyses creates a major statistical concern because the observations aren't independent. In other words, the teeth of the same individual should have relatively similar DTA values, which can greatly bias your results. This is essentially the same issue as phylogenetic non-independence, but taken to a much greater extreme.

      It seems like it'd be much more appropriate to perform specimen-level analyses (e.g., Wilson 2013) or species-level analyses (e.g., Grossnickle & Newham 2016) and report those results in the main text. If the authors believe that their methods are justified, then they should explain this in the text.

      We plan to emphasize individual tooth position analyses in our revisions and to provide a stronger justification for our current treatment of multiple teeth from the same individual specimens as independent samples. We recognize the statistical nonindependence raised by Reviewer 2, but we would point out that, from an ecomorphological perspective, it is unclear to us that the heterodont dentition of these early Cenozoic placental mammals should represent a single ecological signal (and thus warrant using only a single tooth position as representative of an individual’s DTA values). We plan to closely examine the nature of nonindependence in the DTA data within individuals in order to identify a balanced approach that maximizes the information content of the relatively small and rare fossil samples used while minimizing signal nonindependence across the dentition.
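
      For readers unfamiliar with the specimen-level alternative raised by the reviewer, a minimal sketch is given below: tooth-level values are averaged within each specimen so that each individual contributes one observation. The function `specimen_means`, the specimen labels, and the values are hypothetical; this is not the approach used in the study.

```python
from collections import defaultdict
from statistics import mean


def specimen_means(records):
    """Collapse (specimen_id, value) pairs to one mean value per specimen."""
    grouped = defaultdict(list)
    for specimen_id, value in records:
        grouped[specimen_id].append(value)
    return {sid: mean(vals) for sid, vals in grouped.items()}


# Hypothetical tooth-level values for two specimens; NOT real measurements.
teeth = [("A", 100.0), ("A", 110.0), ("A", 120.0), ("B", 200.0)]
per_specimen = specimen_means(teeth)
```

      Averaging removes within-individual pseudo-replication at the cost of discarding variation among tooth positions, which is the trade-off the response describes.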

      (2) Maybe I misunderstood, but it sounds like the sampling is almost exclusively clades that are primarily herbivorous/omnivorous (Pantodonta, Arctostylopida, Anagalida, and maybe Tillodonta), which means that the full ecomorphological diversity of the time bins is not being sampled (e.g., insectivores aren't fully sampled). Similarly, the authors say that they "focused sampling" on those major clades and "Additional data were collected on other clades ... opportunistically" (L628). If they favored sampling of specific clades, then doesn't that also bias their results?

      If the study is primarily focused on a few herbivorous clades, then the Introduction should be reframed to reflect this. You could explain that you're specifically tracking herbivore patterns after the K-Pg.

      We plan to revise the introduction section to more accurately reflect the emphasis on those clades. However, we would note that conventional dietary ecomorphology categories used to characterize later-branching placental mammals are likely to be less informative when applied to their Paleocene counterparts. Although there are dental morphological traits that began to characterize major placental clades during the Paleocene, distinctive dietary ecologies have not been demonstrated for most of the clade representatives studied. Thus, insectivory was probably not restricted to “Insectivora”, nor carnivory to early Carnivoramorpha or “Creodonta”, each of which represented less than 5% of the taxonomic richness during the Paleocene in China (Wang et al. 2007).

      Wang, Y., Meng, J., Ni, X., and Li, C. (2007). Major events of Paleogene mammal radiation in China. Geol. J. 42, 415–430.

      (3) There are a lot of topics lacking background information, which makes the paper challenging to read for non-experts. Maybe the authors are hindered by a short word limit. But if they can expand their main text, then I strongly recommend the following:

      (a) The authors should discuss diets. Much of the data are diet correlates (DTA values), but diets are almost never mentioned, except in the Methods. For example, the authors say: "An overall shift towards increased dental topographic trait magnitudes ..." (L137). Does that mean there was a shift toward increased herbivory? If so, why not mention the dietary shift? And if most of the sampled taxa are herbivores (see above comment), then shouldn't herbivory be a focal point of the paper?

      We plan to revise the text to make clearer connections between DTA and dietary inferences, and at the same time advise caution in making one-to-one linkages between them. Broadly speaking, dental indices such as DTA are phenotypic traits, and as with other phenotypic traits, the strength of structure-function relationships needs to be explicitly established before dietary ecological inferences can be confidently made. There is, to date, no consistent connection between dental topology and tooth use proxies and biomechanical traits in extant non-herbivorous species (e.g., DeSantis et al. 2017, Tseng and DeSantis 2024), and in our analyses, FEA and DTA generally did not show strong correlations with each other. Thus, we plan to continue to exercise care in interpreting DTA data as dietary data.

      DeSantis, L.R.G., Tseng, Z.J., Liu, J., Hurst, A., Schubert, B.W., and Jiangzuo, Q. (2017). Assessing niche conservatism using a multiproxy approach: dietary ecology of extinct and extant spotted hyenas. Paleobiology 43, 286–303. doi:10.1017/pab.2016.45

      Tseng, Z.J., and DeSantis, L.R. (2024). Relationship between tooth macrowear and jaw morphofunctional traits in representative hypercarnivores. PeerJ 12, e18435.

      (b) The authors should expand on "we used dentitions as ecological indicators" (L75). For non-experts, how/why are dentitions linked to ecology? And, again, why not mention diet? A strong link between tooth shape and diet is a critical assumption here (and one I'm sure that all mammalogists agree with), but the authors don't provide justification (at least in the Introduction) for that assumption. Many relevant papers cited later in the Methods could be cited in the Introduction (e.g., Evans et al. 2007).

      Thank you for this suggestion. We plan to expand the introduction section to better contextualize the methodological basis for the work presented.

      (c) Include a better introduction of the sample, such as explicitly stating that your sample only includes placentals (assuming that's the case) and is focused on three major clades. Are non-placentals like multituberculates or stem placentals/eutherians found at Chinese Paleocene fossil localities and not sampled in the study, or are they absent in the sampled area?

      We thank Reviewer 2 for raising this important point worthy of clarification. Multituberculates are completely absent from the first two land mammal ages in the Paleocene of Asia, and non-placentals are rare in general (Wang et al. 2007). We plan to provide more context for the taxonomic sampling choices made in the study.

      Wang, Y., Meng, J., Ni, X., and Li, C. (2007). Major events of Paleogene mammal radiation in China. Geol. J. 42, 415–430.

      (d) The way in which "integration" is being used should be defined. That is a loaded term which has been defined in different ways. I also recommend providing more explanation on the integration analyses and what the results mean.

      If the authors don't have space to expand the main text, then they should at least expand on the topics in the supplement, with appropriate citations to the supplement in the main text.

      We plan to clarify our usage of “integration” to enable readers to accurately interpret what we mean by it.

      (4) Finally, I'm not convinced that the results fully support the 'brawn before bite' hypothesis. I like the hypothesis. However, the 'brawn before ...' part of the hypothesis assumes that body size disparity (L63) increased first, and I don't think that pattern is ever shown. First, body size disparity is never reported or plotted (at least that I could find) - the authors just show the violin plots of the body sizes (Figures 1B, S6A). Second, the authors don't show evidence of an actual increase in body size disparity. Instead, they seem to assume that there was a rapid diversification in the earliest Paleocene, and thus the early Paleocene bin has already "reached maximum saturation" (L148). But what if the body size disparity in the latest Cretaceous was the same as that in the Paleocene? (Although that's unlikely, note that papers like Clauset & Redner 2009 and Grossnickle & Newham 2016 found evidence of greater body size disparity in the latest Cretaceous than is commonly recognized.) Similarly, what if body size disparity increased rapidly in the Eocene? Wouldn't that suggest a 'BITE before brawn' hypothesis? So, without showing when an increase in body size diversity occurred, I don't think that the authors can make a strong argument for 'brawn before [insert any trait]'.

      Although it's probably well beyond the scope of the study to add Cretaceous or Eocene data, the authors could at least review literature on body size patterns during those times to provide greater evidence for an earliest Paleocene increase in size disparity.

      We plan to provide a broader discussion and any supporting evidence from the Cretaceous and Eocene to either make a stronger case for “brawn before bite”, or to refine what we mean by brawn/size/size disparity.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This Review Article explores the intricate relationship between humans and Mycobacterium tuberculosis (Mtb), providing an additional perspective on TB disease. Specifically, this review focuses on the utilization of systems-level approaches to study TB, while highlighting challenges in the frameworks used to identify the relevant immunologic signals that may explain the clinical spectrum of disease. The work could be further enhanced by better defining key terms that anchor the review, such as "unified mechanism" and "immunological route." This review will be of interest to immunologists as well as those interested in evolution and host-pathogen interactions.

      We thank the editors for reviewing our article and for the primarily positive comments. We accept that better definition and terminology will improve the clarity of the message, and so have changed the wording as suggested above in the revised manuscript.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This is an interesting and useful review highlighting the complex pathways through which pulmonary colonisation or infection with Mycobacterium tuberculosis (Mtb) may progress to develop symptomatic disease and transmit the pathogen. I found the section on immune correlates associated with individuals who have clearly been exposed to and reacted to Mtb but did not develop latent infections particularly valuable. However, several aspects would benefit from clarification.

      Strengths:

      The main strengths lie in the arguments presented for a multiplicity of immune pathways to TB disease.

      Weaknesses:

      The main weaknesses lie in clarity, particularly in the precise meanings of the three figures.

      We accept this point, and have completely changed Figure 2 and expanded the legends for Figures 1 and 3 to maximise clarity.

      I accept that there is a 'goldilocks zone' that underpins the majority of TB cases we see and predominantly reflects different patterns of immune response, but the analogies used need to be more clearly thought through.

      We are glad the reviewer agrees with the fundamental argument of different patterns of immunity, and have revised the manuscript throughout where we feel the analogies could be clarified.

      Reviewer #2 (Public review):

      Summary:

      This is a thought-provoking perspective by Reichmann et al, outlining supportive evidence that Mycobacterium tuberculosis co-evolved with its host Homo sapiens to both increase susceptibility to infection and reduce rates of fatal disease through decreased virulence. TB is an ancient disease where two modes of virulence are likely to have evolved through different stages of human evolution: one before the Neolithic Demographic Transition, where humans lived in sparse hunter-gatherer communities, which likely selected for prolonged Mtb infection with reduced virulence to allow for transmission across sparse populations. Conversely, following the agricultural and industrial revolutions, Mtb virulence is likely to have evolved to attack a higher number of susceptible individuals. These different disease modalities highlight the central idea that there are different immunological routes to TB disease, which converge on a disease phenotype characterized by high bacterial load and destruction of the extracellular matrix. The writing is very clear and provides a lot of supportive evidence from population studies and the recent clinical trials of novel TB vaccines, like M72 and H56. However, there are areas to support the thesis that have been described only in broad strokes, including the impact of host and Mtb genetic heterogeneity on this selection, and the alternative model that there are likely different TB diseases (as opposed to different routes to the same disease), as described by several groups advancing the concept of heterogeneous TB endotypes. I expand on specific points below.

      Strengths:

      The idea that Mtb evolved to both increase transmission (and possible commensalism with humans) with low rates of reactivation is intriguing. The heterogeneous TB phenotypes in the collaborative cross model (PMID: 35112666) support this idea, where some genetic backgrounds can tolerate a high bacterial load with minimal pathology, while others show signs of pathogenesis with low bacterial loads. This supports the idea that the underlying host state, driven by a number of factors like genetics and nutrition, is likely to explain whether someone will co-exist with Mtb without pathology, or progress to disease. I particularly enjoyed the discussion of the protective advantages provided by Mtb infection, which may have rewired the human immune system to provide protection against heterologous pathogens- this is supported by recent studies showing that Mtb infection provides moderate protection against SARS-CoV-2 (PMID: 35325013, and 37720210), and may have applied to other viruses that are likely to have played a more significant role in the past in the natural selection of Homo Sapiens.

      We thank the reviewer for their positive comments, and also for pointing out work that we had previously overlooked. We now discuss and cite the work above as suggested.

      Modeling from Marcel Behr and colleagues (PMID: 31649096) indeed suggests that there are at least two TB clinical phenotypes that likely mirror the two distinct phases of Mtb co-evolution with humans. Most TB disease progression occurs rapidly (within 1-2 years of exposure), and the rest are slow cases of reactivation over time. I enjoyed the discussion of the difference between the types of immune hits needed to progress to disease in the two scenarios, where you may need severe immune hits for rapid progression, a phenotype that likely evolved after the Neolithic transition to larger human populations. On the other hand, a series of milder immune events leading to reactivation after a long period of asymptomatic infection likely mirrors slow progression in the hunter-gatherer communities, to allow for prolonged transmission in sparse populations. Perhaps a clearer analysis of these models would be helpful for the reader.

      We agree that we did not present these concepts in as much detail as we should have, and so we now discuss them further (lines 81–83 and 184–187).

      Weaknesses:

      The discussion of genetic heterogeneity is limited and only discusses evidence from MSMD studies. Genetics is an important angle to consider in the co-evolution of Mtb and humans. There is a large body of literature on both host and Mtb genetic associations with TB disease. The very fact that host variants in one population do not necessarily cross-validate across populations is evidence in support of population-specific adaptations. Specific Mtb lineages are likely to have co-evolved with distinct human populations. A key reference is missing (PMID: 23995134), which shows that different lineages co-evolved with human migrations. Also, meta-analyses of human GWAS studies to define variants associated with TB are very relevant to the topic of co-evolution (e.g., PMID: 38224499). eQTL studies can also highlight genetic variants associated with regulating key immune genes involved in the response to TB. The authors do mention that Mtb itself is relatively clonal with ~2K SNPs marking Mtb variation, much of which has likely evolved under the selection pressure of modern antibiotics. However, some of this limited universe of variants can still explain co-adaptations between distinct Mtb lineages and different human populations, as shown recently in the co-evolution of lineage 2 with a variant common in Peruvians (PMID: 39613754).

      We thank the reviewer for these comments and agree that we failed to cite and discuss the work from Sebastian Gagneux’s group on co-migration, which we now discuss. We include a new paragraph discussing co-evolution as suggested on lines 145–155 and 218–220, citing the work proposed, which we agree enhances the arguments about co-evolution.

      Although the examples of anti-TNF and anti-PD1 treatments are relevant as drivers of TB in limited clinical contexts, the bigger picture is that they highlight major distinct disease endotypes. These restricted examples show that TB can be driven by immune deficiency (as in the case of anti-TNF, HIV, and malnutrition) or hyperactivation (as in the case of anti-PD1 treatment), but there are still certainly many other routes leading to immune suppression or hyperactivation. Considering the idea of hyper-activation as a TB driver, the apparent higher rate of recurrence in the H56 trial referenced in the review is likely due to immune hyperactivation, especially in the context of residual bacteria in the lung. These different TB manifestations (immune suppression vs immune hyperactivation) mirror TB endotypes described by DiNardo et al (PMID: 35169026) from analysis of extensive transcriptomic data, which indicate that it's not merely different routes leading to the same final endpoint of clinical disease, but rather multiple different disease endpoints. A similar scenario is shown in the transcriptomic signatures underlying disease progression in BCG-vaccinated infants, where two distinct clusters mirrored the hyperactivation and immune suppression phenotypes (PMID: 27183822). A discussion of how to think about translating the extensive information from system biology into treatment stratification approaches, or adjunct host-directed therapies, would be helpful.

      We agree with the points made and that the two publications above further enhance the paper. We have added discussion of the different disease endpoints on lines 65–67 and of the evidence regarding immune hyperactivation versus suppression in the vaccination study on lines 162–164, and have expanded on the translational implications on lines 349–352.

      Reviewer #3 (Public review):

      Summary:

      This perspective article by Reichmann et al. highlights the importance of moving beyond the search for a single, unified immune mechanism to explain host-Mtb interactions. Drawing from studies in immune profiling, host and bacterial genetics, the authors emphasize inconsistencies in the literature and argue for broader, more integrative models. Overall, the article is thought-provoking and well-articulated, raising a concept that is worth further exploration in the TB field.

      Strengths:

      Timely and relevant in the context of the rapidly expanding multi-omics datasets that provide unprecedented insights into host-Mtb interactions.

      Weaknesses (Minor):

      Clarity on the notion of a "unified mechanism". It remains unclear whether prior studies explicitly proposed a single unifying immunological model. While inconsistencies in findings exist, they do not necessarily demonstrate that earlier work was uniformly "single-minded". Moreover, heterogeneity in TB has been recognized previously (PMIDs: 19855401, 28736436), which the authors could acknowledge.

      We accept this point and have toned down the language, acknowledging that we are expanding on an argument that others have made, whilst focusing on the implications for the systems immunology era, and cite the previous work as suggested.

      Evolutionary timeline and industrial-era framing. The evolutionary model is outdated. Ancient DNA studies place the Mtb's most recent common ancestor at ~6,000 years BP (PMIDs: 25141181; 25848958). The Industrial Revolution is cited as a driver of TB expansion, but this remains speculative without bacterial-genomics evidence and should be framed as a hypothesis. Additionally, the claim that Mtb genomes have been conserved only since the Industrial Revolution (lines 165-167) is inaccurate; conservation extends back to the MRCA (PMID: 31448322).

      Our understanding is that the evolutionary timeline is not fully resolved, with conflicting evidence proposing different dates. The ancient DNA studies giving a timeline of 6,000 years seem to conflict with the evidence of Mtb infection of humans in the Middle East 10,000 years ago, and with other estimates suggesting 70,000 years. Therefore, we have cited the work above and added a sentence highlighting that different studies propose different timelines. We would propose that the industrial revolution created the ideal societal conditions for the expansion of TB, and this would seem widely accepted in the field, but we have added a proviso as suggested. We did not intend to claim that Mtb genomes have been conserved since the industrial revolution; the point we were making is that, despite rapid expansion within human populations, the genome has still remained conserved. We have therefore revised our discussion of the conservation of Mtb genomes on lines 72–74, 81–83 and 185–190.

      Trained immunity and TB infection. The treatment of trained immunity is incomplete. While BCG vaccination is known to induce trained immunity (ref 59), revaccination does not provide sustained protection (ref 8), and importantly, Mtb infection itself can also impart trained immunity (PMID: 33125891). Including these nuances would strengthen the discussion.

      We have refined this section. We did cite PMID: 33125891 in the original submission but have changed the wording to emphasise the point on line …

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Abstract

      Line 30: What is an immunological route? Suggest

      “...host-pathogen interaction, with diverse immunological processes leading to TB disease (10%) or stable lifelong association or elimination. We suggest these alternate relationships result from the prolonged co-evolution of the pathogen with humans and may even confer a survival advantage in the 90% of exposures that do not progress to disease.”

      Thank you, we have reworded the abstract along the lines suggested above, but not identically to allow for other reviewer comments.

      Introduction

      Ln 43: It is misleading to suggest that the study of TB was the leading influence in establishing the Koch's postulates framework. Many other infections were involved, and Jacob Henle, one of Koch's teachers, is credited with the first clear formulation (see Evans AS. 1976. The Yale Journal of Biology and Medicine. PMID: 782050).

      We have toned down the language, stating that TB “contributed” to the formulation of Koch’s postulates.

      Ln 46: While the review rightly emphasises intracellular infection in macrophages, the importance and abundance of extracellular bacilli should not be ignored, particularly in transmission and in cavities.

      We agree, and have added text on the importance of extracellular bacteria and transmission.

      Ln 56: This is misleading as primary disease prevention is implied, whereas the vaccine was given to individuals presumed to be already infected (TST or IGRA positive). Suggest "...reduces by 50% progression to overt TB disease when given to those with immunological evidence of latent infection."

      Thank you, edit made as suggested.

      Ln 62: Not sure why it is urgent. Suggest "high priority".

      Wording changed as suggested.

      Figure 1 needs clarification. The colour scale appears to signify the strength or vigour of the immune response so that disease is associated with high (orange/red) or low (green/blue) activity. The arrows seem to imply either a sequence or a route map when all we really have is an association with a plausible mechanistic link. They might also be taken to imply a hierarchy that is not appropriate. I'm not sure that the X-rays and arrows add anything, and the rectangle provides the key information on its own. Clarify please.

      We have clarified the figure legend. We feel the X-rays give the clinical context, and so have kept them, and now state in the legend that this is highlighting that there are diverse pathways leading to active disease to try to emphasise the point the figure is illustrating.

      Ln 149-157: I agree that the current dogma is that overt pulmonary disease is required to spread Mtb and fuel disease prevalence. It is vitally important to distinguish the spread of the organism from the occurrence of disease (which does not, of itself, spread). However, both epidemiological (e.g. Ryckman TS, et al. 2022 Proc Natl Acad Sci U S A: 10.1073/pnas.2211045119) and recent mechanistic (Dinkele R, et al. 2024 iScience: 10.1016/j.isci.2024.110731; Patterson B, et al. 2024 Proc Natl Acad Sci U S A: 10.1073/pnas.2314813121; Warner DF, et al. 2025 Nat Rev Microbiol: 10.1038/s41579-025-01201-x) studies indicate the importance of asymptomatic infections, and those associated with sputum positivity have recently been recognised by WHO. I think it will be important to acknowledge the importance of this aspect and consider how immune responses may or may not contribute. I regard the view that Mtb is an obligate pathogen, dependent on overt pTB for transmission, as needing to be reviewed.

      We agree that we did not give sufficient emphasis to the emerging evidence on asymptomatic infections, and that this may play an important part in transmission in high incidence settings. We now include a discussion on this, and citation of the papers above, on lines 168 – 170.

      Ln 159: The terms colonise and colonisation are used, without a clear definition, several times. My view is that both refer to the establishment and replication of an organism on or within a host without associated damage. Where there is associated damage, this is often mediated by immune responses. In this header, I think "establishment in humanity" would be appropriate.

      We agree with this point and have changed the header as suggested, and clarified our meaning when we use the term colonisation, which the reviewer correctly interprets.

      Ln 181-: I strongly support the view that Mtb has contributed to human selection, even to the suggestion that humanity is adapted to maintain a long-term relationship with Mtb.

      Thank you, and we have expanded on this evidence as suggested by other reviewers.

      Ln 189: improved.

      Apologies, typo corrected.

      Figure 2: I was also confused by this. The x-axis does not make sense, as a single property should increase. Moreover, does incidence refer to incidence in individuals with that specific balance of resistance and susceptibility, or contribution to overall global incidence - I suspect the latter (also, prevalence would make more sense). At the same time, the legend implies that those with high resistance to colonisation will be infrequent in the population, suggesting that the Y axis should be labelled "frequency in human population". Finally, I can't see what single label could apply to the X axis. While the implication that the majority of global infections reflect a balance between the resistance and susceptibilities is indicated, a frequency distribution does not seem an appropriate representation.

      The reviewer is correct that the X axis is aiming to represent two variables, which is not logical, and so we have completely changed this figure to a simple one that we hope makes the point clearly and have amended the legend appropriately. We are aiming to highlight the selective pressures of Mtb on the human population over millennia.

      Ln 244: Immunological failure - I agree with the statement but again find the figure (3) unhelpful. Do we start or end in the middle? Is the disease the outside - if so, why are different locations implied? The notion of a maze has some value, but the bacteria should start and finish in the same place by different routes.

      We are attempting to illustrate the concept that escape from host immunological control can occur through different mechanisms. As this comment was just from one reviewer, we have left the figure unchanged but have expanded the legend to try to make the point that this is just a conceptual illustration of multiple routes to disease.

      Ln 262 onward: I broadly agree with the points made about omic technologies, but would wish to see major emphasis on clear phenotyping of cases. There is something of a contradiction in the review between the emphasis on the multiplicity of immunological processes leading ultimately to disease and the recommendation to analyse via omics, which, in their most widely applied format, bundle these complexities into analyses of the humoral and cellular samples available in blood. Admittedly, the authors point out opportunities for 3-dimensional and single-cell analyses, but it is difficult to see where these end without extrapolation ad infinitum.

      We totally agree that clear phenotyping of infection is critical, and expand on this further on lines 307 - 309.

      Reviewer #2 (Recommendations for the authors):

      I suggest expanding on the genetic determinants of Mtb/host co-evolution.

      Thank you, we have now expanded on these sections as suggested.

      Reviewer #3 (Recommendations for the authors):

      We are in an era of exploding large-scale datasets from multi-omics profiling of Mtb and host interactions, offering an unprecedented lens to understand the complexity of the host immune response to Mtb-a pathogen that has infected human populations for thousands of years. The guiding philosophy for how to interpret this tremendous volume of data and what models can be built from it will be critical. In this context, the perspective article by Reichmann et al. raises an interesting concept: to "avoid unified immune mechanisms" when attempting to understand the immunology underpinning host-Mtb interactions. To support their arguments, the authors review studies and provide evidence from immune profiling, host and bacterial genetics, and showcase several inconsistencies. Overall, this perspective article is well articulated, and the concept is worthwhile for further exploration. A few comments for consideration:

      Clarity on the notion of a "unified mechanism". Was there ever a single, clearly proposed unified immunological mechanism? For example, in lines 64-65, the authors criticize that almost all investigations into immune responses to Mtb are based on the premise that a unifying disease mechanism exists. However, after reading the article, it was not clear to me how previous studies attempted to unify the model or what that unifying mechanism was. While inconsistencies in findings certainly exist, they do not necessarily indicate that prior work was guided by a unified framework. I agree that interpreting and exploring data from a broader perspective is valuable, but I am not fully convinced that previous studies were uniformly "single-minded". In fact, the concept of heterogeneity in TB has been previously discussed (e.g., PMIDs: 19855401, 28736436).

      We accept this point, and that we have overstated the argument and not acknowledged previous work sufficiently. We now downplay the language and cite the work as proposed.

      However, we would propose that essentially all published studies imply that single mechanisms underlie the development of disease. The authors are not aware of any manuscript that concludes “Therefore, xxxx pathway is one of several that can lead to TB disease”; instead, they state “Therefore, xxxx pathway leads to TB disease”. The implication of this language is that the mechanism described occurs in all patients, whilst in fact it is likely involved in only a subset. We have toned down the language and expand on this concept on lines 268–270.

      Evolutionary timeline and industrial-era framing. The evolutionary model needs updating. The manuscript cites a "70,000-year" origin for Mtb, but ancient-DNA studies place the most recent common ancestor at ~6,000 years BP (PMIDs: 25141181; 25848958). The Industrial Revolution is invoked multiple times as a driver of TB expansion, yet the magnitude of its contribution remains debated and, to my knowledge, lacks direct bacterial-genomics evidence for causal attribution; this should be framed as a hypothesis rather than a conclusion. In addition, the statement in lines 165-167 is inaccurate: at the genome level, Mtb has remained highly conserved since its most recent common ancestor-not specifically since the Industrial Revolution (PMID: 31448322).

      We accept these points and have made the suggested amendments, as outlined in the public responses. Our understanding is that the evidence about the most recent common ancestor is controversial; if the divergence of human populations occurred concurrently with that of Mtb, then this must have been significantly earlier than 6,000 years ago, and so there are conflicting arguments in this domain.

      Trained immunity and TB infection. The discussion of trained immunity could be expanded. Reference 59 suggests the induction of innate immune training, but reference 8 reports that revaccination does not confer protection against sustained TB infection, indicating that at least "re"-vaccination may not enhance protection. Furthermore, while BCG is often highlighted as a prototypical inducer of trained immunity, real-world infection occurs through Mtb itself. Importantly, a later study demonstrated that Mtb infection can also impart trained immunity (PMID: 33125891). Integrating these findings would provide a more nuanced view of how both vaccination and infection shape innate immune training in the TB context.

      We thank the reviewer for these suggestions and have edited the relevant section to include these studies.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public reviews:

      Reviewer #1 (Public review):

      In this important study, the authors characterized the transformation of neural representations of olfactory stimuli from the primary sensory cortex to multisensory regions in the medial temporal lobe and investigated how they were affected by non-associative learning. The authors used high-density silicon probe recordings from five different cortical regions while familiar vs. novel odors were presented to a head-restrained mouse. This is a timely study because unlike other sensory systems (e.g., vision), the progressive transformation of olfactory information is still poorly understood. The authors report that both odor identity and experience are encoded by all of these five cortical areas but nonetheless some themes emerge. Single neuron tuning of odor identity is broad in the sensory cortices but becomes narrowly tuned in hippocampal regions. Furthermore, while experience affects neuronal response magnitudes in early sensory cortices, it changes the proportion of active neurons in hippocampal regions. Thus, this study is an important step forward in the ongoing quest to understand how olfactory information is progressively transformed along the olfactory pathway.

      The study is well-executed. The direct comparison of neuronal representations from five different brain regions is impressive. Conclusions are based on single neuronal level as well as population level decoding analyses. Among all the reported results, one stands out for being remarkably robust. The authors show that the anterior olfactory nucleus (AON), which receives direct input from the olfactory bulb output neurons, was far superior at decoding odor identity as well as novelty compared to all the other brain regions. This is perhaps surprising because the other primary sensory region - the piriform cortex - has been thought to be the canonical site for representing odor identity. A vast majority of studies have focused on aPCx, but direct comparisons between odor coding in the AON and aPCx are rare. The experimental design of this current study allowed the authors to do so and the AON was found to convincingly outperform aPCx. Although this result goes against the canonical model, it is consistent with a few recent studies including one that predicted this outcome based on anatomical and functional comparisons between the AON-projecting tufted cells vs. the aPCx-projecting mitral cells in the olfactory bulb (Chae, Banerjee et. al. 2022). Future experiments are needed to probe the circuit mechanisms that generate this important difference between the two primary olfactory cortices as well as their potential causal roles in odor identification.

      The authors were also interested in how familiarity vs. novelty affects neuronal representation across all these brain regions. One weakness of this study is that neuronal responses were not measured during the process of habituation. Neuronal responses were measured after four days of daily exposure to a few odors (familiar) and then some other novel odors were introduced. This creates a confound because the novel vs. familiar stimuli are different odorants and that itself can lead to drastic differences in evoked neural responses. Although the authors try to rule out this confound by doing a clever decoding and Euclidean distance analysis, an alternate, more straightforward strategy would have been to measure neuronal activity for each odorant during the process of habituation.

      Reviewer #2 (Public review):

      This manuscript investigates how olfactory representations are transformed along the cortico-hippocampal pathway in mice during a non-associative learning paradigm involving novel and familiar odors. By recording single-unit activity in several key brain regions (AON, aPCx, LEC, CA1, and SUB), the authors aim to elucidate how stimulus identity and experience are encoded and how these representations change across the pathway.

      The study addresses an important question in sensory neuroscience regarding the interplay between sensory processing and signaling novelty/familiarity. It provides insights into how the brain processes and retains sensory experiences, suggesting that the earlier stations in the olfactory pathway, the AON and aPCx, play a central role in detecting novelty and encoding odor, while areas deeper into the pathway (LEC, CA1 & Sub) are sparser and encode odor identity but not novelty/familiarity. However, there are several concerns related to methodology, data interpretation, and the strength of the conclusions drawn.

      Strengths:

      The authors combine the use of modern tools to obtain high-density recordings from large populations of neurons at different stages of the olfactory system (although mostly one region at a time) with elegant data analyses to study an important and interesting question.

      Weaknesses:

      (1) The first and biggest problem I have with this paper is that it is very confusing, and the results seem to be all over the place. In some parts, it seems like the AON and aPCx are more sensitive to novelty; in others, it seems the other way around. I find their metrics confusing and unconvincing. For example, the example cells in Figure 1C show an AON neuron with a very low spontaneous firing rate and a CA1 neuron with a much higher firing rate, but the opposite is true in Figure 2A. So, what are we to make of Figure 2C, which shows the difference in firing rates between novel vs. familiar odors measured as a difference in spikes/sec? This seems nearly meaningless. The authors could have used a difference in Z-scored responses to normalize different baseline activity levels. (This is just one example of a problem with the methodology.)

      We appreciate the reviewer’s concerns regarding clarity and methodology. It is less clear why all neurons in a given brain area should have similar firing rates. Anatomically defined brain areas typically comprise multiple cell types, which can have diverse baseline firing rates. Since we computed absolute firing rate differences per neuron (i.e., novel vs. familiar odor responses within the same neuron), baseline differences across neurons do not have a major impact.

      The suggestion to use Z-scores instead of absolute firing rate differences is well taken. However, Z-scoring assumes that the underlying data are normally distributed, which is not the case in our dataset. Specifically, when analyzing odor-evoked firing rates on a per-neuron basis, only 4% of neurons exhibit a normal distribution. In cases of skewed distributions, Z-scoring can distort the data by exaggerating small variations, leading to misleading conclusions. While we acknowledge that different analysis methods exist, we believe that our chosen approach best reflects the properties of the dataset and avoids potential misinterpretations introduced by inappropriate normalization techniques.
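
      The two metrics under discussion can be compared on toy data. The following is a minimal sketch, not the authors' analysis code: the function names and the example spike rates are hypothetical, and it assumes the Z-scored difference is obtained by dividing the novel-minus-familiar difference by the standard deviation of a baseline period.

```python
from statistics import mean, stdev


def absolute_difference(novel_rates, familiar_rates):
    """Mean novel-minus-familiar firing rate difference, in spikes/s."""
    return mean(novel_rates) - mean(familiar_rates)


def zscored_difference(novel_rates, familiar_rates, baseline_rates):
    """The same difference, expressed in units of the baseline standard deviation.

    Z-scoring both responses against one baseline and subtracting them
    cancels the baseline mean, leaving (novel - familiar) / baseline SD.
    """
    sd = stdev(baseline_rates)
    if sd == 0:
        raise ValueError("baseline has zero variance; Z-score is undefined")
    return absolute_difference(novel_rates, familiar_rates) / sd


# Hypothetical neurons with the same 1 spike/s novelty effect
# but very different baseline variability.
sparse_baseline = [0, 0, 0, 1, 0, 0, 0, 0]
dense_baseline = [18, 22, 20, 25, 15, 20, 19, 21]

sparse_abs = absolute_difference([2, 3, 2], [1, 2, 1])
dense_abs = absolute_difference([22, 23, 22], [21, 22, 21])
sparse_z = zscored_difference([2, 3, 2], [1, 2, 1], sparse_baseline)
dense_z = zscored_difference([22, 23, 22], [21, 22, 21], dense_baseline)
```

      In this toy case the absolute differences are identical, whereas the Z-scored difference is several times larger for the sparsely firing neuron. Whether that rescaling is a useful normalization or a distortion is exactly the point on which the reviewer and the authors differ.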

      (2) There are a lot of high-level data analyses (e.g., decoding, analyzing decoding errors, calculating mutual information, calculating distances in state space, etc.) but very little neural data (except for Figure 2C, and see my comment above about how this is flawed). So, if responses to novel vs. familiar odors are different in the AON and aPCx, how are they different? Why is decoding accuracy better for novel odors in CA1 but better for familiar odors in SUB (Figure 3A)? The authors identify a small subset of neurons that have unusually high weights in the SVM analyses that contribute to decoding novelty, but they don't tell us which neurons these are and how they are responding differently to novel vs. familiar odors.

      We performed additional analyses to address the reviewer’s feedback (Figures 2C-E and lines 118-132) and added more single-neuron data (Figures 1, S3 and S4).

      (3) The authors call AON and aPCx "primary sensory cortices" and LEC, CA1, and Sub "multisensory areas". This is a straw man argument. For example, we now know that PCx encodes multimodal signals (Poo et al. 2021, Federman et al., 2024; Kehl et al., 2024), and LEC receives direct OB inputs, which has traditionally been the criterion for being considered a "primary olfactory cortical area". So, this terminology is outdated and wrong, and although it suits the authors' needs here in drawing distinctions, it is simplistic and not helpful moving forward.

      We appreciate the reviewer’s concern regarding the classification of brain regions as “primary sensory” versus “multisensory.” Of note, the cited studies (Poo et al., 2021; Federman et al., 2024; Kehl et al., 2024) focus on posterior PCx (pPCx), while our recordings were conducted in a very anterior section of anterior PCx. The aPCx and pPCx have distinct patterns of connectivity, both anatomically and functionally. To the best of our knowledge, there is no evidence for multimodal responses in aPCx, whereas there is for LEC, CA1 and SUB. Furthermore, our distinction is not based on a connectivity argument, as the reviewer suggests, but on differences in the α-Poisson ratio (Figure 1E and F).

      To avoid confusion due to definitions of what constitutes a “primary sensory” region, we adopted a more neutral description throughout the manuscript.

      (4) Why not simply report z-scored firing rates for all neurons as a function of trial number? (e.g., Jacobson & Friedrich, 2018). Figure 2C is not sufficient.

      Regarding z-scores, please see our response to (1). We further added a figure showing the responses of all neurons to novel stimuli, using ROC instead of z-scoring, as described previously (e.g. Cohen et al., Nature 2012). We added this figure to the Supplementary Material for completeness (S2E).
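
      For readers unfamiliar with ROC-based response metrics, the idea can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: it assumes the metric compares trial-wise spike counts in a response window against counts in a baseline window, and the function name `auroc` is hypothetical.

```python
def auroc(response_counts, baseline_counts):
    """Area under the ROC curve for response vs. baseline spike counts.

    Equals the probability that a randomly drawn response count exceeds a
    randomly drawn baseline count, with ties counted as one half.
    0.5 means no difference; values towards 1 indicate excitation and
    values towards 0 indicate suppression.
    """
    if not response_counts or not baseline_counts:
        raise ValueError("both samples must be non-empty")
    wins = 0.0
    for r in response_counts:
        for b in baseline_counts:
            if r > b:
                wins += 1.0
            elif r == b:
                wins += 0.5
    return wins / (len(response_counts) * len(baseline_counts))
```

      Because the measure depends only on the rank order of the counts, it makes no assumption about the shape of the firing rate distribution, which is presumably why it is attractive when the data are not normally distributed.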

      For example, in the Discussion, they say, "novel stimuli caused larger increases in firing rates than familiar stimuli" (L. 270), but what does this mean?

      This means that, on average, the population of neurons exhibits higher firing rates in response to novel odors than to familiar ones.

      Odors typically increase the firing in some neurons and suppress firing in others. Where does the delta come from? Is this because novel odors more strongly activate neurons that increase their firing or because familiar odors more strongly suppress neurons?

      We thank the reviewer for this valuable feedback. We extended the characterization of firing rate properties, including a separate analysis of neurons i) significantly excited by odorants, ii) significantly inhibited by odorants and iii) not responsive to odorants. We added the analysis and corresponding discussion to the main manuscript (Figures 2C-E and lines 118-132).

      (5) Lines 122-124 - If cells in AON and aPCx responded the same way to novel and familiar odors, then we would say that they only encode for odor and not at all for experience. So, I don't understand why the authors say these areas code for a "mixed representation of chemical identity and experience." "On the other hand," if LEC, CA1, and SUB are odor selective and only encode novel odors, then these areas, not AON and aPCx, are the ones jointly encoding chemical identity and experience. Also, I do not understand why, here, they say that AON and PCx respond to both while LEC, CA1, and SUB were selective for novel stimuli, but the authors then go on to argue that novelty is encoded in the AON and PCx, but not in the LEC, CA1, and SUB.

      We appreciate the reviewer’s request for clarification. Odorant identity and experience can be decoded in all of the brain areas we studied. However, the way this information is represented differs between regions. We acknowledge that “mixed” representation is a misleading term and have removed it from the manuscript.

      In AON and aPCx, neurons significantly respond to both novel and familiar odors. However, the magnitude of their responses to novel and familiar odors is sufficiently distinct to allow for decoding of odor experience (i.e., whether an odor is novel or familiar). Moreover, novelty engages more neurons in encoding the stimulus (Figure 2D). In neural space, the position of an odor’s representation in AON and aPCx shifts depending on whether it is novel or familiar, meaning that experience modifies the neural representation of odor identity. This suggests that in these regions the two representations are intertwined.

      In contrast, some neurons in LEC, CA1, and SUB exhibit responses to novel odors, but few neurons respond to familiar odors at all. This suggests a more selective encoding of novelty.

      (6) Lines 132-140 - As presented in the text and the figure, this section is poorly written and confusing. Their use of the word "shuffled" is a major source of this confusion, because this typically is the control that produces outcomes at the chance level. More importantly, they did the wrong analysis here. The better and, I think, the only way to do this analysis correctly is to train on some of the odors and test on an untrained odor (i.e., what Bernardi et al., 2021 called "cross-condition generalization performance"; CCGP).

      We appreciate the feedback and thank the reviewer for the recommendation to implement cross-condition generalization performance (CCGP) as used in Bernardi et al., 2020. We acknowledge that the term "shuffled" may have caused confusion, as it typically refers to control analyses producing chance-level outcomes. In our case, we shuffled the identity of novel and familiar odors to assess how much the decoder relies on odor identity when distinguishing novelty. This test provided insight into whether novelty-based structure exists within neural activity beyond random grouping, but it does not directly assess generalization.

      As suggested, we used CCGP to measure how well novelty-related representations generalize across different odors. Our findings show that in AON and aPCx, novelty-related information is indeed highly generalizable, supporting the idea that these regions encode novelty in a less odor-selective manner (Figure 2K).
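
      The CCGP logic described above (train a novelty decoder on some odors, test it on odors it has never seen) can be sketched as follows. This is a minimal illustration on synthetic data, not the analysis used in the manuscript: the 5 novel / 5 familiar split, the population size, the additive novelty offset, and the nearest-class-mean linear decoder (used here in place of an SVM) are all assumptions made for the example.

```python
import random

random.seed(0)
N_NEURONS, N_TRIALS = 20, 10

# Hypothetical odor set: 5 novel (label 1) and 5 familiar (label 0) odors.
LABELS = {f"novel_{i}": 1 for i in range(5)}
LABELS.update({f"familiar_{i}": 0 for i in range(5)})


def simulate(labels):
    """Synthetic responses: odor-specific pattern + shared novelty offset + trial noise."""
    data = {}
    for odor, is_novel in labels.items():
        pattern = [random.uniform(-0.25, 0.25) for _ in range(N_NEURONS)]
        data[odor] = [
            [p + 3.0 * is_novel + random.uniform(-0.25, 0.25) for p in pattern]
            for _ in range(N_TRIALS)
        ]
    return data


def mean_vector(trials):
    """Average population vector across trials."""
    return [sum(col) / len(trials) for col in zip(*trials)]


def ccgp(data, labels):
    """Hold out one novel and one familiar odor, train on the rest, test on the held-out pair."""
    novel = [o for o in data if labels[o] == 1]
    familiar = [o for o in data if labels[o] == 0]
    accuracies = []
    for held_n in novel:
        for held_f in familiar:
            train_n = [t for o in novel if o != held_n for t in data[o]]
            train_f = [t for o in familiar if o != held_f for t in data[o]]
            mu_n, mu_f = mean_vector(train_n), mean_vector(train_f)
            # Linear decoder: project onto the difference of class means.
            w = [a - b for a, b in zip(mu_n, mu_f)]
            mid = [(a + b) / 2 for a, b in zip(mu_n, mu_f)]
            correct = total = 0
            for odor, label in ((held_n, 1), (held_f, 0)):
                for trial in data[odor]:
                    score = sum(wi * (x - m) for wi, x, m in zip(w, trial, mid))
                    correct += int((score > 0) == bool(label))
                    total += 1
            accuracies.append(correct / total)
    return sum(accuracies) / len(accuracies)


responses = simulate(LABELS)
ccgp_value = ccgp(responses, LABELS)
print(f"CCGP = {ccgp_value:.2f}")
```

      Because the synthetic novelty offset is shared across odors and large relative to the noise, the decoder generalizes to the held-out odors; an odor-specific novelty code would instead pull this value toward the 0.5 chance level.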

      Reviewer #3 (Public review):

      In this manuscript, the authors investigate how odor-evoked neural activity is modulated by experience within the olfactory-hippocampal network. The authors perform extracellular recordings in the anterior olfactory nucleus (AON), the anterior piriform (aPCx) and lateral entorhinal cortex (LEC), the hippocampus (CA1), and the subiculum (SUB), in naïve mice and in mice repeatedly exposed to the same odorants. They determine the response properties of individual neurons and use population decoding analyses to assess the effect of experience on odor information coding across these regions.

      The authors' findings show that odor identity is represented in all recorded areas, but that the response magnitude and selectivity of neurons are differentially modulated by experience across the olfactory-hippocampal pathway.

      Overall, this work represents a valuable multi-region data set of odor-evoked neural activity. However, limitations in the interpretability of odor experience of the behavioral paradigm, and limitations in experimental design and analysis, restrict the conclusions that can be drawn from this study.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Some suggestions, in no particular order, to further improve the manuscript:

      (1) The example neuronal responses for CA1 and SUB in Figure 1 are not very inspiring. To my eyes, the odor period response is not that different from the baseline period. In general, a thorough characterization of firing rate properties during the odor period between the different brain regions would be informative.

      We thank the reviewer for this valuable feedback. We have replaced the example neurons from CA1 and SUB in Figure 1C. We further extended the characterization of firing rate properties, including a separate analysis of neurons i) significantly excited by odorants, ii) significantly inhibited by odorants and iii) not responsive to odorants. We added the analysis and corresponding discussion to the main manuscript (Figures 2C-E and lines 118-132).

      (2) For the summary in Figure 1, why not show neuronal responses as z-scored firing rates as opposed to auROC?

      We chose to use auROC instead of z-scored firing rates due to the non-normality of the dataset, which can distort results when using z-scores. Specifically, z-scoring can exaggerate small deviations in neurons with low responsiveness, potentially leading to misleading conclusions. auROC provides a more robust measure of response change that is less sensitive to these distortions because it does not assume any specific distribution. This approach has been used previously (e.g. Cohen et al. 2012, Nature).
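
      For readers unfamiliar with the measure, the following is a minimal sketch of how an auROC between baseline and stimulus-window spike counts can be computed. The spike counts are invented for illustration, and the function is a generic rank-based implementation rather than the code used in the study.

```python
def auroc(baseline, stimulus):
    """Area under the ROC curve for stimulus vs. baseline spike counts.

    Equals the probability that a randomly drawn stimulus-window count
    exceeds a randomly drawn baseline count, with ties counted as 0.5.
    It makes no assumption about the shape of either distribution.
    """
    wins = 0.0
    for s in stimulus:
        for b in baseline:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5
    return wins / (len(stimulus) * len(baseline))


# Hypothetical spike counts over 10 trials (not taken from the dataset).
baseline = [2, 3, 1, 2, 4, 2, 3, 1, 2, 3]
excited = [5, 6, 4, 7, 5, 6, 8, 5, 4, 6]
inhibited = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]

print(auroc(baseline, excited))    # 0.99: counts almost always above baseline
print(auroc(baseline, baseline))   # 0.5: indistinguishable from baseline
print(auroc(baseline, inhibited))  # 0.04: counts almost always below baseline
```

      Values near 1 indicate excitation, values near 0 indicate inhibition, and 0.5 indicates no discriminable change. The index is bounded and quantifies discriminability rather than response amplitude.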

      (3) To study novelty, the authors presented odorants that were not used during four days of habituation. But this design makes it hard to dissociate odor identity from novelty. Why not track the response of the same odorants during the habituation process itself?

      We respectfully disagree with the argument that using different stimuli as novel and familiar constitutes a confound in our analysis. In our study, we used multiple different, structurally dissimilar single-molecule chemicals, which were randomly assigned to the novel and familiar categories in each animal. If individual stimuli did cause “drastic differences in evoked neural responses”, these would be evenly distributed between novel and familiar stimuli. It is therefore extremely unlikely that the clear differences we observed between novel and familiar conditions and between brain areas can be attributed to the contribution of individual stimuli, in particular given that our analyses were performed at the population level. In fact, we observed that responses between novel and familiar conditions were qualitatively very similar in the short time window after odor onset (Figure 1G and H).

      Importantly, the goal of this study was to investigate the impact of long-term habituation over more than 4 days, rather than short term habituation during one behavioral session. However, tracking the activity of large numbers of neurons across multiple days presents a significant technical challenge, due to the difficulty of identifying stable single-unit recordings over extended periods of time with sufficient certainty. Tools that facilitate tracking have recently been developed (e.g. Yuan AX et al., Elife. 2024) and it will be interesting to apply them to our dataset in the future.

      (4) Since novel odors lead to greater sniffing and sniffing strongly influences firing rates in the olfactory system, the authors decided to focus on a 400 ms window with similar sniffing rates for both novel vs. familiar odors. Although I understand the rationale for this choice, I worry that this is too restrictive, and it may not capture the full extent of the phenomenology.

      Could the authors model the effect of sniffing on firing rates of individual neurons from the data, and then check whether the odor response for novel context can be fully explained just by increased sniffing or not?

      It is an interesting suggestion to extend the window of analysis and observe how responses evolve with sniffing (and other behavioral reactions). To address this, we added an additional figure to the supplementary material, showing the mean responses of all neurons to novel stimuli during the entire odor presentation window (Fig. S1B).

      As suggested, we further created a Generalized Linear Model (GLM) for the entire 2s odor stimulation period, incorporating sniffing and novelty as independent variables. As expected, sniffing had a dominant impact on firing rate in all brain areas. A smaller proportion of neurons was modulated by novelty or by the interaction between novelty x breathing, suggesting the entrainment of neural activity by sniffing during the response to novel odors. These results support our decision to focus the analysis on the early 400ms window in order to dissociate the effects of novelty and behavioral responses. Taken together, our results suggest that odorant responses are modulated by novelty early during odorant processing, whereas at later stages sniffing becomes the predominant factor driving firing (Figure S2C-D).
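
      A model of this kind can be sketched as below. The response does not specify the distribution or link function, so the Poisson GLM with a log link, the predictor coding (centered sniff rate, binary novelty), and the coefficient values are all assumptions for illustration. The synthetic counts are noiseless expected values, so the fit should recover the generating coefficients.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def linear_predictor(row, beta):
    return sum(b * x for b, x in zip(beta, row))


def log_likelihood(X, y, beta):
    """Poisson log-likelihood up to an additive constant."""
    return sum(yi * linear_predictor(r, beta) - math.exp(linear_predictor(r, beta))
               for r, yi in zip(X, y))


def fit_poisson_glm(X, y, n_iter=50, tol=1e-12):
    """Newton's method with step halving; column 0 of X must be the intercept."""
    p = len(X[0])
    beta = [math.log(sum(y) / len(y))] + [0.0] * (p - 1)
    for _ in range(n_iter):
        mu = [math.exp(linear_predictor(r, beta)) for r in X]
        grad = [sum((yi - mi) * r[j] for r, yi, mi in zip(X, y, mu)) for j in range(p)]
        hess = [[sum(mi * r[j] * r[k] for r, mi in zip(X, mu)) for k in range(p)]
                for j in range(p)]
        step = solve(hess, grad)
        base, t = log_likelihood(X, y, beta), 1.0
        while t > 1e-8 and log_likelihood(X, y, [b + t * s for b, s in zip(beta, step)]) < base:
            t /= 2
        beta = [b + t * s for b, s in zip(beta, step)]
        if max(abs(t * s) for s in step) < tol:
            break
    return beta


# Hypothetical coefficients: intercept, sniffing, novelty, novelty x sniffing.
TRUE_BETA = [math.log(5.0), 0.3, 0.2, 0.1]
X, y = [], []
for sniff in (-1.0, 0.0, 1.0):      # centered sniff rate (arbitrary units)
    for novel in (0.0, 1.0):        # 0 = familiar, 1 = novel
        row = [1.0, sniff, novel, sniff * novel]
        X.append(row)
        y.append(math.exp(linear_predictor(row, TRUE_BETA)))

beta_hat = fit_poisson_glm(X, y)
print([round(b, 4) for b in beta_hat])
```

      In a real analysis each neuron would be fit on trial-wise spike counts, and the significance of the novelty and interaction terms would be assessed per neuron (for example with a likelihood-ratio test against a sniffing-only model).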

      (5) The authors conclude that aPCx has a subset of neurons dedicated to familiar odors based on the distribution of SVM weights in Figure 3D. To me, this is the weakest conclusion of the paper because although significant, the effect size is paltry; the central tendencies are hardly different for the two conditions in aPCx. Could the authors show the PSTHs of some of these neurons to make this point more convincing?

      We appreciate the reviewer’s concern regarding the effect size. To strengthen our conclusion, we now include PSTHs of representative neurons from the bottom 10% and top 10% of the neuronal population, ranked by the SVM analysis (Figures S3 and S4). We hope this provides more clarity and support for the interpretation that there is a subset of neurons in aPCx that shows greater sensitivity to familiar odors, despite the relatively modest differences in central tendency.

      In the revised manuscript, we discuss the effect size more explicitly in the text to provide context for its significance (lines 193 - 195).

      Reviewer #2 (Recommendations for the authors):

      (1) The authors only talk about "responsive" neurons. Does this include neurons whose activity increases significantly (activated) and neurons whose activity decreases (suppressed)?

      Yes, the term "responsive" refers to neurons whose activity either increases significantly (excited) or decreases (inhibited) in response to the odor stimuli. We performed additional analyses to characterize responses separately for the different groups (Figure 2C-E and lines 118-132).

      (2) Line 54 - The Schoonover paper doesn't show that cells lose their responses to odors, but rather that the population of cells that respond to odors changes with time. That is, population responses don't become more sparse.

      The fact that “the population of cells that respond to odors changes with time” implies that some neurons lose their responsiveness (e.g. unit 2 in Figure 1 of Schoonover et al., 2021), while others become responsive (e.g. unit 1 in Figure 1 of Schoonover et al., 2021). Frequent responses reduce the drift rate (Figure 4 of Schoonover et al., 2021), and thus fewer neurons lose or gain responsiveness. We have revised the manuscript to clarify this.

      (3) Line 104 - "Recurrent" is incorrectly used here. I think the authors mean "repeated" or something more like that.

      Thank you for pointing this out. We replaced "recurrent" with "repeated".

      (4) Figure 3D - What is the scale bar here?

      We apologize for the accidental omission. The scale bar has been added to Figure 3D in the revised version of the manuscript.

      (5) Line 377 - They say they lowered their electrodes to "200 um/s per second." This must be incorrect. Is this just a typo, or is it really 200 um/s, because that's really fast?

      Thank you for pointing this out. The correct speed was 20 to 60 um/s; the change has been made in the manuscript.

      (6) Line 431: The authors say they used auROC to calculate changes in firing rates (which I think is only shown in Figure 1D). Note that auROC measures the discriminability of two distributions, not the strength or change in the strength of response.

      Indeed, we used auROC to measure the discriminability of firing between the baseline and the stimulus response period. We have corrected the wording in the methods.

      (7) Figure 1B: The anatomical locations of the five areas they recorded from are straightforward, and this figure is not hugely helpful. However, the reader would benefit tremendously by including an experimental schematic. As is, we needed to scour the text and methods sections to understand exactly what they did when.

      We thank the reviewer for this suggestion. We included an experimental schematic in the supplementary material.

      (8) Figure 1F (left): This plot is much less useful without showing a pre-odor window, even if only times after the odor onset were used for calculating alpha.

      We appreciate this concern; however, the goal of Figure 1F is to illustrate the meaning of the alpha value itself. We chose not to include a pre-odor window comparison to avoid confusing the reader.

      (9) Figure 2A: What are the bar plots above the raster plots? Are these firing rates? Are the bars overlaid or stacked? Where is the y-axis scale bar?

      The bar plots above the raster plots represent a histogram of the spike count/trials over time, with a bin width of 50 ms. These bars are overlaid on the raster plot. We will include a y-axis scale bar in the revised figure to clarify the presentation.
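
      As an illustration of what such a histogram represents, here is a minimal sketch of a trial-averaged spike count histogram with 50 ms bins. The spike times and the analysis window are invented for the example, and times are given in integer milliseconds relative to odor onset to avoid floating-point bin-edge issues.

```python
def psth(spike_times_per_trial, t_start_ms, t_stop_ms, bin_ms=50):
    """Mean spike count per trial in consecutive bins of width bin_ms."""
    n_bins = (t_stop_ms - t_start_ms) // bin_ms
    counts = [0] * n_bins
    for trial in spike_times_per_trial:
        for t in trial:
            idx = (t - t_start_ms) // bin_ms
            if 0 <= idx < n_bins:  # ignore spikes outside the window
                counts[idx] += 1
    n_trials = len(spike_times_per_trial)
    return [c / n_trials for c in counts]


# Hypothetical spike times (ms relative to odor onset) for three trials.
trials = [
    [-120, 10, 30, 60],
    [20, 40, 110],
    [-30, 15, 70, 130],
]

histogram = psth(trials, t_start_ms=-200, t_stop_ms=200)
print(histogram)  # 8 bins covering -200 ms to +200 ms
```

      Dividing each bin by the bin width in seconds (0.05 s here) would convert the values to firing rates in spikes/s.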

      (10) Figure 4G: This makes no sense. First, the Y axis is supposed to measure standard deviation, but the axis label is spikes/s. Second, if responses in the AON are much less reliable than responses in "deeper" areas, why is odor decoding in AON so much better than in the other areas?

      We acknowledge the error in the axis label, and we will correct it to indicate the correct units. AON has larger response variability but also larger response magnitudes, which can explain the higher decoding accuracy.

      (11) From the model and text, one predicts that the lifetime sparseness increases along the pathway. The authors should use this metric as well/instead of "odor selectivity" because of problems with arbitrary thresholding.

      We acknowledge that lifetime sparseness, often computed using lifetime kurtosis, can be an informative measure of selectivity. However, we believe it has limitations that make it less suitable for our analysis. One key issue is that lifetime sparseness does not account for the stability of responses across multiple presentations of the same stimulus. In contrast, our odor selectivity measure incorporates trial-to-trial variability by considering responses over 10 trials and assessing significance using a Wilcoxon test compared to baseline. While the choice of a p-value threshold (e.g., 0.05) is somewhat arbitrary, it is a widely accepted statistical convention. Additionally, lifetime sparseness does not account for excitatory and inhibitory responses. For example, if a neuron X is strongly inhibited by odor A, strongly excited by odor B, and unresponsive to odors C and D, lifetime sparseness would classify it as highly selective for odor B, without capturing its inhibitory selectivity for odor A. The lifetime sparseness will be higher than if X was simply unresponsive for A.

      Our odor selectivity measure addresses this by considering both excitation and inhibition as potential responses. Thus, while lifetime sparseness could provide a useful complementary perspective in another type of dataset, it does not fully capture the dynamics of odor selectivity here.
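
      The neuron X example can be made concrete with a small calculation. The sketch below uses the normalized lifetime sparseness index in the Vinje and Gallant form applied to firing rates, which is an assumption: the kurtosis-based variant mentioned above may behave differently on the same numbers. The firing rates (5 Hz baseline, 0 Hz when inhibited, 20 Hz when excited) are hypothetical.

```python
def lifetime_sparseness(rates):
    """Normalized lifetime sparseness of one neuron across stimuli.

    S = (1 - mean(r)^2 / mean(r^2)) / (1 - 1/n), for non-negative rates.
    S = 0 when all stimuli evoke the same rate; S = 1 when only one
    stimulus evokes any firing.
    """
    n = len(rates)
    mean = sum(rates) / n
    mean_sq = sum(r * r for r in rates) / n
    if mean_sq == 0:
        return 0.0
    return (1 - mean * mean / mean_sq) / (1 - 1 / n)


# Hypothetical firing rates (Hz) to odors A, B, C, D; baseline is 5 Hz.
inhibited_by_a = [0.0, 20.0, 5.0, 5.0]    # A inhibits, B excites
unresponsive_to_a = [5.0, 20.0, 5.0, 5.0]  # only B excites

s_inhibited = lifetime_sparseness(inhibited_by_a)
s_unresponsive = lifetime_sparseness(unresponsive_to_a)
print(round(s_inhibited, 3), round(s_unresponsive, 3))
```

      Under these assumptions the index is higher when odor A inhibits the neuron than when the neuron ignores odor A, yet the single number does not indicate that a second, inhibitory response exists.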

      Author response 1.

      Lifetime Kurtosis distribution per region.

      Reviewer #3 (Recommendations for the authors):

      Main points:

      (1) The authors use a non-associative learning paradigm - repeated odor exposure - to test how experience modulates odor responses along the olfactory-hippocampal pathway. While repeated odor exposure clearly modulates odor-evoked neural activity, the relevance of this modulation and its differential effect across different brain areas are difficult to assess in the absence of any behavioral read-outs.

      Our experimental paradigm involves a robust, reliable behavioral readout of non-associative learning. Novel olfactory stimuli evoke a well-characterized orienting reaction, which includes a multitude of physiological reactions, including exploratory sniffing, facial movements and pupil dilation (Modirshanechi et al., Trends Neuroscience 2023). In our study, we focused on exploratory sniffing.

      Compared to associative learning, non-associative learning might have received less attention. However, it is critically important because it forms the foundation for how organisms adapt to their environment through experience without forming associations. This is highlighted by the fact that non-instrumental stimuli can be remembered in large numbers (Standing, 1973) and with remarkable detail (Brady et al., 2008). While non-associative learning can thus create a vast, implicit memory of stimuli in the environment, it is unclear how stimulus representations reflect this memory. Our study contributes to answering this question. We describe the impact of experience on olfactory sensory representations and reveal a transformation of representations from olfactory cortical to hippocampal structures. Our findings also indicate that sensory responses to familiar stimuli persist within sensory cortical and hippocampal regions, even after spontaneous orienting behaviors have habituated. Further studies involving experimental manipulation techniques are needed to elucidate the causal mechanisms underlying the formation of stimulus memory during non-associative learning.

      (2) The authors discuss the olfactory-hippocampal pathway as a transition from primary sensory (AON, aPCx) to associative areas (LEC, CA1, SUB). While this is reasonable, given the known circuit connectivity, other interpretations are possible. For example, AON, aPCx, and LEC receive direct inputs from the olfactory bulb ('primary cortex'), while CA1 and SUB do not; AON receives direct top-down inputs from CA1 ('associative cortex'), while aPCx does not. In fact, the data presented in this manuscript does not appear to support a consistent, smooth transformation from sensory to associative, as implied by the authors (e.g. Figure 4A, F, and G).

      Thank you for this insightful comment. Indeed, there are complexities in the circuitry, and the relationships between different areas are not linear. We believe that AON and aPCx are distinctly different from LEC, CA1 and SUB, as the latter areas have been shown to integrate multimodal sensory information. To avoid confusion due to definitions of what constitutes a “primary sensory” region, we adopted a more neutral description throughout the manuscript. We also removed the term “gradual” to describe the transition of neural representations from olfactory cortical to hippocampal areas.

      (3) The analysis of odor-evoked responses is focused on a 400 ms window to exclude differences in sniffing behavior. This window spans 200 ms before and after the first inhalation after odor onset. Inhalation onset initiates neural odor responses - why do the authors include neural data before inhalation onset?

      The reason to include a brief time window prior to odor onset is to account for what are often called “partial” sniffs. In our experimental setup, odor delivery is not triggered by the animal’s inhalation. Therefore, it can happen that an animal has just begun to inhale when the stimulus is delivered. In this case, the animal is exposed to odorant molecules prior to the first complete inhalation after odor onset. We acknowledge that this limits the temporal resolution of our measurements, but it does not affect the comparison of sensory representations between different brain areas.

      It would also be interesting to explore the effect of sniffing behavior (see point 2) on odor-evoked neural activity.

      Thank you for your comment. We performed an additional analysis, including a GLM, to address this question (Figure S2C-D).

      Minor points:

      (4) Figure 2A represents raster plots for 2 neurons per area - it is unclear how to distinguish between the 2 neurons in the plots.

      Figure 2A shows one example neuron per brain area. Each neuron has two raster plots, which show responses to either a novel (orange) or a familiar (blue) stimulus. We have revised the figure caption for clarity.

      (5) Overall, axes should be kept consistent and labeled in more detail. For example, Figure 2H and I are difficult to compare, given that the y-axis changes and that decoding accuracies are difficult to estimate without additional marks on the y-axis.

      The axes are indeed different, because the chance-level decoding accuracy differs between those two figures. Decoding novel versus familiar odors has a chance level of 0.5, while decoding odor identity has a chance level of 0.1 (there are 10 odors to decode the identity from).

      (6) Some parts of the discussion seem only loosely related to the data presented in this manuscript. For example, the statement that 'AON rather than aPCx should be considered as the primary sensory cortex in olfaction' seems out of context. Similarly, it would be helpful to provide data on the stability of subpopulations of neurons tuned to familiar odors, rather than simply speculate that they could be stable. The authors could summarize more speculative statements in an 'Ideas and Speculation' subsection.

      Thank you for your comment. We appreciate your perspective on our hypotheses. We have revised the discussion accordingly. Specifically, we removed the discussion of stable subpopulations, since we have not performed longitudinal tracking in this study.

      (7) The authors should try to reference relevant published work more comprehensively.

      Thank you for your comment. We attempted to include relevant published work without exceeding the limit for references but might have overlooked important contributions. We apologize to our colleagues whose relevant work might not have been cited.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The main contributions of this paper are: (1) a replication of the surprising prior finding that information about peripherally-presented stimuli can be decoded from foveal V1 (Williams et al 2008), (2) a new demonstration of cross-decoding between stimuli presented in the periphery and stimuli presented at the fovea, (3) a demonstration that the information present in the fovea is based on shape not semantic category, and (4) a demonstration that the strength of foveal information about peripheral targets is correlated with the univariate response in the same block in IPS.

      Strengths:

      The design and methods appear sound, and finding (2) above is new, and importantly constrains our understanding of this surprising phenomenon. The basic effect investigated here is so surprising that even though it has been replicated several times since it was first reported in 2008, it is useful to replicate it again.

      We thank the reviewer for their summary. While we agree with many points, we would like to respectfully push back on the notion that this work is a replication of Williams et al. (2008). What our findings share with those of Williams is a report of surprising decoding at the fovea without foveal stimulation. Beyond this similarity, we treat these as related but clearly separate findings, for the following reasons:

      (1) Foveal feedback, as shown by Williams et al. (2008) and others during fixation, was only observed during a shape discrimination task, specific to the presented stimulus. Control experiments without such a task (or a color-related task) did not show effects of foveal feedback. In contrast, in the present study, the participants’ task was merely to perform saccades towards stimuli, independently of target features. We thus show that foveal feedback can occur independently of a task related to stimulus features. This dissociation demonstrates that our study must be tapping into something different than reported by Williams.

      (2) In a related study, Kroell and Rolfs (2022, 2025) demonstrated a connection between foveal feedback and saccade preparation, including the temporal details of the onset of this effect before saccade execution, highlighting the close link of this effect to saccade preparation. Here we used a very similar behavioral task to capture this saccade-related effect in neural recordings and investigate how early it occurs and what its nature is. Thus, there is a clear motivation for this study in the context of eye movement preparation that is separate from the previous work by Williams.

      (3) Lastly, decoding in the experimental task was positively associated with activity in FEF and IPS, areas that have been reliably linked to saccade preparation. We have now also performed an additional analysis (see our response to Specific point 2 of Reviewer 2) showing that decoding in the control condition did not show the same association, further supporting the link of foveal feedback to saccade preparation. 

      Despite our emphasis on these critical differences between studies, covert peripheral attention, as required by the task in Williams et al., and saccade preparation in natural vision, as in our study, are tightly coupled processes. Indeed, the task in Williams et al. would, during natural vision, likely involve an eye movement to the peripheral target. While speculative, a parsimonious and ecologically valid explanation is that both our study and earlier studies involve eye movement preparation, with execution being suppressed in studies enforcing fixation (e.g., Williams et al., 2008). We now discuss this idea of a shared underlying mechanism more extensively in the revised manuscript (pg 8 ln 228-240).

      Weaknesses:

      (1) The paper, including in the title ("Feedback of peripheral saccade targets to early foveal cortex") seems to assume that the feedback to foveal cortex occurs in conjunction with saccade preparation. However, participants in the original Williams et al (2008) paper never made saccades to the peripheral stimuli. So, saccade preparation is not necessary for this effect to occur. Some acknowledgement and discussion of this prior evidence against the interpretation of the effect as due to saccade preparation would be useful. (e.g., one might argue that saccade preparation is automatic when attending to peripheral stimuli.)

      We agree that the effects Williams et al. showed were not sufficiently discussed in the first version of this manuscript. To more clearly engage with these findings we now introduce saccade related foveal feedback (foveal prediction) and foveal feedback during fixation separately in the introduction (pg 2 ln 46-59).

      We further added another section in the discussion called “Foveal feedback during saccade preparation” in which we discuss how our findings are related to Williams et al. and how they differ (pg 8 ln 211-240). 

      As described in our previous response, we believe that our findings go beyond those described by Williams et al. (2008) and others in significant ways. However, during natural vision, the paradigm used by Williams et al. (2008) would likely be solved using an eye movement. Thus, while participants in Williams et al. (2008) did not execute saccades, it appears plausible that they prepared saccades. Given that covert peripheral attention and saccade preparation are tightly coupled processes (Kowler et al., 1995, Vis Res; Deubel & Schneider, 1996, Vis Res; Montagnini & Castet, 2007, J Vis; Rolfs & Carrasco, 2012, J Neurosci; Rolfs et al., 2011, Nat Neurosci), their results are parsimoniously explained by saccade preparation (but not execution) to a behaviorally relevant target.

      (2) The most important new finding from this paper is the cross-decodability between stimuli presented in the fovea and stimuli presented in the periphery. This finding should be related to the prior behavioral finding (Yu & Shim, 2016) that when a foveal foil stimulus identical to a peripheral target is presented 150 ms after the onset of the peripheral target, visual discrimination of the peripheral target is improved, and this congruency effect occurred even though participants did not consciously perceive the foveal stimulus (Yu, Q., & Shim, W. M., 2016. Modulating foveal representation can influence visual discrimination in the periphery. Journal of Vision, 16(3), 15-15).

      We thank the reviewer for highlighting this highly relevant reference. In the revised version of the manuscript, we now put more emphasis on the finding of cross-decodability (pg 2 ln 60-61). We now also discuss the finding of Yu and Shim (2016), which supports our conclusion that foveal feedback and direct stimulus presentation share representational formats in early visual areas (pg 9 ln 277-279).

      (3) The prior literature should be laid out more clearly. For example, most readers will not realize that the basic effect of decodability of peripherally-presented stimuli in the fovea was first reported in 2008, and that that original paper already showed that the effect cannot arise from spillover effects from peripheral retinotopic cortex because it was not present in a retinotopic location between the cortical locus corresponding to the peripheral target and the fovea. (For example, this claim on lines 56-57 is not correct: "it remains unknown 1) whether information is fed back all the way to early visual areas".) What is needed is a clear presentation of the prior findings in one place in the introduction to the paper, followed by an articulation and motivation of the new questions addressed in this paper. If I were writing the paper, I would focus on the cross-decodability between foveal and peripheral stimuli, as I think that is the most revealing finding.

      We agree that the structure of the introduction did not sufficiently place our work in the context of prior literature. We have now expanded upon our Introduction section to discuss past studies of saccade- and fixation-related foveal feedback (pg 2 ln 49-59), laying out how this effect has been studied previously. We also removed the claim that "it remains unknown 1) whether information is fed back all the way to early visual areas", where our intention was to specifically focus on foveal prediction. We realize that this was not clear and hence removed this section. Instead, we now place a stronger focus on the cross-decodability finding (pg 2 ln 60-61).

      Reviewer #2 (Public review):

      Summary:

      This study investigated whether the identity of a peripheral saccade target object is predictively fed back to the foveal retinotopic cortex during saccade preparation, a critical prediction of the foveal prediction hypothesis proposed by Kroell & Rolfs (2022). To achieve this, the authors leveraged a gaze-contingent fMRI paradigm, where the peripheral saccade target was removed before the eyes landed near it, and used multivariate decoding analysis to quantify identity information in the foveal cortex. The results showed that the identity of the saccade target object can be decoded based on foveal cortex activity, despite the fovea never directly viewing the object, and that the foveal feedback representation was similar to passive viewing and not explained by spillover effects. Additionally, exploratory analysis suggested IPS as a candidate region mediating such foveal decodability. Overall, these findings provide neural evidence for the foveal cortex processing the features of the saccade target object, potentially supporting the maintenance of perceptual stability across saccadic eye movements.

      Strengths:

      This study is well-motivated by previous theoretical findings (Kroell & Rolfs, 2022), aiming to provide neural evidence for a potential neural mechanism of trans-saccadic perceptual stability. The question is important, and the gaze-contingent fMRI paradigm is a solid methodological choice for the research goal. The use of stimuli allowing orthogonal decoding of stimulus category vs stimulus shape is a nice strength, and the resulting distinctions in decoded information by brain region are clean. The results will be of interest to readers in the field, and they fill in some untested questions regarding pre-saccadic remapping and foveal feedback.

      We thank the reviewer for the positive assessment of our study.

      Weaknesses:

      The conclusions feel a bit over-reaching; some strong theoretical claims are not fully supported, and the framing of prior literature is currently too narrow. A critical weakness lies in the inability to test a distinction between these findings (claiming to demonstrate that "feedback during saccade preparation must underlie this effect") and foveal feedback previously found during passive fixation (Williams et al., 2008). Discussions (and perhaps control analysis/experiments) about how these findings are specific to the saccade target and the temporal constraints on these effects are lacking. The relationship between the concepts of foveal prediction, foveal feedback, and predictive remapping needs more thorough treatment. The choice to use only 4 stimuli is justified in the manuscript, but remains an important limitation. The IPS results are intriguing but could be strengthened by additional control analysis. Finally, the manuscript claims the study was pre-registered ("detailing the hypotheses, methodology, and planned analyses prior to data collection"), but on the OSF link provided, there is just a brief summary paragraph, and the website says "there have been no completed registrations of this project".

      We thank the reviewer for these helpful considerations. We agree that some of the claims were not sufficiently supported by the evidence, and in the revised manuscript, we added nuance to those claims (pg 8 ln 211-240). Furthermore, we now address more directly the distinction between foveal feedback during fixation and foveal feedback (foveal prediction) during saccade preparation. In particular, we now describe the literature about these two effects separately in the introduction (pg 2 ln 46-59), and we have added a new section in the discussion (“Foveal feedback during saccade preparation”) that more thoroughly explains why a passive fixation condition would have been unlikely to produce the same results we find (pg 8 ln 211-227). We also adapted the section about “Saccadic remapping or foveal prediction”, clearly delineating foveal prediction from feature remapping and predictive updating of attention pointers. As recommended by the reviewer, we conducted the parametric modulation analyses on the control condition, strengthening the claim that our findings are saccade-related. These results were added as Supplementary Figure 2 and are discussed in (pg 7 ln 190-191) and (pg 8 ln 224-227). 

      Lastly, we would like to apologize for a mistake we made with the pre-registration. We realized that the pre-registration had indeed not been submitted. We have now done so without changing the pre-registration itself, which can be seen from the recent activity of the preregistration (screenshot attached at the end). After consulting an open science expert at the University of Leipzig, we added a note of this mistake to the methods section of the revised manuscript (pg 10 ln 326-332). We could remove the reference to this preregistration altogether, but would keep it at the discretion of the editor.

      Specifics:

      (1) In the eccentricity-dependent decoding results (Figure 2B), are there any statistical tests to support the results being a U-shaped curve? The dip isn't especially pronounced. Is 4 degrees lower than the further ones? Are there alternative methods of quantifying this (e.g., fitting it to a linear and quadratic function)?

      We statistically tested the U-shaped relationship using a weighted quadratic regression, which showed significant positive curvature for decoding between fovea and periphery in all early visual areas (V1: t(27) = 3.98, p = 0.008; V2: t(27) = 3.03, p = 0.02; V3: t(27) = 2.776, p = 0.025; one-sided). We now report these results in the revised manuscript (pg 5 ln 137-138).
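      As a rough illustration of what a weighted quadratic regression involves, the sketch below fits y = b0 + b1*x + b2*x^2 by weighted least squares, where a positive b2 indicates U-shaped (upward) curvature. It is not the analysis code used for the manuscript: the function name, the choice of weights, and the example values are hypothetical, and the group-level t-test on the curvature term is not shown.

```python
def weighted_quadratic_fit(x, y, w):
    """Fit y ~ b0 + b1*x + b2*x**2 by weighted least squares.

    Returns (b0, b1, b2). A positive b2 means upward (U-shaped) curvature.
    Hypothetical helper for illustration only.
    """
    # Normal equations (X^T W X) b = X^T W y for design columns [1, x, x^2].
    A = [[0.0] * 3 for _ in range(3)]
    rhs = [0.0] * 3
    for xi, yi, wi in zip(x, y, w):
        cols = (1.0, xi, xi * xi)
        for r in range(3):
            rhs[r] += wi * cols[r] * yi
            for c in range(3):
                A[r][c] += wi * cols[r] * cols[c]

    # Solve the 3x3 system by Gaussian elimination with partial pivoting.
    M = [row + [b] for row, b in zip(A, rhs)]
    for i in range(3):
        pivot = max(range(i, 3), key=lambda r: abs(M[r][i]))
        M[i], M[pivot] = M[pivot], M[i]
        for r in range(i + 1, 3):
            f = M[r][i] / M[i][i]
            for c in range(i, 4):
                M[r][c] -= f * M[i][c]

    # Back substitution.
    b = [0.0] * 3
    for i in (2, 1, 0):
        b[i] = (M[i][3] - sum(M[i][c] * b[c] for c in range(i + 1, 3))) / M[i][i]
    return tuple(b)


# Made-up eccentricities (degrees) and decoding values lying exactly on
# y = x^2 - 4x + 5, with made-up weights; real data would be noisy.
ecc = [0.0, 1.0, 2.0, 3.0, 4.0]
acc = [5.0, 2.0, 1.0, 2.0, 5.0]
weights = [1.0, 2.0, 1.0, 2.0, 1.0]
b0, b1, b2 = weighted_quadratic_fit(ecc, acc, weights)
print(b2 > 0)  # True: the fitted curvature is positive
```

      In practice, one curvature estimate per participant (or a mixed model) would then be tested against zero; which of these was used here is not specified in this response.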

      (2) In the parametric modulation analysis, the evidence for IPS being the only region showing stronger fovea vs peripheral beta values was weak, especially given the exploratory nature of this analysis. The raw beta value can reflect other things, such as global brain fluctuations or signal-to-noise ratio. I would also want to see the results of the same analysis performed on the control condition decoding results.

      We appreciate the reviewer’s suggestion and repeated the same parametric modulation analysis on the control condition to assess the influence of potential confounds on the overall beta values (Supplementary Figure 2). The results show a negative association between foveal decoding and FEF and IPS (likely because eye movements in the control condition lead to less foveal presentation of the stimulus) and a positive association with LO. Peripheral decoding was not associated with significant changes in any of the ROIs, indicating that global brain fluctuations alone are not responsible for the effects reported in the experimental condition. The results of this analysis thus show a specific positive association of IPS activity with the experimental condition, not the control condition, which is in line with the idea that the foveal feedback effect reported in this study may be related to saccade preparation.

      (3) Many of the claims feel overstated. There is an emphasis throughout the manuscript (including claims in the abstract) that these findings demonstrate foveal prediction, specifically that "image-specific feedback during saccade preparation must underlie this effect." To my understanding, one of the key aspects of the foveal prediction phenomenon that ties it closely to trans-saccadic stability is its specificity to the saccade target but not to other objects in the environment. However, it is not clear to what degree the observed findings are specific to saccade preparation and the peripheral saccade target. Should the observers be asked to make a saccade to another fixation location, or simply maintain passive fixation, will foveal retinotopic cortex similarly contain the object's identity information? Without these control conditions, the results are consistent with foveal prediction, but do not definitively demonstrate that as the cause, so claims need to be toned down.

      We fully agree with the reviewer and toned down claims about foveal prediction. We engage with the questions raised by the reviewer more thoroughly in the new discussion section “Foveal feedback during saccade preparation”.

      In addition, we agree that another condition in which subjects make a saccade towards a different location would have been a great addition that we also considered, but due to concerns with statistical power did not add. While including such a condition exceeds the scope of the current study, we included this limitation in the Discussion section (pg 10 ln 316) and hope that future studies will address this question.

      (4) Another critical aspect is the temporal locus of the feedback signal. In the paradigm, the authors ensured that the saccade target object was never foveated via the gaze-contingent procedure and a conservative data exclusion criterion, thus enabling the test of feedback signals to foveal retinotopic cortex. However, due to the temporal sluggishness of fMRI BOLD signals, it is unclear when the feedback signal arrives at the foveal retinotopic cortex. In other words, it is possible that the feedback signal arrives after the eyes land at the saccade target location. This possibility is also bolstered by Chambers et al. (2013)'s TMS study, where they found that TMS to the foveal cortex at 350-400 ms SOA interrupts the peripheral discrimination task. The authors should qualify their claims of the results occurring "during saccade preparation" (e.g., pg 1 ln 22) throughout the manuscript, and discuss the importance of temporal dynamics of the effect in supporting stability across saccades.

      We fully agree that the sluggishness of the fMRI signal presents an important challenge in investigating foveal feedback. We have now included this limitation in the discussion (pg 10 ln 306-318). We also clarify that our argument connects to previous studies investigating the temporal dynamics of foveal feedback using similar tasks (pg 10 ln 313-316). Specifically, in their psychophysical work, Kroell and Rolfs (2022, 2025) showed that foveal feedback occurs before saccade execution, with a peak around 80 ms before the eye movement.

      (5) Relatedly, the claims that result in this paradigm reflect "activity exclusively related to predictive feedback" and "must originate from predictive rather than direct visual processes" (e.g., lines 60-65 and throughout) need to be toned down. The experimental design nicely rules out direct visual foveal stimulation, but predictive feedback is not the only alternative to that. The activation could also reflect mental imagery, visual working memory, attention, etc. Importantly, the experiment uses a block design, where the same exact image is presented multiple times over the block, and the activation is taken for the block as a whole. Thus, while at no point was the image presented at the fovea, there could still be more going on than temporally-specific and saccade-specific predictive feedback.

      We agree that those claims could have misled the reader. Our intention was to state that the activation originates from feedback rather than direct foveal stimulation because of the nature of the design. We have now clarified these statements (pg 2 ln 65) and also included a discussion of other effects including imagery and working memory in the limitations section (pg 10 ln 306-313).

      (6) The authors should avoid using the terms foveal feedback and foveal prediction interchangeably. To me, foveal feedback refers to the findings of Williams et al. (2008), where participants maintained passive fixation and discriminated objects in the periphery (see also Fan et al., 2016), whereas foveal prediction refers to the neural mechanism hypothesized by Kroell & Rolfs (2022), occurring before a saccade to the target object and contains task irrelevant feature information.

      We agree, and we have now adopted a clearer distinction between these terms, referring to foveal prediction only when discussing the distinct predictive nature of the effect discovered by Kroell and Rolfs (2022). Otherwise, we refer to this effect as foveal feedback.

      (7) More broadly, the treatment of how foveal prediction relates to saccadic remapping is overly simplistic. The authors seem to be taking the perspective that remapping is an attentional phenomenon marked by remapping of only attentional/spatial pointers, but this is not the classic or widely accepted definition of remapping. Within the field of saccadic remapping, it is an ongoing debate whether (/how/where/when) information about stimulus content is remapped alongside spatial location (and also whether the attentional pointer concept is even neurophysiologically viable). This relationship between saccadic remapping and foveal prediction needs clarification and deeper treatment, in both the introduction and discussion.

      We thank the reviewer for their remarks. We reformulated the discussion section on “Saccadic remapping or foveal prediction” to include the nuances about spatial and feature remapping laid out in the reviewer’s comment (pg 8-9 ln 241-269). We also put a stronger focus on the special role the fovea seems to be playing regarding the feedback of visual features (pg 8-9 ln 265-269).

      (8) As part of this enhanced discussion, the findings should be better integrated with prior studies. E.g., there is some evidence for predictive remapping inducing integration of non-spatial features (some by the authors themselves; Harrison et al., 2013; Szinte et al., 2015). How do these findings relate to the observed results? Can the results simply be a special case of non-spatial feature integration between the currently attended and remapped location (fovea)? How are the results different from neurophysiological evidence for facilitation of the saccade target object's feature across the visual field (Burrows et al., 2014)? How might the results be reconciled with a prior fMRI study that failed to find decoding of stimulus content in remapped responses (Lescroart et al, 2016)? Might this reflect a difference between peripheral-to-peripheral vs peripheral-to-foveal remapping? A recent study by Chiu & Golomb (2025) provided supporting evidence for peripheral-to-fovea remapping (but not peripheral-to-peripheral remapping) of object-location binding (though in the post-saccadic time window), and suggested foveal prediction as the underlying mechanism.

      We thank the reviewer for raising these intriguing questions. We now address them in the revised discussion. We argue that the findings by Harrison et al., 2013 and Szinte et al., 2015 of presaccadic integration of features across two peripheral locations can be explained by presaccadic updating of spatial attention pointers rather than remapping of feature information (pg 8 ln 248-253). The lack of evidence for periphery-to-periphery remapping (Lescroart et al, 2016) and the recent study by Chiu & Golomb (2025) showing object-location binding from periphery to fovea nicely align with our characterization of foveal processing as unique in predicting feature information of upcoming stimuli (pg 8-9 ln 265-269). Finally, we argue that the global (i.e., space-invariant) selection of task-irrelevant saccadic target features (Burrows et al., 2014) is well-established at the neural level, but does not suffice to explain the spatially specific nature of foveal prediction (pg 8 ln 220-224). We now include these studies in the revised discussion section.

      Reviewer #3 (Public review):

      Summary:

      In this paper, the authors used fMRI to determine whether peripherally viewed objects could be decoded from the foveal cortex, even when the objects themselves were never viewed foveally. Specifically, they investigated whether pre-saccadic target attributes (shape, semantic category) could be decoded from the foveal cortex. They found that object shape, but not semantic category, could be decoded, providing evidence that foveal feedback relies on low-mid-level information. The authors claim that this provides evidence for a mechanism underlying visual stability and object recognition across saccades.

      Strengths:

      I think this is another nice demonstration that peripheral information can be decoded from / is processed in the foveal cortex - the methods seem appropriate, and the experiments and analyses are carefully conducted, and the main results seem convincing. The paper itself was very clear and well-written.

      We thank the reviewer for this positive evaluation of our work. As discussed in our response to Reviewer 1, we now elaborate on the differences between previous work showing decoding of peripheral information from foveal cortex and the effect shown here. While there are important similarities between these findings, foveal prediction in our study occurs in a saccade condition and in the absence of a task that is specific to stimulus features.

      Weaknesses:

      There are a couple of reasons why I think the main theoretical conclusions drawn from the study might not be supported, and why a more thorough investigation might be needed to draw these conclusions.

      (1) The authors used a blocked design, with each object being shown repeatedly in the same block. This meant that the stimulus was entirely predictable on each block, which weakens the authors' claims about this being a predictive mechanism that facilitates object recognition - if the stimulus is 100% predictable, there is no aspect of recognition or discrimination actually being tested. I think to strengthen these claims, an experiment would need to have unpredictable stimuli, and potentially combine behavioural reports with decoding to see whether this mechanism can be linked to facilitating object recognition across saccades.

      We appreciate the reviewer’s point and would like to highlight that it was not our intention to claim a behavioral effect on object recognition. We believe that an ambiguous formulation in the original abstract may have been interpreted this way, and we thus removed this reference. We also speculated in our Discussion that a potential reason for foveal prediction could be a headstart in peripheral object recognition. In the revised manuscript, we more clearly highlight that this is only a potential future direction.

      (2)  Given that foveal feedback has been found in previous studies that don't incorporate saccades, how is this a mechanism that might specifically contribute to stability across saccades, rather than just being a general mechanism that aids the processing/discrimination of peripherally-viewed stimuli? I don't think this paper addresses this point, which would seem to be crucial to differentiate the results from those of previous studies.

      We fully agree that this point had not been sufficiently addressed in the previous version of the manuscript. As described in our responses to similar comments from reviewers 1 and 2, we included an additional section in the Discussion (“Foveal feedback during saccade preparation”) to more clearly delineate the present study from previous findings of foveal feedback. Previous studies (Williams et al., 2008) only found foveal feedback during narrow discrimination tasks related to spatial features of the target stimulus, not during color-discrimination or fixation-only tasks, concluding that the observed effect must be related to the discrimination behavior. In contrast, we found foveal feedback (as evidenced by decoding of target features) during a saccade condition that was independent of the target features, suggesting a different role of foveal feedback than hypothesized by Williams et al. (2008).

      Recommendations for the authors:  

      Reviewer #2 (Recommendations for the authors):

      (A) Minor comments:

      (1)  The task should be clarified earlier in the manuscript.

      We now characterise the task in the abstract and clarified its description in the third paragraph, right after introducing the main literature.

      (2) Is there actually only 0.5 seconds between saccades? This feels very short/rushed.

      The inter-trial interval was 0.5 seconds, though effectively it varied because the target only appeared once participants fixated on the fixation dot. Note that this pacing is slower than the rate of saccades in natural vision (about 3 to 4 saccades per second). Participants did not report this paradigm as rushed.

      (3) Typo on pg2 ln64 (whooe).

      Fixed.

      (4)  Can the authors also show individual data points for Figures 3 and 4?

      We added individual data points for Figures 4 and S2.

      (5) The MNI coordinates on Figure 4A seem to be incorrect.

      We took out those coordinates.

      (6) Pg4 ln126 and pg6 ln194, why cite Williams et al. (2008)?

      We included this reference here to acknowledge that Williams et al. raised the same issues. We added a “cf.” before this reference to clarify this.

      (7) Pg7 ln207 Fabius et al. (2020) showed slow post-saccadic feature remapping, rather than predictive remapping of spatial attention.

      We have corrected this mistake.

      (8) The OSF link is valid, but I couldn't find a pre-registration.

      The issue with the OSF link has been resolved. The pre-registration had been set up but not published. We have now published it without changing the original pre-registration (see the screenshot attached).

      (9) I couldn't access the OpenNeuro repository.

      The issue with the OpenNeuro link has been resolved.

      (B) Additional references you may wish to include:

      (1) Burrows, B. E., Zirnsak, M., Akhlaghpour, H., Wang, M., & Moore, T. (2014). Global selection of saccadic target features by neurons in area V4. Journal of Neuroscience.

      (2) Chambers, C. D., Allen, C. P., Maizey, L., & Williams, M. A. (2013). Is delayed foveal feedback critical for extra-foveal perception?. Cortex.

      (3) Chiu, T. Y., & Golomb, J. D. (2025). The influence of saccade target status on the reference frame of object-location binding. Journal of Experimental Psychology. General.

      (4) Harrison, W. J., Retell, J. D., Remington, R. W., & Mattingley, J. B. (2013). Visual crowding at a distance during predictive remapping. Current Biology.

      (5) Lescroart, M. D., Kanwisher, N., & Golomb, J. D. (2016). No evidence for automatic remapping of stimulus features or location found with fMRI. Frontiers in Systems Neuroscience.

      (6) Moran, C., Johnson, P. A., Hogendoorn, H., & Landau, A. N. (2025). The representation of stimulus features during stable fixation and active vision. Journal of Neuroscience.

      (7) Szinte, M., Jonikaitis, D., Rolfs, M., Cavanagh, P., & Deubel, H. (2016). Presaccadic motion integration between current and future retinotopic locations of attended objects. Journal of Neurophysiology.

      We thank the reviewer for pointing out these references. We have included them in the revised version of the manuscript.

      Reviewer #3 (Recommendations for the authors):

      I just have a few minor points where I think some clarifications could be made.

      (1) Line 64 - "whooe" should be "whoose" I think.

      Fixed.

      (2) Around line 53 - you might consider citing this review on foveal feedback - https://doi.org/10.1167/jov.20.12.2

      We included the reference (pg 2 ln 55).

      (3) Line 129 - you mention a u-shaped relationship for decoding - I wasn't quite sure of the significance/relevance of this relationship - it would be helpful to expand on this / clarify what this means.

      We have expanded this section and added statistical tests of the u-shaped relationship in decoding using a weighted quadratic regression. We found significant positive curvature in all early visual areas between fovea and periphery (V1: t(27) = 3.98, p = 0.008; V2: t(27) = 3.03, p = 0.02; V3: t(27) = 2.776, p = 0.025). These findings support a u-shaped relationship. We now report these results in the revised manuscript (pg 5 ln 137-138).

      (4) Figure 1 - it would be helpful to indicate how long the target was viewed in the "stim on" panels - I assume it was for the saccade latency, but it would be good to include those values in the main text.

      We included that detail in the text (pg 3 ln 96-97).

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1

      (1) Related to comment 3, related to the spatial communication section, either provide a clearer worked example or adjust the framing to avoid implying a more developed capability than is shown.

      We appreciate the reviewer’s feedback regarding the framing of the spatial communication section. We have removed this section from the revised version.

      (2) Related to comment 4 about resolution, consider including explicit numerical estimates of spatial resolution (e.g., median patch diameter in micrometers) for at least one dataset to help users understand practical mapping granularity.

      We appreciate the suggestion. We have added explicit numerical estimates of spatial resolution to clarify our mappings. Specifically, we now (i) define “patch” precisely and (ii) report the median patch diameter (in µm) for representative datasets:

      10x Visium (mouse cortex): spot diameter = 55 µm; center-to-center spacing = 100 µm.

      Slide-seqV2 (mouse brain): bead diameter ≈ 10 µm. When we optionally coarse-grain to 5×5 bead tiles for robustness, the effective patch diameter is ~50 µm.
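      The ~50 µm figure for coarse-grained Slide-seqV2 tiles follows from simple arithmetic, sketched below. The helper name is hypothetical, and the calculation assumes beads laid edge to edge on a regular grid, which is a simplification of real bead packing.

```python
def tile_diameter_um(bead_diameter_um, beads_per_side):
    """Approximate side length (in µm) of a square tile of beads.

    Assumes beads sit edge to edge on a regular grid (a simplification).
    """
    return bead_diameter_um * beads_per_side


# 5x5 tile of ~10 µm beads, as in the Slide-seqV2 example above.
print(tile_diameter_um(10, 5))  # 50
```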

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study examines whether changes in pupil size index prediction-error-related updating during associative learning, formalised as information gain via Kullback-Leibler (KL) divergence. Across two independent tasks, pupil responses scaled with KL divergence shortly after feedback, with the timing and direction of the response varying by task. Overall, the work supports the view that pupil size reflects information-theoretic processes in a context-dependent manner.
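      For readers less familiar with the measure, information gain as KL divergence can be computed from a prior and a posterior distribution over the same discrete events, as in the minimal sketch below. The direction of the divergence (posterior relative to prior), the base-2 logarithm, and the function name are assumptions made for illustration; the manuscript's own implementation may differ.

```python
import math


def kl_divergence(posterior, prior):
    """D_KL(posterior || prior) in bits for two discrete distributions.

    Both arguments are sequences of probabilities over the same events.
    Terms with posterior probability 0 contribute nothing; the prior is
    assumed to be non-zero wherever the posterior is non-zero.
    """
    return sum(p * math.log2(p / q) for p, q in zip(posterior, prior) if p > 0)


# No belief change means no information gain.
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
# Moving from a uniform prior over two events to certainty gains 1 bit.
print(kl_divergence([1.0, 0.0], [0.5, 0.5]))  # 1.0
```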

      Strengths:

      This study provides a novel and convincing contribution by linking pupil dilation to information-theoretic measures, such as KL divergence, supporting Zénon's hypothesis that pupil responses reflect information gain during learning. The robust methodology, including two independent datasets with distinct task structures, enhances the reliability and generalisability of the findings. By carefully analysing early and late time windows, the authors capture the timing and direction of prediction-error-related responses, offering new insights into the temporal dynamics of model updating. The use of an ideal-learner framework to quantify prediction errors, surprise, and uncertainty provides a principled account of the computational processes underlying pupil responses. The work also highlights the critical role of task context in shaping the direction and magnitude of these effects, revealing the adaptability of predictive processing mechanisms. Importantly, the conclusions are supported by rigorous control analyses and preprocessing sanity checks, as well as convergent results from frequentist and Bayesian linear mixed-effects modelling approaches.

      Weaknesses:

      Some aspects of directionality remain context-dependent, and on current evidence cannot be attributed specifically to whether average uncertainty increases or decreases across trials. Differences between the two tasks (e.g., sensory modality and learning regime) limit direct comparisons of effect direction and make mechanistic attribution cautious. In addition, subjective factors such as confidence were not measured and could influence both prediction-error signals and pupil responses. Importantly, the authors explicitly acknowledge these limitations, and the manuscript clearly frames them as areas for future work rather than settled conclusions.

      Reviewer #2 (Public review):

      Summary:

      The authors investigate whether pupil dilation reflects information gain during associative learning, formalised as Kullback-Leibler divergence within an ideal observer framework. They examine pupil responses in a late time window after feedback and compare these to information-theoretic estimates (information gain, surprise, and entropy) derived from two different tasks with contrasting uncertainty dynamics.

      Strength:

      The exploration of task-evoked pupil dynamics beyond the immediate response/feedback period and then associating them with model estimates was interesting and inspiring. This offered a new perspective on the relationship between pupil dilation and information processing.

      Weakness:

      However, the interpretability of the findings remains constrained by the fundamental differences between the two tasks (stimulus modality, feedback type, and learning structure), which confound the claimed context-dependent effects. The later time-window pupil effects, although intriguing, are small in magnitude and may reflect residual noise or task-specific arousal fluctuations rather than distinct information-processing signals. Thus, while the study offers valuable methodological insight and contributes to ongoing debates about the role of the pupil in cognitive inference, its conclusions about the functional significance of late pupil responses should be treated with caution.

      Reviewer #3 (Public review):

      Summary:

      Thank you for inviting me to review this manuscript entitled "Pupil dilation offers a time-window on prediction error" by Colizoli and colleagues. The study examines prediction errors, information gain (Kullback-Leibler [KL] divergence), and uncertainty (entropy) from an information-theory perspective using two experimental tasks and pupillometry. The authors aim to test a theoretical proposal by Zénon (2019) that the pupil response reflects information gain (KL divergence). The conclusion of this work is that (post-feedback) pupil dilation in response to information gain is context dependent.

      Strengths:

      Use of an established Bayesian model to compute KL divergence and entropy.

      Pupillometry data preprocessing and multiple robustness checks.

      Weaknesses:

      Operationalization of prediction errors based on frequency, accuracy, and their interaction:

      The authors rely on a more model-agnostic definition of the prediction error in terms of stimulus frequency ("unsigned prediction error"), accuracy, and their interaction ("signed prediction error"). While I see the point, I would argue that this approach provides a simple approximation of the prediction error, but that a model-based approach would be more appropriate.

      Model validation:

      My impression is that the ideal learner model should work well in this case. However, the authors don't directly compare model behavior to participant behavior ("posterior predictive checks") to validate the model. Therefore, it is currently unclear if the model-derived terms like KL divergence and entropy provide reasonable estimates for the participant data.

      Lack of a clear conclusion:

      The authors conclude that this study shows for the first time that (post-feedback) pupil dilation in response to information gain is context dependent. However, the study does not offer a unifying explanation for such context dependence. The discussion is quite detailed with respect to task-specific effects, but fails to provide an overarching perspective on the context-dependent nature of pupil signatures of information gain. This seems to be partly due to the strong differences between the experimental tasks.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      I highly appreciate the care and detail in the authors' response and thank them for the effort invested in revising the manuscript. They addressed the core concerns to a high standard, and the manuscript has substantially improved in methodological rigour (through additional controls/sanity checks and complementary mixed-effects analyses) and in clarity of interpretation (by explicitly acknowledging context-dependence and tempering stronger claims). The present version reads clearly and is much strengthened overall. I only have a few minor points below:

      Minor suggestions:

      Abstract:

      In the abstract, KL is introduced as an abbreviation, but at first occurrence it should be written out as "Kullback-Leibler (KL)" for readers not familiar with it.

      We thank the reviewer for catching this error. It has been corrected in the version of record.

      Methods:

      I appreciate the additional Bayesian LME analysis. I only had a few things that I thought were missing from knowing the parameters: 1) what was the target acceptance rate (default of .95?), 2) which family was used to model the response distribution: (default) "gaussian" or robust "student-t"? Depending on the data a Student-t would be preferred, but since the authors checked the fit & the results corroborate the correlation analysis, using the default would also be fine! Just add the information for completeness.

      Thank you for bringing this to our attention. We have now noted that default parameters were used in all cases unless otherwise mentioned. 

      Thank you once again for your time and consideration.

      Reviewer #2 (Recommendations for the authors):

      Thanks to the authors for their effort on the revision. I am happy with this new version of the manuscript.

      Thank you once again for your time and consideration.

      Reviewer #3 (Recommendations for the authors):

      (1) Regarding comments #3 and #6 (first round) on model validation and posterior predictive checks, the authors replied that since their model is not a "generative" one, they can't perform posterior predictive checks. Crucially, in eq. 2, the authors present the p{tilde}^j_k variable denoting the learned probability of event k on trial j. I don't see why this can't be exploited for simulations. In my opinion, one could (and should) generate predictions based on this variable. The simplest implementation would translate the probability into a categorical choice (w/o fitting any free parameter). Based on this, they could assess whether the model and data are comparable.

      We thank the reviewer for this clarification. The reviewer suggests using the probability distributions at each trial to predict which event should be chosen on each trial. More specifically, the event(s) with the highest probability on trial j could be used to generate a prediction for the choice of the participant on trial j. We agree that this would indeed be an interesting analysis. However, the response options of each task are limited to two alternatives. In the cue-target task, four events are modeled (representing all possible cue-target conditions) while the participants’ response options are only “left” and “right”. Similarly, in the letter-color task, 36 events are modeled while the participants’ response options are “match” and “no-match”. In other words, we do not know which event (out of four or 36, for the two tasks respectively) the participant would have indicated on each trial. As an approximation to this fine-grained analysis, we investigated the relationship between the information-theoretic variables separately for error and correct trials. Our rationale was that this would give us more insight into how the model fits depended on the participants’ actual behavior as compared with the ideal learner model.

      (2) I recommend providing a plot of the linear mixed model analysis of the pupil data. Currently, results are only presented in the text and tables, but a figure would be much more useful.

      We thank the reviewer for the suggestion to add a plot of the linear mixed model results. We appreciate the value of visualizing model estimates; however, we feel that the current presentation in the text and tables clearly conveys the relevant findings. For this reason, and to avoid further lengthening the manuscript, we prefer to retain the current format.

      (3) I would consider only presenting the linear mixed effects for the pupil data in the main results, and the correlation results in the supplement. It is currently quite long.

      We thank the reviewer for this recommendation. We agree that the results section is detailed; however, we consider the correlation analyses to be integral to the interpretation of the pupil data and therefore prefer to keep them in the main text rather than move them to the supplement.


      The following is the authors’ response to the original reviews

      eLife Assessment

      This important study seeks to examine the relationship between pupil size and information gain, showing opposite effects dependent upon whether the average uncertainty increases or decreases across trials. Given the broad implications for learning and perception, the findings will be of broad interest to researchers in cognitive neuroscience, decision-making, and computational modelling. Nevertheless, the evidence in support of the particular conclusion is at present incomplete - the conclusions would be strengthened if the authors could both clarify the differences between model-updating and prediction error in their account and clarify the patterns in the data.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study investigates whether pupil dilation reflects prediction error signals during associative learning, defined formally by Kullback-Leibler (KL) divergence, an information-theoretic measure of information gain. Two independent tasks with different entropy dynamics (decreasing and increasing uncertainty) were analyzed: the cue-target 2AFC task and the letter-color 2AFC task. Results revealed that pupil responses scaled with KL divergence shortly after feedback onset, but the direction of this relationship depended on whether uncertainty (entropy) increased or decreased across trials. Furthermore, signed prediction errors (interaction between frequency and accuracy) emerged at different time windows across tasks, suggesting task-specific temporal components of model updating. Overall, the findings highlight that pupil dilation reflects information-theoretic processes in a complex, context-dependent manner.

      Strengths:

      This study provides a novel and convincing contribution by linking pupil dilation to information-theoretic measures, such as KL divergence, supporting Zénon's hypothesis that pupil responses reflect information gained during learning. The robust methodology, including two independent datasets with distinct entropy dynamics, enhances the reliability and generalisability of the findings. By carefully analysing early and late time windows, the authors capture the temporal dynamics of prediction error signals, offering new insights into the timing of model updates. The use of an ideal learner model to quantify prediction errors, surprise, and entropy provides a principled framework for understanding the computational processes underlying pupil responses. Furthermore, the study highlights the critical role of task context - specifically increasing versus decreasing entropy - in shaping the directionality and magnitude of these effects, revealing the adaptability of predictive processing mechanisms.

      Weaknesses:

      While this study offers important insights, several limitations remain. The two tasks differ significantly in design (e.g., sensory modality and learning type), complicating direct comparisons and limiting the interpretation of differences in pupil dynamics. Importantly, the apparent context-dependent reversal between pupil constriction and dilation in response to feedback raises concerns about how these opposing effects might confound the observed correlations with KL divergence. 

      We agree with the reviewer’s concerns and acknowledge that the speculation concerning the directional effect of entropy across trials cannot be fully substantiated by the current study. As the reviewer points out, the directional relationship between pupil dilation and information gain must be due to other factors, for instance, the sensory modality, learning type, or the reversal between pupil constriction and dilation across the two tasks. Also, we would like to note that ongoing experiments in our lab already contradict our original speculation. In line with the reviewer’s point, we noted these differences in the section on “Limitations and future research” in the Discussion. To better align the manuscript with the above-mentioned points, we have made several changes in the Abstract, Introduction and Discussion, summarized below:

      We have removed the following text from the Abstract and Introduction: “…, specifically related to increasing or decreasing average uncertainty (entropy) across trials.”

      We have edited the following text in the Introduction (changes in italics) (p. 5):

      “We analyzed two independent datasets featuring distinct associative learning paradigms, one characterized by increasing entropy and the other by decreasing entropy as the tasks progressed. By examining these different tasks, we aimed to identify commonalities (if any) in the results across varying contexts. Additionally, the contrasting directions of entropy in the two tasks enabled us to disentangle the correlation between stimulus-pair frequency and information gain in the post-feedback pupil response.”

      We have removed the following text from the Discussion:

      “…and information gain in fact seems to be driven by increased uncertainty.”

      “We speculate that this difference in the direction of scaling between information gain and the pupil response may depend on whether entropy was increasing or decreasing across trials.” 

      “…which could explain the opposite direction of the relationship between pupil dilation and information gain”

      “… and seems to relate to the direction of the entropy as learning progresses (i.e., either increasing or decreasing average uncertainty).” 

      We have edited the following texts in the Discussion (changes in italics):

      “For the first time, we show that the direction of the relationship between post-feedback pupil dilation and information gain (defined as KL divergence) was context dependent.” (p. 29)

      Finally, we have added the following correction to the Discussion (p. 30):

      “Although it is tempting to speculate that the direction of the relationship between pupil dilation and information gain may be due to either increasing or decreasing entropy as the task progressed, we must refrain from this conclusion. We note that the two tasks differ substantially in terms of design with other confounding variables and therefore cannot be directly compared to one another. We expand on these limitations in the section below (see Limitations and future research).”

      Finally, subjective factors such as participants' confidence and internal belief states were not measured, despite their potential influence on prediction errors and pupil responses.

      Thank you for the thoughtful comment. We agree with the reviewer that subjective factors, such as participants' confidence, can be important in understanding prediction errors and pupil responses. As per the reviewer’s point, we have included the following limitation in the Discussion (p. 33): 

      “Finally, while we acknowledge the potential relevance of subjective factors, such as the participants’ overt confidence reports, in understanding prediction errors and pupil responses, the current study focused on the more objective, model-driven measure of information-theoretic variables. This approach aligns with our use of the ideal learner model, which estimates information-theoretic variables while being agnostic about the observer's subjective experience itself. Future research is needed to explore the relationship between information-gain signals in pupil dilation and the observer’s reported experience of or awareness about confidence in their decisions.” 

      Reviewer #2 (Public review):

      Summary:

      The authors proposed that variability in post-feedback pupillary responses during the associative learning tasks can be explained by information gain, which is measured as KL divergence. They analysed pupil responses in a later time window (2.5s-3s after feedback onset) and correlated them with information-theory-based estimates from an ideal learner model (i.e., information gain-KL divergence, surprise-subjective probability, and entropy-average uncertainty) in two different associative decision-making tasks.

      Strength:

      The exploration of task-evoked pupil dynamics beyond the immediate response/feedback period and then associating them with model estimates was interesting and inspiring. This offered a new perspective on the relationship between pupil dilation and information processing.

      Weakness:

      However, disentangling these later effects from noise needs caution. Noise in pupillometry can arise from variations in stimuli and task engagement, as well as artefacts from earlier pupil dynamics. The increasing variance in the time series of pupillary responses (e.g., as shown in Figure 2D) highlights this concern.

      It's also unclear what this complicated association between information gain and pupil dynamics actually means. The complexity of the two different tasks reported made the interpretation more difficult in the present manuscript.

      We share the reviewer’s concerns. To make this point come across more clearly, we have added the following text to the Introduction (p. 5):

      “The current study was motivated by Zénon’s hypothesis concerning the relationship between pupil dilation and information gain, particularly in light of the varying sources of signal and noise introduced by task context and pupil dynamics. By demonstrating how task context can influence which signals are reflected in pupil dilation, and highlighting the importance of considering their temporal dynamics, we aim to promote a more nuanced and model-driven approach to cognitive research using pupillometry.”

      Reviewer #3 (Public review):

      Summary:

      This study examines prediction errors, information gain (Kullback-Leibler [KL] divergence), and uncertainty (entropy) from an information-theory perspective using two experimental tasks and pupillometry. The authors aim to test a theoretical proposal by Zénon (2019) that the pupil response reflects information gain (KL divergence). In particular, the study defines the prediction error in terms of KL divergence and speculates that changes in pupil size associated with KL divergence depend on entropy. Moreover, the authors examine the temporal characteristics of pupil correlates of prediction errors, which differed considerably across previous studies that employed different experimental paradigms. In my opinion, the study does not achieve these aims due to several methodological and theoretical issues.

      Strengths:

      (1)  Use of an established Bayesian model to compute KL divergence and entropy.

      (2)  Pupillometry data preprocessing, including deconvolution.

      Weaknesses:

      (1) Definition of the prediction error in terms of KL divergence:

      I'm concerned about the authors' theoretical assumption that the prediction error is defined in terms of KL divergence. The authors primarily refer to a review article by Zénon (2019): "Eye pupil signals information gain". It is my understanding that Zénon argues that KL divergence quantifies the update of a belief, not the prediction error: "In short, updates of the brain's internal model, quantified formally as the Kullback-Leibler (KL) divergence between prior and posterior beliefs, would be the common denominator to all these instances of pupillary dilation to cognition." (Zénon, 2019).

      From my perspective, the update differs from the prediction error. Prediction error refers to the difference between outcome and expectation, while update refers to the difference between the prior and the posterior. The prediction error can drive the update, but the update is typically smaller, for example, because the prediction error is weighted by the learning rate to compute the update. My interpretation of Zénon (2019) is that they explicitly argue that KL divergence defines the update in terms of the described difference between prior and posterior, not the prediction error.
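
The distinction drawn in this comment can be illustrated with a delta-rule learner. This is only a minimal sketch and is not taken from the manuscript: the function name `delta_rule_step` and the numerical values are hypothetical, chosen to show that the update is the prediction error scaled by a learning rate and is therefore typically smaller than the prediction error itself.

```python
def delta_rule_step(expectation, outcome, learning_rate):
    """One step of a delta-rule (Rescorla-Wagner style) learner.

    Returns the prediction error, the update, and the new expectation.
    All names and values are illustrative, not taken from the manuscript.
    """
    prediction_error = outcome - expectation    # outcome minus expectation
    update = learning_rate * prediction_error   # change from prior to posterior belief
    return prediction_error, update, expectation + update


# Hypothetical trial: expectation 0.2, outcome 1.0, learning rate 0.25
pe, update, new_expectation = delta_rule_step(0.2, 1.0, 0.25)
print(pe, update, new_expectation)
```

With any learning rate below 1, the magnitude of the update is smaller than that of the prediction error, which is the sense in which the two quantities differ.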

      The authors also cite a few other papers, including Friston (2010), where I also could not find a definition of the prediction error in terms of KL divergence. For example [KL divergence:] "A non-commutative measure of the non-negative difference between two probability distributions." Similarly, Friston (2010) states: Bayesian Surprise - "A measure of salience based on the Kullback-Leibler divergence between the recognition density (which encodes posterior beliefs) and the prior density. It measures the information that can be recognized in the data." Finally, also in O'Reilly (2013), KL divergence is used to define the update of the internal model, not the prediction error.

      The authors seem to mix up this common definition of the model update in terms of KL divergence and their definition of prediction error along the same lines. For example, on page 4: "KL divergence is a measure of the difference between two probability distributions. In the context of predictive processing, KL divergence can be used to quantify the mismatch between the probability distributions corresponding to the brain's expectations about incoming sensory input and the actual sensory input received, in other words, the prediction error (Friston, 2010; Spratling, 2017)."

      Similarly (page 23): "In the current study, we investigated whether the pupil's response to decision outcome (i.e., feedback) in the context of associative learning reflects a prediction error as defined by KL divergence."

      This is problematic because the results might actually have limited implications for the authors' main perspective (i.e., that the pupil encodes prediction errors) and could be better interpreted in terms of model updating. In my opinion, there are two potential ways to deal with this issue:

      (a) Cite work that unambiguously supports the perspective that it is reasonable to define the prediction error in terms of KL divergence and that this has a link to pupillometry. In this case, it would be necessary to clearly explain the definition of the prediction error in terms of KL divergence and dissociate it from the definition in terms of model updating.

      (b) If there is no prior work supporting the authors' current perspective on the prediction error, it might be necessary to revise the entire paper substantially and focus on the definition in terms of model updating.

      We thank the reviewer for pointing out these inconsistencies in the manuscript and appreciate their suggestions for improvement. We take approach (a) recommended by the reviewer, and provide our reasoning as to why prediction error signals in pupil dilation are expected to correlate with information gain (defined as the KL divergence between posterior and prior belief distributions). This can be found in a new section in the introduction, copied here for convenience (p. 3-4):

      “We reasoned that the link between prediction error signals and information gain in pupil dilation is through precision-weighting. Precision refers to the amount of uncertainty (inverse variance) of both the prior belief and sensory input in the prediction error signals [6,64–67]. More precise prediction errors receive more weighting, and therefore, have greater influence on model updating processes. The precision-weighting of prediction error signals may provide a mechanism for distinguishing between known and unknown sources of uncertainty, related to the inherent stochastic nature of a signal versus insufficient information on the part of the observer, respectively [65,67,68]. In Bayesian frameworks, information gain is fundamentally linked to prediction error, modulated by precision [65,66,69–75]. In non-hierarchical Bayesian models, information gain can be derived as a function of prediction errors and the precision of the prior and likelihood distributions, a relationship that can be approximately linear [70]. In hierarchical Bayesian inference, the update in beliefs (posterior mean changes) at each level is proportional to the precision-weighted prediction error; this update encodes the information gained from new observations [65,66,69,71,72]. Neuromodulatory arousal systems are well-situated to act as precision-weighting mechanisms in line with predictive processing frameworks [76,77]. Empirical evidence suggests that neuromodulatory systems broadcast precision-weighted prediction errors to cortical regions [11,59,66,78]. Therefore, the hypothesis that feedback-locked pupil dilation reflects a prediction error signal is similarly in line with Zénon’s main claim that pupil dilation generally reflects information gain, through precision-weighting of the prediction error. We expected a prediction error signal in pupil dilation to be proportional to the information gain.”
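
The relationship described in this passage can be sketched for the textbook case of a Gaussian prior and a Gaussian observation with known precision. This is a toy illustration under stated assumptions, not the ideal learner model used in the manuscript; the function name `gaussian_update` and all values are hypothetical. In this case the shift of the posterior mean equals the precision-weighted prediction error, and the information gain (KL divergence from prior to posterior) grows with the magnitude of that weighted error.

```python
import math


def gaussian_update(prior_mean, prior_precision, observation, obs_precision):
    """Conjugate Gaussian update with known observation precision.

    Returns (precision-weighted prediction error, posterior mean,
    posterior precision, information gain in nats).
    """
    prediction_error = observation - prior_mean
    posterior_precision = prior_precision + obs_precision
    weight = obs_precision / posterior_precision      # relative precision of the input
    weighted_pe = weight * prediction_error           # shift of the posterior mean
    posterior_mean = prior_mean + weighted_pe
    # KL(posterior || prior) for two univariate Gaussians, written with precisions
    info_gain = 0.5 * (
        math.log(posterior_precision / prior_precision)
        + prior_precision / posterior_precision
        - 1.0
        + prior_precision * weighted_pe ** 2
    )
    return weighted_pe, posterior_mean, posterior_precision, info_gain


# Hypothetical values: equal prior and sensory precision, increasing prediction error
for obs in (0.0, 1.0, 2.0):
    print(obs, gaussian_update(0.0, 1.0, obs, 1.0))
```

Note that in this toy case the information gain depends on the squared weighted prediction error plus a term that reflects the gain in precision alone, so the two quantities are monotonically related rather than identical.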

      We have referenced previous work that has linked prediction error and information gain directly (p. 4): “The KL divergence between posterior and prior belief distributions has been previously considered to be a proxy of (precision-weighted) prediction errors [68,72].”

      We have taken the following steps to remedy this error of equating “prediction error” directly with the information gain.

      First, we have replaced “KL divergence” with “information gain” whenever possible throughout the manuscript for greater clarity. 

      Second, we have edited the section in the introduction defining information gain substantially (p. 4): 

      “Information gain can be operationalized within information theory as the Kullback-Leibler (KL) divergence between the posterior and prior belief distributions of a Bayesian observer, representing a formalized quantity that is used to update internal models [29,79,80]. Itti and Baldi (2005) [81] termed the KL divergence between posterior and prior belief distributions as “Bayesian surprise” and showed a link to the allocation of attention. The KL divergence between posterior and prior belief distributions has been previously considered to be a proxy of (precision-weighted) prediction errors [68,72]. According to Zénon’s hypothesis, if pupil dilation reflects information gain during the observation of an outcome event, such as feedback on decision accuracy, then pupil size will be expected to increase in proportion to how much novel sensory evidence is used to update current beliefs [29,63].”

      Finally, we have made several minor textual edits to the Abstract and main text wherever possible to further clarify the proposed relationship between prediction errors and information gain.

      (2) Operationalization of prediction errors based on frequency, accuracy, and their interaction:

      The authors also rely on a more model-agnostic definition of the prediction error in terms of stimulus frequency ("unsigned prediction error"), accuracy, and their interaction ("signed prediction error"). While I see the point here, I would argue that this approach offers a simple approximation to the prediction error, but it is possible that factors like difficulty and effort can influence the pupil signal at the same time, which the current approach does not take into account. I recommend computing prediction errors (defined in terms of the difference between outcome and expectation) based on a simple reinforcement-learning model and analyzing the data using a pupillometry regression model in which nuisance regressors are controlled, and results are corrected for multiple comparisons.

      We agree with the reviewer’s suggestion that alternatively modeling the data in a reinforcement learning paradigm would be fruitful. We adopted the ideal learner model as we were primarily focused on Information Theory, stemming from our aim to test Zénon’s hypothesis that information gain drives pupil dilation. However, we agree with the reviewer that it is worthwhile to pursue different modeling approaches in future work. We have now included a complementary linear mixed model analysis in which we controlled for the effects of the information-theoretic variables on one another, while also including the nuisance regressors of pre-feedback baseline pupil dilation and reaction times (explained in more detail below in our response to your point #4). Results including correction for multiple comparisons were reported for all pupil time course data as detailed in Methods section 2.5.

      (3) The link between model-based (KL divergence) and model-agnostic (frequency- and accuracy-based) prediction errors:

      I was expecting a validation analysis showing that KL divergence and model-agnostic prediction errors are correlated (in the behavioral data). This would be useful to validate the theoretical assumptions empirically.

      The model limitations and the operationalization of prediction error in terms of post-feedback processing do not seem to allow for a comparison of information gain and model-agnostic prediction errors in the behavioral data for the following reasons. First, the simple ideal learner model used here is not a generative model, and therefore, cannot replicate or simulate the participants’ responses (see also our response to your point #6 “model validation” below). Second, the behavioral dependent variables obtained are accuracy and reaction times, which both occur before feedback presentation. While accuracy and reaction times can serve as a marker of the participant’s (statistical) confidence/uncertainty following the decision interval, these behavioral measures cannot provide access to post-feedback information processing. The pupil dilation is of interest to us because the peripheral arousal system is able to provide a marker of post-feedback processing. Through the analysis presented in Figure 3, we indeed aimed to make the comparison of the model-based information gain to the model-agnostic prediction errors via the proxy variable of post-feedback pupil dilation instead of behavioral variables. To bridge the gap between the “behaviorally agnostic” model parameters and the actual performance of the participants, we examined the relationship between the model-based information gain and the post-feedback pupil dilation separately for error and correct trials as shown in Figure 3D-F & Figure 3J-L. We hope this addresses the reviewer’s concern and apologize in case we did not understand the reviewer’s suggestion here.

      (4) Model-based analyses of pupil data:

      I'm concerned about the authors' model-based analyses of the pupil data. The current approach is to simply compute a correlation for each model term separately (i.e., KL divergence, surprise, entropy). While the authors do show low correlations between these terms, single correlational analyses do not allow them to control for additional variables like outcome valence, prediction error (defined in terms of the difference between outcome and expectation), and additional nuisance variables like reaction time, as well as x and y coordinates of gaze.

      Moreover, including entropy and KL divergence in the same regression model could, at least within each task, provide some insights into whether the pupil response to KL divergence depends on entropy. This could be achieved by including an interaction term between KL divergence and entropy in the model.

      In line with the reviewer’s suggestions, we have included a complementary linear mixed model analysis in which we controlled for the effects of the information-theoretic variables on one another, while also including the nuisance regressors of pre-feedback baseline pupil dilation and reaction times. We compared the performance of two models on the post-feedback pupil dilation in each time window of interest: Model 1 had no interaction between information gain and entropy and Model 2 included an interaction term as suggested. We did not include the x- and y-coordinates of gaze in the mixed linear model analysis, as there are multiple values of these coordinates per trial. Furthermore, regressing out the x- and y-coordinates of gaze can potentially remove signal of interest in the pupil dilation data in addition to the gaze-related confounds and we did not measure absolute pupil size (Mathôt, Melmi & Castet, 2015; Hayes & Petrov, 2015). We present more sanity checks on the pre-processing pipeline as recommended by Reviewer 1.

      This new analysis resulted in several additions to the Methods (see Section 2.5) and Results. In sum, we found that including an interaction term for information gain and entropy did not lead to better model fits, but sometimes led to significantly worse fits. Overall, the results of the linear mixed model corroborated the “simple” correlation analysis across the pupil time course while accounting for the relationship to the pre-feedback baseline pupil and preceding reaction time differences. There was only one difference to note between the correlation and linear mixed modeling analyses: for the error trials in the cue-target 2AFC task, including entropy in the model accounted for the variance previously explained by surprise.
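
Schematically, the two models compared in this response could be written as follows. This is only a sketch: the random-effects structure is not stated here, so a random intercept per participant is assumed, and the symbols are illustrative. Here y is the post-feedback pupil response of participant i on trial j in a given time window, IG is information gain, S is surprise, H is entropy, B is the pre-feedback baseline pupil, and RT is reaction time. The actual specification is the one given in Methods section 2.5.

```latex
\begin{align*}
\text{Model 1:}\quad y_{ij} &= \beta_0 + \beta_1\,\mathrm{IG}_{ij} + \beta_2\,S_{ij} + \beta_3\,H_{ij}
  + \beta_4\,B_{ij} + \beta_5\,\mathrm{RT}_{ij} + u_i + \varepsilon_{ij} \\
\text{Model 2:}\quad y_{ij} &= \beta_0 + \beta_1\,\mathrm{IG}_{ij} + \beta_2\,S_{ij} + \beta_3\,H_{ij}
  + \beta_4\,B_{ij} + \beta_5\,\mathrm{RT}_{ij} + \beta_6\,(\mathrm{IG}_{ij} \times H_{ij}) + u_i + \varepsilon_{ij}
\end{align*}
```

Under this reading, the reviewer's question of whether the pupil response to information gain depends on entropy corresponds to whether the interaction coefficient in Model 2 improves the fit over Model 1.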

      (5) Major differences between experimental tasks:

      More generally, I'm not convinced that the authors' conclusion that the pupil response to KL divergence depends on entropy is sufficiently supported by the current design. The two tasks differ on different levels (stimuli, contingencies, when learning takes place), not just in terms of entropy. In my opinion, it would be necessary to rely on a common task with two conditions that differ primarily in terms of entropy while controlling for other potentially confounding factors. I'm afraid that seemingly minor task details can dramatically change pupil responses. The positive/negative difference in the correlation with KL divergence that the authors interpret to be driven by entropy may depend on another potentially confounding factor currently not controlled.

      We agree with the reviewer’s concerns and acknowledge that the speculation concerning the directional effect of entropy across trials cannot be fully substantiated by the current study. We note that Reviewer #1 had a similar concern. Our response to Reviewer #1 addresses this concern of Reviewer #3 as well. To better align the manuscript with the above-mentioned points, we have made several changes that are detailed in our response to Reviewer #1’s public review (above).

      (6) Model validation:

      My impression is that the ideal learner model should work well in this case. However, the authors don't directly compare model behavior to participant behavior ("posterior predictive checks") to validate the model. Therefore, it is currently unclear if the model-derived terms like KL divergence and entropy provide reasonable estimates for the participant data.

      Based on our understanding, posterior predictive checks are used to assess the goodness of fit between generated (or simulated) data and observed data. Given that the “simple” ideal learner model employed in the current study is not a generative model, a posterior predictive check would not apply here (Gelman, Carlin, Stern, Dunson, Vehtari, & Rubin, 2013). The ideal learner model is unable to simulate or replicate the participants’ responses and behaviors such as accuracy and reaction times; it simply computes the probability of seeing each stimulus type at each trial based on the prior distribution and the exact trial order of the stimuli presented to each participant. The model’s probabilities are computed directly from a Dirichlet distribution of values that represent the number of occurrences of each stimulus-pair type for each task. The information-theoretic variables are then directly computed from these probabilities using standard formulas. The exact formulas used in the ideal learner model can be found in section 2.4.
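
The kind of computation described in this response can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes a symmetric Dirichlet prior with one pseudo-count per event, probabilities given by normalized counts, entropy computed on the updated probabilities, and information gain computed as the KL divergence from the pre-trial to the post-trial probabilities. The function name `ideal_learner` and the trial sequence are hypothetical; the exact formulas are those in section 2.4 of the manuscript.

```python
import math


def ideal_learner(events, n_event_types, prior_count=1.0):
    """Track event probabilities with Dirichlet counts and return, per trial,
    surprise, entropy, and information gain (all in bits).

    Assumptions (not taken from the manuscript): symmetric Dirichlet prior,
    probabilities as normalized counts, information gain as
    KL(posterior || prior) over those probabilities.
    """
    counts = [prior_count] * n_event_types
    results = []
    for event in events:
        total = sum(counts)
        prior = [c / total for c in counts]        # beliefs before the trial
        surprise = -math.log2(prior[event])        # surprise of the observed event
        counts[event] += 1.0                       # count the observed event
        total = sum(counts)
        posterior = [c / total for c in counts]    # beliefs after the trial
        entropy = -sum(p * math.log2(p) for p in posterior)
        info_gain = sum(q * math.log2(q / p) for q, p in zip(posterior, prior))
        results.append({"surprise": surprise, "entropy": entropy, "info_gain": info_gain})
    return results


# Hypothetical trial sequence over four event types (indices 0..3)
for trial in ideal_learner([0, 0, 1, 0, 3], n_event_types=4):
    print(trial)
```

Because the sketch only accumulates counts from the presented trial order, it produces no simulated responses, which matches the authors' point that such a model cannot be compared directly with accuracy or reaction times.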

      We have now included a complementary linear mixed model analysis which also provides insight into the amount of explained variance of these information-theoretic predictors on the post-feedback pupil response, while also including the pre-feedback baseline pupil and reaction time differences (see section 3.3, Tables 3 & 4). The R<sup>2</sup> values ranged from 0.16 – 0.50 across all conditions tested.

      (7) Discussion:

      The authors interpret the directional effect of the pupil response w.r.t. KL divergence in terms of differences in entropy. However, I did not find a normative/computational explanation supporting this interpretation. Why should the pupil (or the central arousal system) respond differently to KL divergence depending on differences in entropy?

      The current suggestion (page 24) that might go in this direction is that pupil responses are driven by uncertainty (entropy) rather than learning (quoting O'Reilly et al. (2013)). However, this might be inconsistent with the authors' overarching perspective based on Zénon (2019) stating that pupil responses reflect updating, which seems to imply learning, in my opinion. To go beyond the suggestion that the relationship between KL divergence and pupil size "needs more context" than previously assumed, I would recommend a deeper discussion of the computational underpinnings of the result.

      Since we have removed the original speculative conclusion from the manuscript, we will refrain from discussing the computational underpinnings of a potential mechanism. As mentioned above, we note that preliminary data from our own lab contradict our original hypothesis about the relationship between entropy and information gain on the post-feedback pupil response.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Apart from the points raised in the public review above, I'd like to use the opportunity here to provide a more detailed review of potential issues, questions, and queries I have:

      (1) Constriction vs. Dilation Effects:

      The study observes a context-dependent relationship between KL divergence and pupil responses, where pupil dilation and constriction appear to exhibit opposing effects. However, this phenomenon raises a critical concern: Could the initial pupil constriction to visual stimuli (e.g., in the cue-target task) confound correlations with KL divergence? This potential confound warrants further clarification or control analyses to ensure that the observed effects genuinely reflect prediction error signals and are not merely a result of low-level stimulus-driven responses.

      We agree with the reviewer’s concern and have added the following information to the limitations section in the Discussion (changes in italics below; p. 32-33).

      “First, the two associative learning paradigms differed in many ways and were not directly comparable. For instance, the shape of the mean pupil response function differed across the two tasks in accordance with a visual or auditory feedback stimulus (compare Supplementary Figure 3A with Supplementary Figure 3D), and it is unclear whether these overall response differences contributed to any differences obtained between task conditions within each task. We are unable to rule out whether so-called “low level” effects such as the initial constriction to visual stimuli in the cue-target 2AFC task as compared with the dilation in response to auditory stimuli in the letter-color 2AFC task could confound correlations with information gain. Future work should strive to disentangle how the specific aspects of the associative learning paradigms relate to prediction errors in pupil dilation by systematically manipulating design elements within each task.”

      Here, I also was curious about Supplementary Figure 1, showing 'no difference' between the two tones (indicating 'error' or 'correct'). Was this the case for FDR-corrected or uncorrected cluster statistics? Especially since the main results also showed sig. differences only for uncorrected cluster statistics (Figure 2), but were n.s. for FDR corrected. I.e. can we be sure to rule out a confound of the tones here after all?

      As per the reviewer’s suggestion, we verified that there were also no significant clusters after feedback onset before applying the correction for multiple comparisons. We have added this information to Supplementary section 1.2 as follows:

      “Results showed that the auditory tone dilated pupils on average (Supplementary Figure 1C). Crucially, however, the two tones did not differ from one another in either of the time windows of interest (Supplementary Figure 1D; no significant time points after feedback onset were obtained either before or after correcting for multiple comparisons using cluster-based permutation methods; see Section 2.5).”

      Supplementary Figure 1 is showing effects cluster-corrected for multiple comparisons using cluster-based permutation tests from the MNE software package in Python (see Methods section 2.5). We have clarified that the cluster-correction was based on permutation testing in the figure legend. 

      (2) Participant-Specific Priors:

      The ideal learner models do not account for individualised priors, assuming homogeneous learning behaviour across participants. Could incorporating participant-specific priors better reflect variability in how individuals update their beliefs during associative learning?

      We have clarified in the Methods (see section 2.4) that the ideal learner models did account for participant-specific stimuli including participant-specific priors in the letter-color 2AFC task. We have added the following texts: 

      “We also note that while the ideal learner model for the cue-target 2AFC task used a uniform (flat) prior distribution for all participants, the model parameters were based on the participant-specific cue-target counterbalancing conditions and randomized trial order.” (p. 13)

      “The prior distributions used for the letter-color 2AFC task were estimated from the randomized letter-color pairs and randomized trial order presentation in the preceding odd-ball task; this resulted in participant-specific prior distributions for the ideal learner model of the letter-color 2AFC task. The model parameters were likewise estimated from the (participant-specific) randomized trial order presented in the letter-color 2AFC task.” (p. 13)

      (3) Trial-by-Trial Variability:

      The analysis does not account for random effects or inter-trial variability using mixed-effects models. Including such models could provide a more robust statistical framework and ensure the observed relationships are not influenced by unaccounted participant- or trial-specific factors.

      We have included a complementary linear mixed model analysis in which “subject” was modeled as a random effect on the post-feedback pupil response in each time window of interest and for each task. Across all trials, the results of the linear mixed model corroborated the “simple” correlation analysis across the pupil time course while accounting for the relationship to the pre-feedback baseline pupil and preceding reaction time differences (see section 3.3, Tables 3 & 4).

      (4) Preprocessing/Analysis choices:

      Before anything else, I'd like to highlight the authors' effort in providing public code (and data) in a very readable and detailed format!

      We appreciate the compliment - thank you for taking the time to look at the data and code provided.

      I found the idea of regressing the effect of Blinks/Saccades on the pupil trace intriguing. However, I miss a complete picture here to understand how well this actually worked, especially since it seems to be performed on already interpolated data. My main points here are:

      (4.1) Why is the deconvolution performed on already interpolated data and not on 'raw' data where there are actually peaks of information to fit?

      To our understanding, at least one critical reason for interpolating the data before proceeding with the deconvolution analysis is that the raw data contain many missing values (i.e., NaNs) due to the presence of blinks. Interpolating over the missing data first ensures that there are valid numerical elements in the linear algebra equations. We refer the reviewer to the methods detailed in Knapen et al. (2016) for more details on this pre-processing method. 
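
      The point about missing values can be illustrated with a small sketch that fills NaN runs before any regression is attempted. This is an assumption-laden illustration only: the helper name `interpolate_missing`, the use of linear interpolation, and the nearest-sample fill at the trace edges are not taken from the manuscript or from Knapen et al. (2016), whose pipeline should be consulted for the actual procedure.

```python
import math

def interpolate_missing(samples):
    """Linearly interpolate NaN runs; edge runs take the nearest valid sample."""
    out = list(samples)
    valid = [i for i, v in enumerate(samples) if not math.isnan(v)]
    if not valid:
        raise ValueError("no valid samples to interpolate from")
    for i, v in enumerate(samples):
        if not math.isnan(v):
            continue
        # Nearest valid neighbours on each side of the missing sample.
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:
            out[i] = samples[right]
        elif right is None:
            out[i] = samples[left]
        else:
            w = (i - left) / (right - left)
            out[i] = samples[left] + w * (samples[right] - samples[left])
    return out
```

      After this step every sample is a finite number, so a least-squares fit over the full time series is well defined.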

      (4.2) What is the model fit (e.g. R-squared)? If this was a poor fit for the regressors in the first place, can we trust the residuals (i.e. clean pupil trace)? Is it possible to plot the same Pupil trace of Figure 1D with a) the 'raw' pupil time-series, b) after interpolation only (both of course also mean-centered for comparison), on top of the residuals after deconvolution (already presented), so we can be sure that this is not driving the effects in a 'bad' way? I'd just like to make sure that this approach did not lead to artefacts in the residuals rather than removing them.

      We thank the reviewer for this suggestion. In the Supplementary Materials, we have included a new figure (Supplementary Figure 2, copied below for convenience), which illustrates the same conditions as in Figure 1D and Figure 2D, with 1) the raw data, and 2) the interpolated data before the nuisance regression. Both the raw data and interpolated data have been band-pass filtered as was done in the original pre-processing pipeline and converted to percent signal change. These figures can be compared directly to Figure 1D and Figure 2D, for the two tasks, respectively.

      Of note is that the raw data seem to be dominated by responses to blinks (and/or saccades). Crucially, the pattern of results remains overall unchanged between the interpolated-only and fully pre-processed version of the data for both tasks.

      In the Supplementary Materials (see Supplementary section 2), we have added the descriptives of the model fits from the deconvolution method. Model fits (R<sup>2</sup>) for the nuisance regression were generally low: cue-target 2AFC task, M = 0.03, SD = 0.02, range = [0.00, 0.07]; letter-color visual 2AFC, M = 0.08, SD = 0.04, range = [0.02, 0.16].

      Furthermore, a Pearson correlation analysis between the interpolated and fully pre-processed data within the time windows of interest for both tasks indicated high correspondence:

      Cue-target 2AFC task

      Early time window: M = 0.99, SD = 0.01, range = [0.955, 1.000]

      Late time window: M = 0.99, SD = 0.01, range = [0.971, 1.000]

      Letter-color visual 2AFC

      Early time window: M = 0.95, SD = 0.04, range = [0.803, 0.998]

      Late time window: M = 0.97, SD = 0.02, range = [0.908, 0.999]

      In hindsight, including the deconvolution (nuisance regression) method may not have changed the pattern of results much. However, the decision to include this deconvolution method was not data-driven; instead, it was based on the literature establishing the importance of removing variance (up to 5 s) of these blinks and saccades from cognitive effects of interest in pupil dilation (Knapen et al., 2016). 

      (4.3) Since this should also lead to predicted time series for the nuisance-regressors, can we see a similar effect (of what is reported for the pupil dilation) based on the blink/saccade traces of a) their predicted time series based on the deconvolution, which could indicate a problem with the interpretation of the pupil dilation effects, and b) the 'raw' blink/saccade events from the eye-tracker? I understand that this is a very exhaustive analysis so I would actually just be interested here in an averaged time-course / blink&saccade frequency of the same time-window in Figure 1D to complement the PD analysis as a sanity check.

      Also included in the Supplementary Figure 2 is the data averaged as in Figure 1D and Figure 2D for the raw data and nuisance-predictor time courses (please refer to the bottom row of the sub-plots). No pattern was observed in either the raw data or the nuisance predictors as was shown in the residual time courses. 

      (4.4) How many samples were removed from the time series due to blinks/saccades in the first place? 150ms for both events in both directions is quite a long bit of time so I wonder how much 'original' information of the pupil was actually left in the time windows of interest that were used for subsequent interpretations.

      We thank the reviewer for bringing this issue to our attention. The size of the interpolation window was based on previous literature, indicating a range of 100-200 ms as acceptable (Urai et al., 2017; Knapen et al., 2016; Winn et al., 2018). The ratio of interpolated-to-original data (across the entire trial) varied greatly between participants and between trials: cue-target 2AFC task, M = 0.262, SD = 0.242, range = [0,1]; letter-color 2AFC task, M = 0.194, SD = 0.199, range = [0,1]. 

      We have now included a conservative analysis in which only trials in which more than half of the data were original (threshold = 60%) are included in the analyses. Crucially, we still observe the same pattern of effects as when all data are considered across both tasks (compare the second to last row in the Supplementary Figure 2 to Figure 1D and Figure 2D).
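
      A trial-exclusion rule of this kind could be sketched as follows. The helper names, the representation of each trial as a boolean mask marking interpolated samples, and the reading of the 60% threshold as a minimum share of original samples are assumptions for illustration, not details confirmed by the manuscript.

```python
def original_fraction(is_interpolated):
    """Share of samples in one trial that were NOT interpolated."""
    n_original = sum(1 for flag in is_interpolated if not flag)
    return n_original / len(is_interpolated)

def keep_trials(trial_masks, min_original=0.6):
    """Indices of trials whose share of original samples meets the threshold."""
    return [i for i, mask in enumerate(trial_masks)
            if original_fraction(mask) >= min_original]
```

      For example, a trial in which half of the samples were interpolated would be dropped under a 60% threshold, whereas a trial with no interpolation would be kept.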

      (4.5) Was the baseline correction performed on the percentage change unit?

      Yes, the baseline correction was performed on the pupil time series after converting to percent signal change. We have added that information to the Methods (section 2.3).

      (4.6) What metric was used to define events in the derivative as 'peaks'? I assume some sort of threshold? How was this chosen?

      The threshold was chosen in a data-driven manner and was kept consistent across both tasks. The following details have been added to the Methods:

      “The size of the interpolation window preceding nuisance events was based on previous literature [13,39,99]. After interpolation based on data-markers and/or missing values, remaining blinks and saccades were estimated by testing the first derivative of the pupil dilation time series against a threshold rate of change. The threshold for identifying peaks in the temporal derivative is data-driven, partially based on past work [10,14,33]. The output of each participant’s pre-processing pipeline was checked visually. Once an appropriate threshold was established at the group level, it remained the same for all participants (minimum peak height of 10 units).” (p. 8 & 11).
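
      The derivative-threshold idea quoted above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `detect_derivative_peaks`, the use of the absolute sample-to-sample difference as the derivative, and the local-maximum rule are hypothetical choices, and the meaning of "10 units" depends on the authors' signal scaling, which is not specified here.

```python
def detect_derivative_peaks(pupil, threshold=10.0):
    """Indices where the absolute first difference is a local maximum >= threshold."""
    # First difference as a simple stand-in for the temporal derivative.
    deriv = [abs(b - a) for a, b in zip(pupil, pupil[1:])]
    peaks = []
    for i in range(1, len(deriv) - 1):
        is_local_max = deriv[i] > deriv[i - 1] and deriv[i] >= deriv[i + 1]
        if is_local_max and deriv[i] >= threshold:
            peaks.append(i)
    return peaks
```

      Samples around each returned index would then be candidates for the same interpolation applied to eye-tracker-marked blinks and saccades.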

      (5) Multicollinearity Between Variables:

      Lastly, the authors state on page 13: "Furthermore, it is expected that these explanatory variables will be correlated with one another. For this reason, we did not adopt a multiple regression approach to test the relationship between the information-theoretic variables and pupil response in a single model". However, the very purpose of multiple regression is to account for and disentangle the contributions of correlated predictors, no? I might have missed something here.

      We apologize for the ambiguity of our explanation in the Methods section. We originally sought to assess the overall relationship between the post-feedback response and information gain (primarily), but also surprise and entropy. Our reasoning was that these variables are often investigated in isolation across different experiments (i.e., only investigating Shannon surprise), and we would like to know what the pattern of results would look like when comparing a single information-theoretic variable to the pupil response (one-by-one). We assumed that including additional explanatory variables (that we expected to show some degree of collinearity with each other) in a regression model would affect variance attributed to them as compared with the one-on-one relationships observed with the pupil response (Morrissey & Ruxton 2018). We also acknowledge the value of a multiple regression approach on our data. Based on the suggestions by the reviewers we have included a complementary linear mixed model analysis in which we controlled for the effects of the information-theoretic variables on one another, while also including the nuisance regressors of pre-feedback baseline pupil dilation and reaction times.  

      This new analysis resulted in several additions to the Methods (see Section 2.5) and Results (see Tables 3 and 4). Overall, the results of the linear mixed model corroborated the “simple” correlation analysis across the pupil time course while accounting for the relationship to the pre-feedback baseline pupil and preceding reaction time differences. There was only one difference to note between the correlation and linear mixed modeling analyses: for the error trials in the cue-target 2AFC task, including entropy in the model accounted for the variance previously explained by surprise.

      Reviewer #2 (Recommendations for the authors):

      (1) Given the inherent temporal dependencies in pupil dynamics, characterising later pupil responses as independent of earlier ones in a three-way repeated measures ANOVA may not be appropriate. A more suitable approach might involve incorporating the earlier pupil response as a covariate in the model.

      We thank the reviewer for bringing this issue to our attention. From our understanding, a repeated-measures ANOVA with factor “time window” would be appropriate in the current context for the following reasons. First, autocorrelation (closely tied to sphericity) is generally not considered a problem when only two timepoints are compared from time series data (Field, 2013; Tabachnick & Fidell, 2019). Second, the repeated-measures component of the ANOVA takes the correlated variance between time points into account in the statistical inference. Finally, as a complementary analysis, we present the results testing the interaction between the frequency and accuracy conditions across the full time courses (see Figures 1D and 2D); in these pupil time courses, any difference between the early and late time windows can be judged by the reader visually and qualitatively. 

      (2) Please clarify the correlations between KL divergence, surprise, entropy, and pupil response time series. Specifically, state whether these correlations account for the interrelationships between these information-theoretic measures. Given their strong correlations, partialing out these effects is crucial for accurate interpretation.

      As mentioned above, based on the suggestions by the reviewers we have included a complementary linear mixed model analysis in which we controlled for the effects of the information-theoretic variables on one another, while also including the nuisance regressors of pre-feedback baseline pupil dilation and reaction times.  

      This new analysis resulted in several additions to the Methods (see Section 2.5) and Results (see Tables 3 and 4). Overall, the results of the linear mixed model corroborated the “simple” correlation analysis across the pupil time course while accounting for the relationship to the pre-feedback baseline pupil and preceding reaction time differences. There was only one difference to note between the correlation and linear mixed modeling analyses: for the error trials in the cue-target 2AFC task, including entropy in the model accounted for the variance previously explained by surprise.

      (3) The effects observed in the late time windows appear weak (e.g., Figure 2E vs. 2F, and the generally low correlation coefficients in Figure 3). Please elaborate on the reliability and potential implications of these findings.

      We have now included a complementary linear mixed model analysis which also provides insight into the amount of explained variance of these information-theoretic predictors on the post-feedback pupil response, while also including the pre-feedback baseline pupil and reaction time differences (see section 3.3, Tables 3 & 4). The R<sup>2</sup> values ranged from 0.16 – 0.50 across all conditions tested. Including the pre-feedback baseline pupil dilation as a predictor in the linear mixed model analysis consistently led to more explained variance in the post-feedback pupil response, as expected.  

      (4) In Figure 3 (C-J), please clarify how the trial-by-trial correlations were computed (averaged across trials or subjects). Also, specify how the standard error of the mean (SEM) was calculated (using the number of participants or trials).

      The trial-by-trial correlations between the pupil signal and model parameters were computed for each participant, then the coefficients were averaged across participants for statistical inference. We have added several clarifications in the text (see section 2.5 and legends of Figure 3 and Supplementary Figure 4).
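
      The two-level procedure described here (correlate within each participant, then summarize across participants) can be sketched as follows for a single time point or time window. The function names and the data layout, one pair of equal-length trial vectors per participant, are assumptions for illustration; the authors' code may organize the data differently.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def group_mean_and_sem(per_participant):
    """per_participant: list of (pupil_values, model_values), one pair per participant.

    Returns the mean correlation coefficient and its standard error
    computed across participants (not across trials).
    """
    rs = [pearson_r(pupil, model) for pupil, model in per_participant]
    n = len(rs)
    mean_r = sum(rs) / n
    sd = math.sqrt(sum((r - mean_r) ** 2 for r in rs) / (n - 1))  # sample SD
    return mean_r, sd / math.sqrt(n)
```

      In this layout the number of participants, rather than the number of trials, sets the denominator of the standard error.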

      We have added “the standard error of the mean across participants” to all figure labels.

      (5) For all time axes (e.g., Figure 2D), please label the ticks at 0, 0.5, 1, 1.5, 2, 2.5, and 3 seconds. Clearly indicate the duration of the feedback on the time axes. This is particularly important for interpreting the pupil dilation responses evoked by auditory feedback.

      We have labeled the x-ticks every 0.5 seconds in all figures and indicated the duration of the auditory feedback in the letter-color decision task as well as the stimuli presented in the control tasks in the Supplementary Materials.

      Reviewer #3 (Recommendations for the authors):

      (1) Introduction page 3: "In information theory, information gain quantifies the reduction of uncertainty about a random variable given the knowledge of another variable. In other words, information gain measures how much knowing about one variable improves the prediction or understanding of another variable."

      (2) In my opinion, the description of information gain can be clarified. Currently, it is not very concrete and quite abstract. I would recommend explaining it in the context of belief updating.

      We have removed these unclear statements in the Introduction. We now clearly state the following:

      “Information gain can be operationalized within information theory as the Kullback-Leibler (KL) divergence between the posterior and prior belief distributions of a Bayesian observer, representing a formalized quantity that is used to update internal models [29,79,80].” (p. 4)

      (3) Page 4: The inconsistencies across studies are described in extreme detail. I recommend shortening this part and summarizing the inconsistencies instead of listing all of the findings separately.

      As per the reviewer’s recommendation, we have shortened this part of the introduction to summarize the inconsistencies in a more concise manner as follows: 

      “Previous studies have shown different temporal response dynamics of prediction error signals in pupil dilation following feedback on decision outcome: While some studies suggest that the prediction error signals arise around the peak (~1 s) of the canonical impulse response function of the pupil [11,30,41,61,62,90], other studies have shown evidence that prediction error signals (also) arise considerably later with respect to feedback on choice outcome [10,25,32,41,62]. A relatively slower prediction error signal following feedback presentation may suggest deeper cognitive processing, increased cognitive load from sustained attention or ongoing uncertainty, or that the brain is integrating multiple sources of information before updating its internal model. Taken together, the literature on prediction error signals in pupil dilation following feedback on decision outcome does not converge to produce a consistent temporal signature.” (p. 5)

      We would like to note some additional minor corrections to the preprint:

      We have clarified the direction of the effect in Supplementary Figure 3 with the following: 

      “Participants who showed a larger mean difference between the 80% as compared with the 20% frequency conditions in accuracy also showed smaller differences (a larger mean difference in magnitude in the negative direction) in pupil responses between frequency conditions (see Supplementary Figure 4).”

      The y-axis labels in Supplementary Figure 3 were incorrect and have been corrected as the following: “Pupil responses (80-20%)”.

      We corrected typos, formatting and grammatical mistakes when discovered during the revision process. Some minor changes were made to improve clarity. Of course, we include a version of the manuscript with Tracked Changes as instructed for consideration.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      In this manuscript, Pagano and colleagues test the idea that the protein GMCL1 functions as a substrate receptor for a Cullin RING 3 E3 ubiquitin ligase (CUL3) complex. Using a pulldown approach, they identify GMCL1 binding proteins, including the DNA damage scaffolding protein 53BP1. They then focus on the idea that GMCL1 recruits 53BP1 for CUL3-dependent ubiquitination, triggering subsequent proteasomal degradation of ubiquitinated 53BP1.

      In addition to its DNA damage signalling function, in mitosis, 53BP1 is reported to form a stopwatch complex with the deubiquitinating enzyme USP28 and the transcription factor p53 (PMID: 38547292). These 53BP1-stopwatch complexes generated in mitosis are inherited by G1 daughter cells and help promote p53-dependent cell cycle arrest independent from DNA damage (PMID: 38547292). Several studies show that knockout of 53BP1 overcomes G1 cell cycle arrest after mitotic delays caused by anti-mitotic drugs or centrosome ablation (PMID: 27432897, 27432896). In this model, it is crucial that 53BP1 remains stable in mitosis and more stopwatch complex is formed after delayed mitosis.

      Major concerns:

      Pagano and coworkers suggest that 53BP1 levels can sometimes be suppressed in mitosis if the cells overexpress GMCL1. They carry out a bioinformatic analysis of available public data for p53 wild-type cancer cell lines resistant to the anti-mitotic drug paclitaxel and related compounds. Stratifying GMCL1 into low and high expression groups reveals a weak (p = 0.05 or ns) correlation with sensitivity to taxanes. It is unclear on what basis the authors claim paclitaxel-resistant and p53 wild-type cancer cell lines bypass the mitotic surveillance/timer pathway. They have not tested this. Figure 3 is a correlation assembled from public databases but has no experimental tests. Figure 4 looks at proliferation but not cell cycle progression or the length of mitosis. The main conclusions relating to cell cycle progression and specifically the link to mitotic delays are therefore not supported by experimental data. There is no imaging of the cell cycle or cell fate after mitotic delays, or analysis of where the cells arrest in the cell cycle. Most of the cell lines used have been reported to lack a functional mitotic surveillance pathway in the recent work by Meitinger. To support these conclusions, the stability of endogenous 53BP1 under different conditions in cells known to have a functional mitotic surveillance pathway needs to be examined. A key suggestion in the work is that the level of GMCL1 expression correlates with resistance to taxanes. For the mitotic surveillance pathway, the type of drug (nocodazole, taxol, etc) used to induce a delay isn't thought to be relevant, only the length of the delay. Do GMCL1-overexpressing cells show resistance to anti-mitotics in general?

      We thank the reviewer for this insightful comment. We propose that GMCL1 promotes CUL3-dependent ubiquitination of 53BP1 during prolonged mitotic arrest, thereby facilitating its proteasome-dependent degradation. To evaluate the potential clinical relevance of this mechanism, we stratified cancer cell lines based on GMCL1 mRNA expression using publicly available datasets from DepMap (PMID: 39468210). We observed correlations between GMCL1 expression levels and taxane sensitivity that appear to reflect specific cancer type-drug combinations. To experimentally evaluate this correlation and obtain mechanistic insights, we performed knockdown experiments in hTERT-RPE1 cells, which are known to possess an intact mitotic surveillance pathway. Silencing of GMCL1 alone inhibited cell proliferation and induced apoptosis, while co-depletion of either TP53BP1 or USP28 significantly rescued these effects. These results suggest that GMCL1 modulates the stability of 53BP1 and therefore the availability of the 53BP1-USP28-p53 ternary complex in cells with a functional mitotic surveillance pathway (MSP) (new Figure 5I,J) directly linking GMCL1 to the regulation of the MSP complex. Moreover, to further support our mechanism, we assessed the effect of GMCL1 levels on cell cycle progression. Briefly, following nocodazole synchronization and release, we treated cells with EdU and performed FACS analyses at different times. Knockdown of GMCL1 alone led to a delayed cell cycle progression, but co-depletion of either TP53BP1 or USP28 restored this phenotype (new Figure 3A and new Supplementary Figure 3A-C). These results are consistent with our proliferation data and suggest that the observed effects of GMCL1 are specific to mitotic exit. Finally, overexpression of GMCL1 accelerates cell cycle progression (as assessed by FACS analyses) upon release from prolonged mitotic arrest (new Figure 3B and new Supplementary Figure 3D-E). 

      Importantly, if GMCL1 specifically degrades 53BP1 during prolonged mitotic arrests, the authors should show what happens during normal cell divisions without any delays or drug treatments. How much 53BP1 is destroyed in mitosis under those conditions? Does 53BP1 destruction depend on the length of mitosis, drug treatment, or does 53BP1 get degraded every mitosis regardless of length? Testing the contribution of key mitotic E3 ligase activities on mitotic 53BP1 stability, such as the anaphase-promoting complex/cyclosome (APC/C) is important in this regard. One previous study reported an analysis of putative APC/C KEN-box degron motifs in 53BP1 and concluded these play a role in 53BP1 stability in anaphase (PMID: 28228263).

      Physiological mitosis under unperturbed conditions is typically brief (approximately 30 minutes), making protein quantification during this window challenging. Despite this, we attempted to assess GMCL1 dynamics during normal mitosis by synchronizing cells using RO-3306 and releasing them into drug-free medium. Under these conditions, GMCL1 expression was similar to that in asynchronous cells and higher than the levels upon extended mitosis. However, when we attempted to measure the half-life of proteins using cycloheximide, most cells died, likely due to the toxic effect of cycloheximide in cells subjected to co-treatment with RO-3306 or nocodazole. This is the same reason why, in Figure 2C, we assessed 53BP1 in daughter cells rather than mitotic cells.

      There is no direct test of the proposed mechanism, and it is therefore unclear if 53BP1 is ubiquitinated by a GMCL1-CUL3 ligase in cells, and how efficient this process would be at different cell cycle stages. A key issue is the lack of experimental data explaining why the proposed mechanism would be restricted to mitosis. Indirect effects, such as loss of 53BP1 from the chromatin fraction during M phase upon GMCL1 overexpression, do not necessarily mean that 53BP1 is degraded. PLK1-dependent chromatin-cytoplasmic shuttling of 53BP1 during mitotic delays has been described previously (PMID: 38547292, 37888778). These papers are cited in the text, but the main conclusions of those papers on 53BP1 incorporation into a stopwatch complex during mitotic delays have been ignored. Are the authors sure that 53BP1 is destroyed in mitosis and not simply re-localised between chromatin and non-chromatin fractions? At the very least, these reported findings should be discussed in the text.

      To examine whether GMCL1 promotes 53BP1 ubiquitination in cells, we expressed Trypsin-Resistant Tandem Ubiquitin-Binding Entity (TR-TUBE), a protein that binds polyubiquitin chains. Abundant, endogenous ubiquitinated 53BP1 co-precipitated with TR-TUBE constructs only when wild-type GMCL1, but not the E142K GMCL1 mutant, was expressed (new Figure 2D). The PLK1-dependent incorporation of 53BP1 into the stopwatch complex and the chromatin-cytoplasmic shuttling of 53BP1 during mitotic delays are now discussed in the text. That said, compared to parental cells, 53BP1 levels in the chromatin fraction are high in two different GMCL1 KO clones in M phase-arrested cells (Figure 2A-B). This increase does not correspond to a decrease in the 53BP1 soluble fraction (Figure 2A and new Supplementary Figure 2D), suggesting that the lower chromatin-bound 53BP1 in parental cells is not due to re-localization. The increased half-life of 53BP1 in daughter cells (Figure 2C) also supports this hypothesis.

      The authors use a variety of cancer cell line models throughout their study, most of which have been reported to lack a functional mitotic surveillance pathway. U2OS and HCT116 cells do not respond normally to mitotic delays, despite being annotated as p53 WT. Other studies have used p53 wild-type hTERT RPE-1 cells to study the mitotic surveillance pathway. If the model is correct, then over-expressing GMCL1 in hTERT-RPE1 cells should suppress cell cycle arrest after mitotic delays, and GMCL1 KO should make the cells more sensitive to delays. These experiments are needed to provide an adequate test of the proposed model.

      We greatly appreciate the reviewer’s suggestion regarding overexpression of GMCL1 in hTERT-RPE1 cells. To address this, we generated stable RPE1 cells expressing V5-tagged GMCL1 and conducted EdU incorporation assays following nocodazole synchronization and release. Overexpression of GMCL1 enhanced cell cycle progression compared to control cells (new Figure 3B and new Supplementary Figure 3D-E) after mitotic arrest, consistent with our model. We, therefore, propose that GMCL1 controls 53BP1 stability to suppress p53-dependent cell cycle arrest.

      We also want to point out that while some papers suggest that HCT116 and U2OS cells do not have an intact mitotic surveillance pathway, others have shown that the MSP is indeed functioning in HCT116 cells and can be triggered with variable efficiency in U2OS cells (PMID: 38547292). This is likely due to high heterogeneity and extensive clonal diversity of cancer cell lines grown in different labs. Please see examples in PMIDs: 3620713, 30089904, and 30778230. In particular, PMID: 30089904 shows that this heterogeneity correlates with considerably different drug responses. 

      To conclude, while the authors propose a potentially interesting model on how GMCL1 overexpression could regulate 53BP1 stability to limit p53-dependent cell cycle arrest, it is unclear what triggers this pathway or when it is relevant. 53BP1 is known to function in DNA damage signalling, and GMCL1 might be relevant in that context. The manuscript contains the initial description of GMCL1-53BP1 interaction but lacks a proper analysis of the function of this interaction and is therefore a preliminary report.

      We hope that the new experiments, along with the clarifications provided in this response letter and revised manuscript, offer the reviewer increased confidence in the robustness and validity of our proposed model.

      Reviewer #2 (Public review):

      This study investigates the role of GMCL1 in regulating the mitotic surveillance pathway (MSP), a protective mechanism that activates p53 following prolonged mitosis. The authors identify a physical interaction between 53BP1 and GMCL1, but not with GMCL2. They propose that the ubiquitin ligase complex CRL3-GMCL1 targets 53BP1 for degradation during mitosis, thereby preventing the formation of the "mitotic stopwatch" complex (53BP1-USP28-p53) and subsequent p53 activation. The authors show that high GMCL1 expression correlates with resistance to paclitaxel in cancer cell lines that express wild-type p53. Importantly, loss of GMCL1 restores paclitaxel sensitivity in these cells, but not in p53-deficient lines. They propose that GMCL1 overexpression enables cancer cells to bypass MSP-mediated p53 activation, promoting survival despite mitotic stress. Targeting GMCL1 may thus represent a therapeutic strategy to re-sensitize resistant tumors to taxane-based chemotherapy.

      Strengths:

      This manuscript presents potentially interesting observations. The major strength of this article is the identification of GMCL1 as a 53BP1 interaction partner. The authors identified relevant domains and showed that GMCL1 controls 53BP1 stability. The authors further show a potentially interesting link between GMCL1 status and sensitivity to Taxol.

      Weaknesses:

      However, the manuscript is significantly weakened by unsubstantiated mechanistic claims, overreliance on a non-functional model system (U2OS), and overinterpretation of correlative data. To support the conclusions of the manuscript, the authors must show that the GMCL1-dependent sensitivity to Taxol depends on the mitotic surveillance pathway.

      To demonstrate that GMCL1-dependent taxane sensitivity is mediated through the mitotic surveillance pathway (MSP), we have now performed experiments using hTERT-RPE1 (RPE1) cells, a widely used, non-transformed cell line known to possess a functional MSP. We compared RPE1 cells with knockdown of GMCL1 alone to those with simultaneous knockdown of GMCL1 and either TP53BP1 or USP28. Upon paclitaxel (Taxol) treatment, cells with GMCL1 knockdown exhibited suppressed proliferation and increased apoptosis. Notably, these phenotypes were rescued by co-depletion of TP53BP1 or USP28 (new Figure 5I,J). These results support the notion that GMCL1 contributes to MSP activity, at least in part, through its regulation of 53BP1.

      To further strengthen our mechanistic experiments, we assessed the effect of GMCL1 levels on cell cycle progression. Following nocodazole synchronization and release, we treated cells with EdU and performed FACS analyses at different times. Knockdown of GMCL1 alone led to a delay in cell cycle progression, but co-depletion of either TP53BP1 or USP28 alleviated this phenotype (new Figure 3A and new Supplementary Figure 3A, B). These results are consistent with our proliferation data.

      Reviewer #3 (Public review):

      Summary:

      In this study, Kito et al follow up on previous work that identified Drosophila GCL as a mitotic substrate recognition subunit of a CUL3-RING ubiquitin ligase (CRL3) complex.

      Here they characterize mutants of the human ortholog of GCL, GMCL1, that disrupt the interaction with CUL3 (GMCL1E142K) and that lack the substrate interaction domain (GMCL1 BBO). Immunoprecipitation followed by mass spectrometry identified 9 proteins that interacted with wild-type FLAG-GMCL1 and GMCL1 EK but not GMCL1 BBO. These proteins included 53BP1, which plays a well-characterized role in double-strand break repair but also functions in a USP28-p53-53BP1 "mitotic stopwatch" complex that arrests the cell cycle after a substantially prolonged mitosis. Consistent with the IP-MS results, FLAG-GMCL1 immunoprecipitated 53BP1. Depletion of GMCL1 during mitotic arrest increased protein levels of 53BP1, and this could be rescued by wild-type GMCL1 but not the E142K mutant or a R433A mutant that failed to immunoprecipitate 53BP1.

      Using a publicly available dataset, the authors identified a relatively small subset of cell lines with high levels of GMCL1 mRNA that were resistant to the taxanes paclitaxel, cabazitaxel, and docetaxel. This type of analysis is confounded by the fact that paclitaxel and other microtubule poisons accumulate to substantially different levels in various cell lines (DOI: 10.1073/pnas.90.20.9552 , DOI: 10.1091/mbc.10.4.947 ), so careful follow-up experiments are required to validate results. The correlation between increased GMCL1 mRNA and taxane resistance was not observed in lung cancer cell lines. The authors propose this was because nearly half of lung cancers harbor p53 mutations, and lung cancer cell lines with wild-type but not mutant p53 showed the correlation between increased GMCL1 mRNA and taxane resistance. However, the other cancer cell types in which they report increased GMCL1 expression correlates with taxane sensitivity also have high rates of p53 mutation. Furthermore, p53 status does not predict taxane response in patients (DOI: 10.1002/1097-0142(20000815)89:4<769::aid-cncr8>3.0.co;2-6 , DOI: 10.1002/(SICI)1097-0142(19960915)78:6<1203::AID-CNCR6>3.0.CO;2-A , PMID: 10955790).

      The authors then depleted GMCL1 and reported that it increased apoptosis in two cell lines with wild-type p53 (MCF7 and U2OS) due to activation of the mitotic stopwatch. This is surprising because the mitotic stopwatch paper they cite (DOI: 10.1126/science.add9528 ) reported that U2OS cells have an inactive stopwatch and that activation of the stopwatch results in cell cycle arrest rather than apoptosis in most cell types, including MCF7. Beyond this, it has recently been shown that the level of taxanes and other microtubule poisons achieved in patient tumors is too low to induce mitotic arrest (DOI: 10.1126/scitranslmed.3007965 , DOI: 10.1126/scitranslmed.abd4811 , DOI: 10.1371/journal.pbio.3002339 ), raising concerns about the relevance of prolonged mitosis to paclitaxel response in cancer. The findings here demonstrating that GMCL1 mediates degradation of 53BP1 during mitotic arrest are solid and of interest to cell biologists, but it is unclear that these findings are relevant to paclitaxel response in patients.

      Strengths:

      This study identified 53BP1 as a target of CRL3GMCL1-mediated degradation during mitotic arrest. AlphaFold3 predictions of the binding interface, followed by mutational analysis, identified mutants of each protein (GMCL1 R433A and 53BP1 IEDI1422-1425AAAA) that disrupted their interaction. Knock-in of a FLAG tag into the C-terminus of GMCL1 in HCT116 cells, followed by FLAG immunoprecipitation, confirmed that endogenous GMCL1 interacts with endogenous CUL3 and 53BP1 during mitotic arrest.

      Weaknesses:

      The clinical relevance of the study is overinterpreted. The authors have not taken relevant data about the clinical mechanism of taxanes into account. Supraphysiologic doses of microtubule poisons cause mitotic arrest and can activate the mitotic stopwatch. However, in physiologic concentrations of clinically useful microtubule poisons, cells proceed through mitosis and divide their chromosomes on mitotic spindles that are at least transiently multipolar. Though these low concentrations may result in a brief mitotic delay, it is substantially shorter than the arrest caused by high concentrations of microtubule poisons, and the one mimicked here by 16 hours of 0.4 mg/mL nocodazole, which is not used clinically and does not induce multipolar spindles. Resistance to mitotic arrest occurs through different mechanisms than resistance to multipolar spindles. No evidence is presented in the current version of the manuscript that GMCL1 affects cellular response to clinically relevant doses of paclitaxel.

      We agree that it would be an overstatement to claim that GMCL1 and p53 regulate paclitaxel sensitivity in cancer patients in a clinical context. The correlations we observed were based on publicly available cancer cell lines from datasets catalogued in CCLE and DepMap, which do not fully account for clinical heterogeneity and patient-specific factors. In response to this important point, we have revised the text accordingly.

      In the experiments shown in former Figure 4A-H (now Figure 5A-H) and in those shown in the new Figure 5I-J, we used 100 nM paclitaxel to test the hypothesis that low GMCL1 levels sensitize cancer cells in a p53-dependent manner. Here, paclitaxel was chosen to mimic the conditions reported in the PRISM dataset (PMID: 32613204), which compiles the proliferation inhibitory activity of 4,518 compounds tested across 578 cancer cell lines. Consistent with our cell cycle findings, the paclitaxel sensitivity caused by GMCL1 depletion was reverted by silencing 53BP1 or USP28 (new Figure 5I-J), again supporting the involvement of the stopwatch complex. We are unsure about how to model the “physiologic concentrations of clinically useful microtubule poisons” in cell-based studies. A recent review notes that “The time above a threshold paclitaxel plasma concentration (0.05 μmol/L) is important for the efficacy and toxicity of the drug” (PMID: 28612269). Two other reviews mention that the clinically relevant concentration of paclitaxel is considered to be plasma levels between 0.05–0.1 μmol/L (approximately 50–100 nM) and that in clinical dosing, typical patient plasma concentrations after paclitaxel infusion range from 80–280 nM, with corresponding intratumoral concentrations between 1.1–9.0 μM, due to drug accumulation in tumor tissue (PMIDs: 24670687 and 29703818). We have now emphasized in the revised text the rationale for using 100 nM paclitaxel in our experiments.
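
      The unit arithmetic behind these ranges can be checked with a short sketch. The helper names `umol_per_l_to_nm` and `within_range` are assumptions introduced purely for illustration; the numbers are the plasma and intratumoral ranges quoted from the cited reviews, and the 100 nM dose is the one used in the experiments described above.

```python
# Illustrative unit check only; helper names are hypothetical.

def umol_per_l_to_nm(conc_umol_per_l: float) -> float:
    """Convert a concentration from umol/L (uM) to nmol/L (nM)."""
    return conc_umol_per_l * 1000.0


def within_range(value: float, low: float, high: float) -> bool:
    """Return True if value lies in the closed interval [low, high]."""
    return low <= value <= high


# Ranges quoted in the response, expressed in nM.
plasma_relevant_nm = (umol_per_l_to_nm(0.05), umol_per_l_to_nm(0.1))  # ~50-100 nM
plasma_post_infusion_nm = (80.0, 280.0)
intratumoral_nm = (umol_per_l_to_nm(1.1), umol_per_l_to_nm(9.0))      # ~1100-9000 nM

experimental_dose_nm = 100.0  # paclitaxel dose used in the cell-based assays

print(within_range(experimental_dose_nm, *plasma_relevant_nm))
print(within_range(experimental_dose_nm, *plasma_post_infusion_nm))
print(within_range(experimental_dose_nm, *intratumoral_nm))
```

      Under these assumptions, the 100 nM dose sits at the upper edge of the quoted clinically relevant plasma range and inside the post-infusion plasma range, while remaining below the quoted intratumoral range.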

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      General comments on the Figures:

      (1) Western blots lack molecular weight markers on most panels and are often over-exposed and over-contrasted, rendering them hard to interpret.

      We have now included molecular weight markers in all Western blot panels. We have also reprocessed the images to avoid overexposure and excessive contrast, ensuring that the bands are clearly visible and interpretable.

      (2) Input and IP samples do not show percentage loading, so it is hard to interpret relative enrichments.

      In the revised figures, we have indicated what % of the input was loaded.

      (3) The authors change between cell line models for their experiments, and this is not clear in the figures. These are important details for interpreting the data, as many of the cell lines used are not functional for the mitotic surveillance pathway.

      In the revised manuscript, we have clearly indicated the specific cell lines used in each experiment in the figure legends. Additionally, to address concerns regarding the mitotic surveillance pathway, we have included new experiments using hTERT-RPE1 cells, which have been reported to possess a functional mitotic surveillance pathway (MSP) (Figure 4I-J).

      (4) No n-numbers are provided in the figure legends. Are the Western blots provided done once, or are they reproducible? Many of the blots would benefit from quantification and presentation via graphs to test for reproducible changes to 53BP1 levels under the different conditions.

      As now indicated in the methods section, we have conducted each Western blot no less than three times, yielding results that exhibit a high degree of reproducibility. A representative Western blot has been selected for each figure. We did not include densitometric quantification of immunoblots, given that the semi-quantitative nature of this technique would lead to an overinterpretation of our data; unfortunately, this is a limitation of the technique. In fact, eLife and other similar scientific journals do not adhere to the practice of quantifying Western blots. One exception to this norm is protein half-life studies, which are quantified to measure the kinetics of decay rates and their internal comparisons. Accordingly, the experiments in Figure 2C were quantified.
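
      As a generic illustration of how decay kinetics from a cycloheximide chase can be quantified, the sketch below estimates a half-life by log-linear least squares, assuming first-order decay. This is not a description of the quantification actually used for Figure 2C; the function name `half_life_from_chase` is hypothetical, and the example signal is synthetic (generated from a known 2 h half-life), not data from the study.

```python
import math


def half_life_from_chase(times_h, intensities):
    """Estimate a half-life (hours) assuming I(t) = I0 * exp(-k * t).

    Fits ln(intensity) against time by ordinary least squares and
    returns ln(2) / k. Raises ValueError if the signal does not decay.
    """
    if len(times_h) != len(intensities) or len(times_h) < 2:
        raise ValueError("need at least two matched time points")
    logs = [math.log(i) for i in intensities]
    n = len(times_h)
    mean_t = sum(times_h) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_h, logs))
    k = -sxy / sxx  # decay constant is the negative slope
    if k <= 0:
        raise ValueError("signal does not decay; half-life undefined")
    return math.log(2) / k


# Synthetic example: normalized band intensities with an exact 2 h half-life.
times = [0.0, 1.0, 2.0, 4.0]
signal = [math.exp(-math.log(2) / 2.0 * t) for t in times]
print(half_life_from_chase(times, signal))
```

      In practice, band intensities would first be normalized to a loading control and to the zero time point before fitting, and the fit quality should be inspected, since immunoblot signals are only semi-quantitative.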

      (5) Graphs displayed in the supplementary figures are blacked out, and individual data points cannot be visualised. All graphs should have individual data points clearly visible.

      We revised the quantified graphs and replaced them with scatter plots to clearly display individual data points, showing sample distribution.

      Additional experiments with specific comments on Figures:

      (1) Figure 1C-D: the relative amount of 53BP1 co-precipitating with FLAG-tagged GMCL1 WT appears very different between the two experiments. If the idea is that MLN4924 (Cullin neddylation inhibitor) makes the interaction easier to capture, then this should be explained in the text, and ideally shown on the same gel/blot -/+ MLN4924.

      We now present the samples treated with and without MLN4924 on the same gel/blot to allow direct comparison (new Figure 1D) and clarified this point in the text.

      (2) Figure 1E: The figure legend states that GMCL1 was immunoprecipitated, but the Figure looks as though FLAG-tagged 53BP1 was the bait protein being immunoprecipitated? Can the authors clarify?

      We thank the reviewer for pointing out the discrepancy between the figure and the figure legend in Figure 1E. The immunoprecipitation was indeed performed using FLAG-tagged 53BP1, and we have now rectified the figure legend accordingly. 

      (3) Figure 1F: Rather than parental cell lysate, the better control would be to IP FLAG from another FLAG-tagged expressing cell line, to rule out non-specific binding with the FLAG tag at the non-overexpressed level. 

      Figure 1F shows interaction at the endogenous level. The specificity of binding with overexpressed proteins is shown in Figures 1C and 1D.

      The USP28 blot is over-exposed and makes it hard to see any changes in electrophoretic mobility - it looks as though there is a change between the parental and the KI cell line? It is surprising that USP28 would co-IP with GMCL1 (presumably because USP28 is bound to 53BP1) if the function of GMCL1-53BP1 interaction is to promote 53BP1 degradation. Can the authors reconcile this? Crucially, if the authors claim that the 53BP1-GMCL1 interaction is specific to prolonged mitosis, then this experiment should be repeated and performed with asynchronous, normal-length mitosis, and prolonged mitosis conditions. This is vital for supporting the claim that this interaction only occurs during prolonged mitoses and does not occur in every mitosis regardless of length.

      This is a good point. Unfortunately, many of the protein-protein interactions occur post lysis. Therefore, we could not observe differences in asynchronous vs. mitotic cells.

      (4) Figure S1F: Label on blot should be CUL3 not CUI3.

      We thank the reviewer for pointing this out and we have corrected the typo.

      (5) Figure 2A: The authors suggest an increase in chromatin-bound 53BP1 in GMCL1 KO U2OS cells, specifically in M phase. Again, is this time in mitosis dependent, or would this be evident in every mitosis, regardless of length? Such an experiment would benefit from repetition and quantification to test whether the observed effect is reproducibly consistent. If the authors' model is correct, simply treating U2OS WT mitotic cells with MG132 during the mitotic arrest and performing the same fractionation should bring 53BP1 levels up to that seen in GMCL1 KO cells under the same conditions.

      The reviewer’s suggestion to assess 53BP1 accumulation in wild-type U2OS cells treated with MG132 during mitotic arrest is indeed highly relevant. However, treatment with MG132 during prolonged mitosis consistently led to significant cell death, making it technically challenging to evaluate 53BP1 levels under these conditions.

      (6) Figure 2B: The authors restore GMCL1 expression in the KO U2OS cells using WT and 2 distinct mutant cDNAs. However, the expression of these constructs is not equivalent, and thus their effects cannot be directly compared. It is also surprising that GMCL1 is much higher in M phase samples in this experiment (shouldn't it be destroyed?), when no such behaviour has been observed in the other figures.

      There is no evidence in our study or others that GMCL1 should be destroyed in M phase. We show that the R433A mutant is expressed at a level very similar to the WT protein, yet it does not promote the degradation of 53BP1. It is true that the E142K mutant is expressed at lower levels in mitotic cells, whereas it is the most highly expressed construct in asynchronous cells. For some reason, this mutant behaves inversely to the WT, limiting the interpretation of this result. We now mention this in the text.

      (7) Figure 2C: The CHX experiment would benefit from inclusion of a control protein known to have a short half-life (e.g. c-myc, p53). Is GMCL1 known to have a relatively short half-life? It looks as though GMCL1 disappears after 1 h CHX treatment (although hard to definitively tell in the absence of molecular weight markers). 53BP1 appears to continue declining in the absence of GMCL1, which is surprising if p53BP1 degradation requires GMCL1. How can the authors reconcile this?

      As a control for the CHX chase experiments, we included p21, whose protein levels decreased in a CHX-dependent manner. GMCL1 itself also appeared to undergo degradation upon CHX treatment, but it does not disappear completely.

      (8) Supplemental Figure 2:

      Transcription is largely inhibited in M phase, so the p53 target gene transcripts present in M phase are inherited from the preceding G2 phase. The qPCR's thus need a reference sample to compare against. I.e., was p21/PUMA/NOXA mRNA already low in G2 in the GMCL1 KO + WT cells before they entered mitosis? Or is the mRNA stability affected during M phase specifically? Is this effect on the mRNA dependent on the time in mitosis?

      It is well established that transcription is not entirely shut down during mitosis, particularly for a subset of genes involved in cell cycle regulation. For example, p21, PUMA, NOXA, and p53 mRNAs have been shown to remain actively transcribed during mitosis (see Table S5 in PMID: 28912132). However, we currently lack direct evidence that p53 activation during mitosis, specifically through the mitotic surveillance pathway, drives the transcription of p21, PUMA, or NOXA mRNAs during M phase. In the absence of such mechanistic data, we opted to exclude these analyses from the final figures.

      Panel B: blots are too over-exposed to see differences in p53 stability under the different conditions. Mitotic samples should be included to show how these differ from the G1 samples.

      The background of all blot images has been adjusted to ensure clarity and consistency.

      Panel D: The authors show no significant difference in the cell cycle profiles of the GMCL1 KO and reconstituted cells compared to parental U2OS cells. This should also be performed in the G1 daughter cells following a prolonged mitosis, to test the effect of the different GMCL1 constructs on G1 cell cycle arrest. U2OS cells have been reported not to have a functional mitotic surveillance pathway (Meitinger et al, Science, 2024), so U2OS cells are perhaps not a good model for testing this.

      We performed cell cycle profiling using EdU incorporation in hTERT-RPE1 cells, which possess a functional MSP, to evaluate cell cycle progression in daughter cells following prolonged mitosis. We observed that GMCL1 knockdown alone leads to G1-phase arrest. In contrast, co-depletion of GMCL1 with either 53BP1 or USP28 bypasses this arrest, indicating that GMCL1 regulates cell cycle progression in an MSP-dependent manner. Please see also the answer to the public review above. 

      (9) Figure 3:

      The authors show expression data for GMCL1 in the different cancer cell lines. This should be validated for a subset of cancer cell lines at the GMCL1 protein level, and cross-correlated to their MSP/mitotic timer status. Does GMCL1 depletion or knockout in p53 wild-type cancer cell lines overexpressing GMCL1 protein restore mitotic surveillance function?

      We were unable to assess GMCL1 protein levels using publicly available proteomics datasets, as GMCL1 expression was not detected. In p53 wild-type hTERT-RPE1 cells, GMCL1 knockdown impaired the mitotic surveillance pathway, as evidenced by G1-phase arrest following prolonged mitosis (new Figure 3A and new Supplementary Figure 3A, B). This arrest was rescued by co-depletion of either TP53BP1 or USP28, indicating that GMCL1 acts upstream of the MSP.

      (10) Figure 4:

      The authors show siRNA experiments depleting GMCL1 and testing the effects of GMCL1 loss on cell viability and apoptosis induction. This is performed in different cell line backgrounds. However, there is no demonstration that any of the observed effects are due to a lack of GMCL1 activity on 53BP1. These experiments need to be repeated in 53BP1 co-depleted cells to test for rescue. Without this, the interpretation is purely correlative.

      We assessed the effects of GMCL1 knockdown, alone or in combination with TP53BP1 or USP28 knockdown, on cell viability and apoptosis in hTERT-RPE1 cells using siRNA. Knockdown of GMCL1 alone led to a significant reduction in cell viability and an increase in apoptosis. However, co-depletion of GMCL1 with either TP53BP1 or USP28 restored both cell viability and apoptosis levels to those observed in control cells (new Figure 5I,J).

      (11) Text comments:

      Line 257: HeLa cells suppress p53 through the E6 viral protein and are not "mutant" for p53.

      The authors should cite early work by Uetake and Sluder describing the effects of spindle poisons on the mitotic surveillance pathway.

      We appreciate the reviewer’s comments. We have now made the necessary corrections.

      Reviewer #2 (Recommendations for the authors):

      Major Points:

      (1) Unsubstantiated Mechanistic Claims:

      In Figures 3 and 4, the authors show correlations between GMCL1 expression and sensitivity to Taxol. However, they fail to demonstrate that the mitotic stopwatch is mechanistically involved. To support this conclusion, the authors must test whether deletion of 53BP1, USP28, or disruption of their interaction rescues Taxol sensitivity in GMCL1-depleted cells. Since 53BP1 also plays a role in DNA damage response, such rescue experiments are necessary to distinguish between mitotic surveillance-specific and broader stress-response effects. Deletion of USP28 would be particularly informative.

      We sought to experimentally determine whether GMCL1 is involved in regulating the mitotic stopwatch. Knockdown of GMCL1 alone resulted in reduced cell proliferation and increased apoptosis. In contrast, co-depletion of GMCL1 with either TP53BP1 or USP28 restored both proliferation and apoptosis levels to those observed in control cells (new Figure 5I, J). To further strengthen our mechanistic experiments, we assessed the effect of GMCL1 levels on cell cycle progression. We conducted EdU incorporation assays following nocodazole synchronization and release. Knockdown of GMCL1 alone led to a delay in G1 progression, whereas co-depletion of either TP53BP1 or USP28 rescued normal cell cycle progression (new Figure 3A and new Supplementary Figure 3A, B). These results are consistent with our proliferation data and suggest that GMCL1 functions upstream of the ternary complex, likely by regulating 53BP1 protein levels.

      (2) Model System Limitations (U2OS Cells):

      The use of U2OS cells is highly problematic for investigating the mitotic surveillance pathway. U2OS cells lack a functional mitotic stopwatch and do not arrest following prolonged mitosis in a 53BP1/USP28-dependent manner (PMID: 38547292). Therefore, conclusions drawn from this model system about the function of the mitotic surveillance pathway are not substantiated. Key experiments should be repeated in a cell line with an intact pathway, such as RPE1.

      We have now also performed all key experiments in hTERT-RPE1 cells (see above). We would also like to point out that while some papers suggest that HCT116 and U2OS cells do not have an intact mitotic surveillance pathway, others have shown that the MSP is indeed functioning in HCT116 cells and can be triggered with variable efficiency in U2OS cells (PMID: 38547292). This is likely due to high heterogeneity and extensive clonal diversity of cancer cell lines grown in different labs. Please see examples in PMIDs: 3620713, 30089904, and 30778230. In particular, PMID: 30089904 shows that this heterogeneity correlates with considerably different drug responses.

      (3) Misinterpretation of p53 Activity Timing:

      The manuscript states that "GMCL1 KO cells led to decreased mRNA levels of p21 and NOXA during mitosis" (line 194). However, it is well established that the mitotic surveillance pathway activates p53 in the G1 phase following prolonged mitosis, not during mitosis itself (PMID: 38547292). Therefore, the observed changes in mRNA levels during mitosis are unlikely to be relevant to this pathway.

      We currently lack direct evidence that p53 activated during mitosis through the mitotic surveillance pathway directly influences the transcription of p21, PUMA, or NOXA mRNAs during M phase. Therefore, we have chosen to exclude these data from the final figures.

      (4) Incorrect Interpretation of 53BP1 Chromatin Binding:

      The authors claim that 53BP1 remains associated with chromatin during mitosis, which contradicts established literature. It is known that 53BP1 is released from chromatin during mitosis via mitosis-specific phosphorylation (PMID: 24703952), and this is supported by more recent findings (PMID: 38547292). A likely explanation for the discrepancy may be contamination of mitotic fractions with interphase cells. The chromatin fraction data in Figure 2C must be interpreted with caution.

      Our method to synchronize in M phase is rather stringent (see Supplementary Figure 3D as an example). The literature indicates that the bulk of 53BP1 is released from chromatin during mitosis. Yet, even in the two publications mentioned by the reviewer, there is a difference in the observable amount of 53BP1 bound to chromatin (compare Figure 2B in PMID: 38547292 and Figure 5A in PMID: 24703952). The difference is likely due to the different biochemical approaches used to purify chromatin-bound proteins (salt and detergent concentrations, sonication, etc.). Using our fractionation approach, we can reliably separate the soluble fraction (which also contains the nucleoplasmic fraction) from chromatin-associated proteins, as indicated by controls such as α-Tubulin and Histone H3. We have now mentioned these limitations when comparing different fractionation methods in our discussion section.

      (5) Inadequate Citation of Foundational Literature:

      The literature on the mitotic surveillance pathway is relatively limited, and it is essential that the authors provide a comprehensive and accurate account of its development. The foundational work by the Sluder lab (PMID: 20832310), demonstrating a p53-dependent arrest following prolonged mitosis, must be cited. Furthermore, the three key 2016 papers (PMID: 27432896, 27432897, 27432896) that identified the involvement of USP28 and 53BP1 in this pathway are critical and should be cited as the basis of the mitotic surveillance pathway.

      In contrast, the manuscript currently emphasizes publications that either contribute minimally or have been contradicted by prior and subsequent work. For example: PMID: 31699974, which proposes Ser15 phosphorylation of p53 as critical, has been contradicted by multiple groups (e.g., Holland, Oegema, and Tsou labs).

      PMID: 37888778, which suggests that 53BP1 must be released from kinetochores, is inconsistent with findings that indicate kinetochore localization is not relevant.

      The authors should thoroughly revise the Introduction to reflect what this reviewer would describe as a more accurate and scholarly approach to the literature.

      We have substantially revised both the Introduction and Discussion sections to incorporate important references kindly suggested by the reviewer.

      Minor Points:

      (1) Overexposed Western Blots:

      The Western blots throughout the manuscript are heavily overexposed and saturated, obscuring differences in protein levels and hindering data interpretation. The authors should provide properly exposed blots with quantification where appropriate.

      We have provided Western blot images with appropriate exposure levels and included quantification where appropriate (i.e., to measure the kinetics of decay rates as in Figure 2C). For all the other immunoblots, we did not include densitometric quantification, given that the semi-quantitative nature of this technique would lead to overinterpretation of our data. This is, unfortunately, a limitation of the technique. In fact, eLife and other similar scientific journals do not adhere to the practice of quantifying Western blot analyses.

      (2) Missing information in the graphs in Figure 2C and 4; S2? How many repeats? What are the asterisks?

      Panels referenced above have been repeated several times, and further details are now provided in the figure legends.

      Reviewer #3 (Recommendations for the authors):

      (1) The claim that GMCL1 modulates paclitaxel sensitivity in cancer should be toned down.

      We agree that it would be an overstatement to claim that GMCL1 regulates paclitaxel sensitivity in cancer patients in a clinical context. The correlations we observed were based on publicly available, cell line–based datasets, which do not fully account for clinical heterogeneity and patient-specific factors. In response to this important point, we have revised our statements and corresponding text accordingly. We now place greater emphasis on our molecular and cell biology studies.

      (2) Additional experiments in low, physiologically relevant concentrations of paclitaxel would be interesting. It is possible that these concentrations activate the mitotic stopwatch in a portion of cells, in addition to inducing cell death due to chromosome loss, activation of an immune response, and chromothripsis. Results should be interpreted in the context of this complexity.

      Please see the response to the public review. 

      (3) It would be helpful to show that CUL3 interacts with 53BP1 only in the presence of GMCL1.

      We show that the binding of 53BP1 to GMCL1 is independent of the ability of GMCL1 to bind CUL3 (Figure 1C, D). The binding between 53BP1 and CUL3 is difficult to detect (Figure 1F) likely because it’s not direct but mediated by GMCL1.

      (4) The GMCL1 "KO" lines appear to still express a low level of GMCL1 (Figure 2A), which should be acknowledged

      We have included the GMCL1 mRNA expression data, as measured by RT-PCR, in Supplementary Figure 1G, demonstrating that GMCL1 expression was undetectable under the tested conditions.

      (5) Additional description of the methods is warranted. This is particularly true for the database analysis that forms the basis for the claim that GMCL1 overexpression causes resistance to paclitaxel and other taxanes presented in Figure 3, the methodology used to obtain M-phase cells, and the concentration and duration of taxol treatment.

      We have now extensively revised the Methods section.  

      (6) "Taxol" and "paclitaxel" are used interchangeably throughout the manuscript. Consistency would be preferable.

      We have revised the manuscript for consistency: we now use “paclitaxel” when discussing that individual compound and “taxanes” when referring collectively to cabazitaxel, docetaxel, and paclitaxel, and we have removed “Taxol” entirely to avoid redundancy or confusion.

      (7) It is unclear why it is claimed that GMCL1 interacts "specifically" with 53BP1 (line 176) since multiple interactors were identified in the IP-MS study

      We meant that the GMCL1 R433A mutant loses its ability to bind 53BP1, suggesting that the GMCL1-53BP1 interaction is not an artifact. We have now clarified the text. 

      (8) The bottom row in Figure S3 is misleading. Paclitaxel is not uniformly effective in every tumor of any given type, and so resistance occurs in every cancer type.

      We fully agree that cancer is highly heterogeneous and that paclitaxel efficacy varies across tumors, even within the same histological subtype. Our intention was not to suggest uniform sensitivity/resistance, but rather to provide a high-level overview using aggregated data. We acknowledge that this coarse-grained representation may unintentionally imply overly generalized conclusions. To avoid potential misinterpretation, we have removed the corresponding panel in the revised paper.

    1. Author response:

      Here we provide a provisional response addressing the public comments and outlining the revisions we are planning to make:

      (1) We will add additional baseline models to delineate the contributions of the acoustic and linguistic pathways.

      (2) We will show additional ablation analysis and other model comparison results, as suggested by the reviewers, to justify the choice of the DNN models.

      (3) We will clarify the use of the TIMIT dataset during pre-training. In fact, the TIMIT speech data (the speech corpora used in the test set) was not included or used when pre-training the acoustic or linguistic pathway. It was only used in fine-tuning the final speech synthesizer (the cosyvoice model). We will present results without this fine-tuning step, which will fully eliminate the usage of the TIMIT data during model training.

      (4) We will further analyze the phoneme confusion matrices and/or other data to evaluate the model behavior.

      (5) We will analyze the test sentences with high and low accuracies. We will also include results with partial training data (e.g. using 25%, 50%, 75% of the training set) to further evaluate the impact of the total amount of training data.
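      The partial-training-data analysis in point (5) could be set up along the following lines. This is only a minimal sketch under stated assumptions: the sentence identifiers, the fixed seed, and the helper name `nested_subsets` are hypothetical and are not taken from the authors' pipeline. The subsets are nested, so each larger fraction contains every item of the smaller ones, which keeps the comparison across training-set sizes from being confounded by different random draws.

```python
import random


def nested_subsets(items, fractions, seed=0):
    """Return {fraction: subset}, where each subset is a prefix of one shuffled order.

    Using prefixes of a single shuffle guarantees that the 25% subset is
    contained in the 50% subset, the 50% subset in the 75% subset, and so on.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    order = list(items)
    rng.shuffle(order)
    return {f: order[: round(len(order) * f)] for f in sorted(fractions)}


# Hypothetical training-sentence identifiers (not the real corpus).
train_ids = [f"sentence_{i:03d}" for i in range(200)]
subsets = nested_subsets(train_ids, [0.25, 0.5, 0.75, 1.0])

for fraction, subset in subsets.items():
    print(fraction, len(subset))
```

      Each subset would then be used to train a separate model that is evaluated on the same held-out test sentences.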

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This research group has consistently performed cutting-edge research aiming to understand the role of hormones in the control of social behaviors, specifically by utilizing the genetically-tractable teleost fish, medaka, and the current work is no exception. The overall claim they make, that estrogens modulate social behaviors in males and females is supported, with important caveats. For one, there is no evidence these estrogens are generated by "neurons" as would be assumed by their main claim that it is NEUROestrogens that drive this effect. While indeed the aromatase they have investigated is expressed solely in the brain, in most teleosts, brain aromatase is only present in glial cells (astrocytes, radial glia). The authors should change this description so as not to mislead the reader. Below I detail more specific strengths and weaknesses of this manuscript.

      We thank the reviewer for this positive evaluation of our work and for the helpful comments and suggestions. Regarding the concern that the term “neuroestrogens” may be misleading, we addressed this in the previous revision by consistently replacing it throughout the manuscript with “brain-derived estrogens” or “brain estrogens.”

      In addition, the following sentence was added to the Introduction (line 61): “In teleost brains, including those of medaka, aromatase is exclusively localized in radial glial cells, in contrast to its neuronal localization in rodent brains (Forlano et al., 2001; Diotel et al., 2010; Takeuchi and Okubo, 2013).”

      Strength:

      Excellent use of the medaka model to disentangle the control of social behavior by sex steroid hormones 

      The findings are strong for the most part because deficits in the mutants are restored by the molecule (estrogens) that was no longer present due to the mutation 

      Presentation of the approach and findings are clear, allowing the reader to make their own inferences and compare them with the authors' 

      Includes multiple follow-up experiments, which leads to tests of internal replication and an impactful mechanistic proposal 

      Findings are provocative not just for teleost researchers, but for other species since, as the authors point out, the data suggest mechanisms of estrogenic control of social behaviors may be evolutionary ancient 

      We thank the reviewer again for their positive evaluation of our work.

      Weakness:

      As stated in the summary, the authors are attributing the estrogen source to neurons and there isn't evidence this is the case. The impact of the findings doesn't rest on this either

      As mentioned above, we addressed this in the previous revision by replacing “neuroestrogens” with “brain-derived estrogens” or “brain estrogens” throughout the manuscript. In addition, the following sentence was added to the Introduction (line 61): “In teleost brains, including those of medaka, aromatase is exclusively localized in radial glial cells, in contrast to its neuronal localization in rodent brains (Forlano et al., 2001; Diotel et al., 2010; Takeuchi and Okubo, 2013).”

      The d4 versus d8 esr2a mutants showed different results for aggression. The meaning and implications of this finding are not discussed, leaving the reader wondering

      This comment is the same as one raised in the first review (Reviewer #1’s comment 2 on weaknesses), which we already addressed in our initial revision. For the reviewer’s convenience, we provide the response below:

      Line 300: As the reviewer correctly noted, circles were significantly reduced in mutant males of the Δ8 line, whereas no significant reduction was observed in those of the Δ4 line. However, a tendency toward reduction was evident in the Δ4 line (P = 0.1512), and both lines showed significant differences in fin displays. Based on these findings, we believe our conclusion that esr2a<sup>−/−</sup> males exhibit reduced aggression remains valid. To clarify this point and address potential reader concerns, we have revised the text as follows: “esr2a<sup>−/−</sup> males exhibited significantly fewer fin displays (P = 0.0461 and 0.0293 for Δ8 and Δ4 lines, respectively) and circles (P = 0.0446 and 0.1512 for Δ8 and Δ4 lines, respectively) than their wild-type siblings (Fig. 5L; Fig. S8E), suggesting less aggression” was edited to read “esr2a<sup>−/−</sup> males from both the Δ8 and Δ4 lines exhibited significantly fewer fin displays than their wild-type siblings (P = 0.0461 and 0.0293, respectively). Circles followed a similar pattern, with a significant reduction in the Δ8 line (P = 0.0446) and a comparable but non-significant decrease in the Δ4 line (P = 0.1512) (Figure 5L, Figure 5—figure supplement 3E), showing less aggression.”

      Lack of attribution of previous published work from other research groups that would provide the proper context of the present study

      This comment is also the same as one raised in the first review (Reviewer #1’s comment 3 on weaknesses). In our previous revision, in response to this comment, we cited the relevant references (Hallgren et al., 2006; O’Connell and Hofmann, 2012; Huffman et al., 2013; Jalabert et al., 2015; Yong et al., 2017; Alward et al., 2020; Ogino et al., 2023) in the appropriate sections. We also added the following new references and revised the Introduction and Discussion accordingly:

      (2) Alward BA, Laud VA, Skalnik CJ, York RA, Juntti SA, Fernald RD. 2020. Modular genetic control of social status in a cichlid fish. Proceedings of the National Academy of Sciences of the United States of America 117:28167–28174. DOI: https://doi.org/10.1073/pnas.2008925117

      (39) O’Connell LA, Hofmann HA. 2012. Social status predicts how sex steroid receptors regulate complex behavior across levels of biological organization. Endocrinology 153:1341–1351. DOI: https://doi.org/10.1210/en.2011-1663

      (54) Yong L, Thet Z, Zhu Y. 2017. Genetic editing of the androgen receptor contributes to impaired male courtship behavior in zebrafish. Journal of Experimental Biology 220:3017–3021. DOI: https://doi.org/10.1242/jeb.161596

      There are a surprising number of citations not included; some of the ones not included argue against the authors' claims that their findings were "contrary to expectation"

      In our previous revision, we cited the relevant references (Hallgren et al., 2006; O’Connell and Hofmann, 2012; Huffman et al., 2013; Jalabert et al., 2015) in the Introduction. We also revised the text to remove phrases such as “contrary to expectation” and “unexpected.”

      The experimental design for studying aggression in males has flaws. A standard test like a resident-intruder test should be used.

      Following this comment, we have attempted additional aggression assays using the resident-intruder paradigm. However, these experiments did not produce consistent or interpretable results. As noted in our previous revision, medaka naturally form shoals and exhibit weak territoriality, and even slight differences in dominance between a resident and an intruder can markedly increase variability, reducing data reliability. Therefore, we believe that the approach used in the present study provides a more suitable assessment of aggression in medaka, regardless of territorial tendencies. We will continue to explore potential refinements in future studies and respectfully ask the reviewer to evaluate the present work based on the assay used here.

      While they investigate males and females, there are fewer experiments and explanations for the female results, making it feel like a small addition or an aside

      While we did not adopt this comment in our previous revision, we have carefully reconsidered the reviewers’ feedback and have now decided to remove the female data. This change allows us to present a more focused and cohesive story centered on males. The specific revisions are outlined below:

      Abstract

      Line 25: The text “, thereby revealing a previously unappreciated mode of action of brain-derived estrogens. We additionally show that female fish lacking Cyp19a1b are less receptive to male courtship and conversely court other females, highlighting the significance of brain-derived estrogens in establishing sex-typical behaviors in both sexes.” has been revised to “. Taken together, these findings reveal a previously unappreciated mode of action of brain-derived estrogens in shaping male-typical behaviors.”

      Results

      Line 88: The text “Loss of cyp19a1b function in these fish was verified by measuring brain and peripheral levels of sex steroids. As expected, brain estradiol-17β (E2) in both male and female homozygous mutants (cyp19a1b<sup>−/−</sup>) was significantly reduced to 16% and 50%, respectively, of the levels in their wild-type (cyp19a1b<sup>+/+</sup>) siblings (P = 0.0037, males; P = 0.0092, females) (Fig. 1, A and B). In males, brain E2 in heterozygotes (cyp19a1b<sup>−/−</sup>) was also reduced to 45% of the level in wild-type siblings (P = 0.0284) (Fig. 1A), indicating a dosage effect of cyp19a1b mutation. In contrast, peripheral E2 levels were unaltered in both cyp19a1b<sup>−/−</sup> males and females (Fig. S1, C and D), consistent with the expected functioning of Cyp19a1b primarily in the brain. Strikingly, brain levels of testosterone, as opposed to E2, increased 2.2-fold in cyp19a1b<sup>−/−</sup> males relative to wild-type siblings (P = 0.0006) (Fig. 1A). Similarly, brain 11KT levels in cyp19a1b<sup>−/−</sup> males and females increased 6.2- and 1.9-fold, respectively, versus wild-type siblings (P = 0.0007, males; P = 0.0316, females) (Fig. 1, A and B). These results show that cyp19a1b-deficient fish have reduced estrogen levels coupled with increased androgen levels in the brain, confirming the loss of cyp19a1b function. They also suggest that the majority of estrogens in the male brain and half of those in the female brain are synthesized locally in the brain. In addition, peripheral 11KT levels in cyp19a1b<sup>−/−</sup> males and females increased 3.7- and 1.8-fold, respectively (P = 0.0789, males; P = 0.0118, females) (Fig. S1, C and D), indicating peripheral influence in addition to central effects.” has been revised to “Loss of cyp19a1b function in these fish was verified by measuring brain and peripheral levels of sex steroids in males. 
As expected, brain estradiol-17β (E2) in homozygous mutants (cyp19a1b<sup>−/−</sup>) was significantly reduced to 16% of the levels in wild-type (cyp19a1b<sup>+/+</sup>) siblings (P = 0.0037) (Figure 1A). Brain E2 in heterozygotes (cyp19a1b<sup>+/−</sup>) was also reduced to 45% of wild-type levels (P = 0.0284) (Figure 1A), indicating a dosage effect of the cyp19a1b mutation. In contrast, peripheral E2 levels were unaltered in cyp19a1b<sup>−/−</sup> males (Figure 1B), consistent with the expected functioning of Cyp19a1b primarily in the brain. Strikingly, brain testosterone levels, as opposed to E2, increased 2.2-fold in cyp19a1b<sup>−/−</sup> males relative to wild-type siblings (P = 0.0006) (Figure 1A). Similarly, brain 11KT levels increased 6.2-fold (P = 0.0007) (Figure 1A). These results indicate that cyp19a1b-deficient males have reduced estrogen coupled with elevated androgen levels in the brain, confirming the loss of cyp19a1b function. They also suggest that the majority of estrogens in the male brain are synthesized locally in the brain. Peripheral 11KT levels also increased 3.7-fold in cyp19a1b<sup>−/−</sup> males (P = 0.0789) (Figure 1B), indicating peripheral influence in addition to central effects.”

      Line 211: “expression of vt in the pNVT of cyp19a1b<sup>−/−</sup> males was significantly reduced to 18% as compared with cyp19a1b<sup>+/+</sup> males (P = 0.0040), a level comparable to that observed in females” has been revised to “expression of vt in the pNVT of cyp19a1b<sup>−/−</sup> males was significantly reduced to 18% as compared with cyp19a1b<sup>+/+</sup> males (P = 0.0040).”

      The subsection entitled “cyp19a1b-deficient females are less receptive to males and instead court other females,” which followed line 311, has been removed.

      Discussion

      The two paragraphs between lines 373 and 374, which addressed the female data, have been removed.

      Materials and methods

      Line 433: “males and females” has been changed to “males”.

      Line 457: “focal fish” has been changed to “focal male”.

      Line 458: “stimulus fish” has been changed to “stimulus female”.

      Line 458: “Fig. 6, E and F, ” has been deleted.

      Line 460: “; wild-type males in Fig. 6, A to C” has been deleted.

      Line 466: The text “The period of interaction/recording was extended to 2 hours in tests of courtship displays received from the stimulus esr2b-deficient female and in tests of mating behavior between females, because they take longer to initiate courtship (12). In tests using an esr2b-deficient female as the stimulus fish, where the latency to spawn could not be calculated because these fish were unreceptive to males and did not spawn, the sexual motivation of the focal fish was instead assessed by counting the number of courtship displays and wrapping attempts in 30 min. The number of these mating acts was also counted in tests to evaluate the receptivity of females. In tests of mating behavior between two females, the stimulus female was marked with a small notch in the caudal fin to distinguish it from the focal female.” has been revised to “In tests using an esr2b-deficient female as the stimulus fish, the latency to spawn could not be calculated because the female was unreceptive to males and did not spawn. Therefore, the sexual motivation of the focal male was assessed by counting the number of courtship displays and wrapping attempts in 30 min. To evaluate courtship displays performed by stimulus esr2b-deficient females toward focal males, the recording period was extended to 2 hours, as these females take longer to initiate courtship (Nishiike et al., 2021). In all video analyses, the researcher was blind to the fish genotype and treatment.”

      Line 499: “brains dissected from males and females of the cyp19a1b-deficient line (analysis of ara, arb, vt, gal, npba, and esr2b) and males of the esr1-, esr2a-, and esr2b-deficient lines” has been revised to “male brains from the cyp19a1b-deficient line (analysis of ara, arb, vt, and gal) and from the esr1-, esr2a-, and esr2b-deficient lines.”

      Line 504: “After color development for 15 min (gal), 40 min (npba), 2 hours (vt), or overnight (ara, arb, and esr2b)” has been revised to “After color development for 15 min (gal), 2 hours (vt), or overnight (ara and arb).”

      Line 516: “Thermo Fisher Scientific, Waltham, MA” has been changed to “Thermo Fisher Scientific” to avoid redundancy.

      Line 565: The subsection entitled “Measurement of spatial distances between fish” has been removed.

      Line 585: “6/10 cyp19a1b<sup>+/+</sup>, 3/10 cyp19a1b<sup>+/−</sup>, and 6/10 cyp19a1b<sup>−/−</sup> females were excluded in Fig. 6B;” has been deleted.

      References

      The following references have been removed:

      Capel B. 2017. Vertebrate sex determination: evolutionary plasticity of a fundamental switch. Nature Reviews Genetics 18:675–689. DOI: https://doi.org/10.1038/nrg.2017.60

      Hiraki T, Nakasone K, Hosono K, Kawabata Y, Nagahama Y, Okubo K. 2014. Neuropeptide B is female-specifically expressed in the telencephalic and preoptic nuclei of the medaka brain. Endocrinology 155:1021–1032. DOI: https://doi.org/10.1210/en.2013-1806

      Juntti SA, Hilliard AT, Kent KR, Kumar A, Nguyen A, Jimenez MA, Loveland JL, Mourrain P, Fernald RD. 2016. A neural basis for control of cichlid female reproductive behavior by prostaglandin F2α. Current Biology 26:943–949. DOI: https://doi.org/10.1016/j.cub.2016.01.067

      Kimchi T, Xu J, Dulac C. 2007. A functional circuit underlying male sexual behaviour in the female mouse brain. Nature 448:1009–1014. DOI: https://doi.org/10.1038/nature06089

      Kobayashi M, Stacey N. 1993. Prostaglandin-induced female spawning behavior in goldfish (Carassius auratus) appears independent of ovarian influence. Hormones and Behavior 27:38–55. DOI: https://doi.org/10.1006/hbeh.1993.1004

      Liu H, Todd EV, Lokman PM, Lamm MS, Godwin JR, Gemmell NJ. 2017. Sexual plasticity: a fishy tale. Molecular Reproduction and Development 84:171–194. DOI: https://doi.org/10.1002/mrd.22691

      Munakata A, Kobayashi M. 2010. Endocrine control of sexual behavior in teleost fish. General and Comparative Endocrinology 165:456–468. DOI: https://doi.org/10.1016/j.ygcen.2009.04.011

      Nugent BM, Wright CL, Shetty AC, Hodes GE, Lenz KM, Mahurkar A, Russo SJ, Devine SE, McCarthy MM. 2015. Brain feminization requires active repression of masculinization via DNA methylation. Nature Neuroscience 18:690–697. DOI: https://doi.org/10.1038/nn.3988

      Shaw K, Therrien M, Lu C, Liu X, Trudeau VL. 2023. Mutation of brain aromatase disrupts spawning behavior and reproductive health in female zebrafish. Frontiers in Endocrinology 14:1225199. DOI: https://doi.org/10.3389/fendo.2023.1225199

      Stacey NE. 1976. Effects of indomethacin and prostaglandins on the spawning behaviour of female goldfish. Prostaglandins 12:113–126. DOI: https://doi.org/10.1016/s0090-6980(76)80010-x

      Figure 1

      Panel B, which originally showed steroid levels in female brains, has been replaced with steroid levels in the periphery of males, originally presented in Figure S1, panel C. Accordingly, the legend “(A and B) Levels of E2, testosterone, and 11KT in the brain of adult cyp19a1b<sup>+/+</sup>, cyp19a1b<sup>+/−</sup>, and cyp19a1b<sup>−/−</sup> males (A) and females (B) (n = 3 per genotype and sex).” has been revised to “(A, B) Levels of E2, testosterone, and 11KT in the brain (A) and periphery (B) of adult cyp19a1b<sup>+/+</sup>, cyp19a1b<sup>+/−</sup>, and cyp19a1b<sup>−/−</sup> males (n = 3 per genotype).”

      Figure 3

      The female data have been deleted from Figure 3. The revised Figure 3 is presented.

      The corresponding legend text has been revised as follows:

      Line 862: “males and females (n = 4 and 5 per genotype for males and females, respectively)” has been changed to “males (n = 4 per genotype)”.

      Line 864: “males and females (n = 4 except for cyp19a1b<sup>+/+</sup> males, where n = 3)” has been changed to “males (n = 3 and 4, respectively)”.

      Figure 6

      Figure 6 and its legend have been removed.

      Figure 1—figure supplement 1

      Panel C, showing male data, has been moved to Figure 1B, as described above, while panel D, showing female data, has been deleted. The corresponding legend “(C and D) Levels of E2, testosterone, and 11KT in the periphery of adult cyp19a1b<sup>+/+</sup>, cyp19a1b<sup>+/−</sup>, and cyp19a1b<sup>−/−</sup> males (C) and females (D) (n = 3 per genotype and sex). Statistical differences were assessed by Bonferroni’s post hoc test (C and D). Error bars represent SEM. *P < 0.05.” has also been removed.

      Line 804: Following this change, the figure title has been updated from “Generation of cyp19a1b-deficient medaka and evaluation of peripheral sex steroid levels” to “Generation of cyp19a1b-deficient medaka.”

      The statistics comparing "experimental to experimental" and "control to experimental" isn't appropriate 

      This comment is the same as one raised in the first review (Reviewer #1’s comment 7 on weaknesses), which we already addressed in our initial revision. For the reviewer’s convenience, we provide the response below:

      The reviewer raised concerns about the statistical analysis used for Figures 4C and 4E, suggesting that Bonferroni’s test should be used instead of Dunnett’s test. However, Dunnett’s test is commonly used to compare treatment groups to a reference group that receives no treatment, as in our study. Since we do not compare the treated groups with each other, we believe Dunnett’s test is the most appropriate choice.

      Line 576: The reviewer’s concern may have arisen from the phrase “comparisons between control and experimental groups” in the Materials and methods. We have revised it to “comparisons between untreated and E2-treated groups in Figure 4C and D” for clarity.
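      The difference between the two comparison designs discussed above can be made concrete with a small count of the tests in each family. The sketch below is illustrative only: the number of groups is hypothetical, and the code does not implement Dunnett's test itself, which additionally accounts for the correlation among comparisons that share one reference group. It only shows that a many-to-one design involves fewer comparisons than an all-pairs design, so a Bonferroni threshold sized for all pairs is stricter than the many-to-one question requires.

```python
from math import comb


def n_comparisons(n_groups: int, design: str) -> int:
    """Number of pairwise tests implied by a comparison design."""
    if design == "many_to_one":  # each treated group vs. a single reference group
        return n_groups - 1
    if design == "all_pairs":  # every group vs. every other group
        return comb(n_groups, 2)
    raise ValueError(f"unknown design: {design}")


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Hypothetical layout: one untreated group plus three treated groups.
n_groups = 4
for design in ("many_to_one", "all_pairs"):
    m = n_comparisons(n_groups, design)
    print(design, m, round(bonferroni_alpha(0.05, m), 4))
```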

      Reviewer #3 (Public Review):

      Summary:

      Taking advantage of the existence in fish of two genes coding for estrogen synthase, the enzyme aromatase, one mostly expressed in the brain (Cyp19a1b) and the other mostly found in the gonads (Cyp19a1a), this study investigates the role of brain-derived estrogens in the control of sexual and aggressive behavior in medaka. The constitutive deletion of Cyp19a1b markedly reduced brain estrogen content in males and to a lesser extent in females. These effects are accompanied by reduced sexual and aggressive behavior in males and reduced preference for males in females. These effects are reversed by adult treatment with supporting a role for estrogens. The deletion of Cyp19a1b is associated with a reduced expression of the genes coding for the two androgen receptors, ara and arb, in brain regions involved in the regulation of social behavior. The analysis of the gene expression and behavior of mutants of estrogen receptors indicates that these effects are likely mediated by the activation of the esr1 and esr2a isoforms. These results provide valuable insight into the role of estrogens in social behavior in the most abundant vertebrate taxon, however the conclusion of brain-derived estrogens awaits definitive confirmation.

      We thank this reviewer for their positive evaluation of our work and comments that have improved the manuscript.

      Strength:

      Evaluation of the role of brain "specific" Cyp19a1 in male teleost fish, which as a taxon are more abundant and yet proportionally less studied than the most common birds and rodents. Therefore, evaluating the generalizability of results from higher vertebrates is important. This approach also offers great potential to study the role of brain estrogen production in females, an understudied question in all taxa.

      Results obtained from multiple mutant lines converge to show that estrogen signaling, likely synthesized in the brain drives aspects of male sexual behavior.

      The comparative discussion of the age-dependent abundance of brain aromatase in fish vs mammals and its role in organization vs activation is important beyond the study of the targeted species.

      The authors have made important corrections to tone down some of the conclusions which are more in line with the results.

      We thank the reviewer again for their positive evaluation of our work and the revisions we have made.

      Weaknesses:

      No evaluation of the mRNA and protein products of Cyp19a1b and ESR2a is presented, so there is no proper demonstration that the mutation indeed leads to aromatase reduction. The conclusion that these effects depend on brain-derived estrogens is therefore supported only by measures of E2 with an EIA kit that is not validated. These shortcomings are not addressed in the Discussion, which further weakens the manuscript's conclusion.

      In response to this and other comments, we have now provided direct validation that the cyp19a1b mutation in our medaka leads to loss of function. Real-time PCR analysis showed that cyp19a1b transcript levels in the brain were reduced by approximately half in cyp19a1b<sup>+/−</sup> males and were nearly absent in cyp19a1b<sup>−/−</sup> males, consistent with nonsense-mediated mRNA decay.

      In addition, AlphaFold 3-based structural modeling indicated that the mutant Cyp19a1b protein lacks essential motifs, including the aromatic region and heme-binding loop, and exhibits severe conformational distortion (see figure; key structural features are annotated as follows: membrane helix (blue), aromatic region (red), and heme-binding loop (orange)). 

      Results:

      Line 101: The following text has been added: “Loss of cyp19a1b function was further confirmed by measuring cyp19a1b transcript levels in the brain and by predicting the three-dimensional structure of the mutant protein. Real-time PCR revealed that transcript levels were reduced by half in cyp19a1b<sup>+/−</sup> males and were nearly undetectable in cyp19a1b<sup>−/−</sup> males, presumably as a result of nonsense-mediated mRNA decay (Lindeboom et al., 2019) (Figure 1C). The wild-type protein, modeled by AlphaFold 3, exhibited a typical cytochrome P450 fold, including the membrane helix, aromatic region, and heme-binding loop, all arranged in the expected configuration (Figure 1—figure supplement 1C). The mutant protein, in contrast, was severely truncated, retaining only the membrane helix (Figure 1—figure supplement 1C). The absence of essential domains strongly indicates that the allele encodes a nonfunctional Cyp19a1b protein. Together, transcript and structural analyses consistently demonstrate that the mutation generated in this study causes a complete loss of cyp19a1b function.”

      Materials and methods

      Line 438: A subsection entitled “Real-time PCR” has been added. The text of this subsection is as follows: “Total RNA was isolated from the brains of cyp19a1b<sup>+/+</sup>, cyp19a1b<sup>+/−</sup>, and cyp19a1b<sup>−/−</sup> males using the RNeasy Plus Universal Mini Kit (Qiagen, Hilden, Germany). cDNA was synthesized with the SuperScript VILO cDNA Synthesis Kit (Thermo Fisher Scientific, Waltham, MA). Real-time PCR was performed on the LightCycler 480 System II using the LightCycler 480 SYBR Green I Master (Roche Diagnostics). Melting curve analysis was conducted to verify that a single amplicon was obtained in each sample. The β-actin gene (actb; GenBank accession number NM_001104808) was used to normalize the levels of target transcripts. The primers used for real-time PCR are shown in Supplementary file 2.”

      Line 448: A subsection entitled “Protein structure prediction” has been added. The text of this subsection is as follows: “Structural predictions of Cyp19a1b proteins were conducted using AlphaFold 3 (Abramson et al., 2024). Amino acid sequences corresponding to the wild-type allele and the mutant allele generated in this study were submitted to the AlphaFold 3 prediction server. The resulting models were visualized with PyMOL (Schrödinger, New York, NY), and key structural features, including the membrane helix, aromatic region, and heme-binding loop, were annotated.”

      References

      The following two references have been added:

      Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, Ronneberger O, Willmore L, Ballard AJ, Bambrick J, Bodenstein SW, Evans DA, Hung CC, O'Neill M, Reiman D, Tunyasuvunakool K, Wu Z, Žemgulytė A, Arvaniti E, Beattie C, Bertolli O, Bridgland A, Cherepanov A, Congreve M, Cowen-Rivers AI, Cowie A, Figurnov M, Fuchs FB, Gladman H, Jain R, Khan YA, Low CMR, Perlin K, Potapenko A, Savy P, Singh S, Stecula A, Thillaisundaram A, Tong C, Yakneen S, Zhong ED, Zielinski M, Žídek A, Bapst V, Kohli P, Jaderberg M, Hassabis D, Jumper JM. 2024. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630:493–500. DOI: https://doi.org/10.1038/s41586-024-07487-w

      Lindeboom RGH, Vermeulen M, Lehner B, Supek F. 2019. The impact of nonsense-mediated mRNA decay on genetic disease, gene editing and cancer immunotherapy. Nature Genetics 51:1645–1651. DOI: https://doi.org/10.1038/s41588-019-0517-5

      Figure 1

      The real-time PCR results described above have been incorporated in Figure 1, panel C, with the corresponding legend provided below (line 788).

      (C) Brain cyp19a1b transcript levels in cyp19a1b<sup>+/+</sup>, cyp19a1b<sup>+/−</sup>, and cyp19a1b<sup>−/−</sup> males (n = 6 per genotype). Mean value for cyp19a1b<sup>+/+</sup> males was arbitrarily set to 1.

      The subsequent panels have been renumbered accordingly. The entirety of the revised Figure 1 is presented below.

      Figure 1—figure supplement 1

      The AlphaFold 3-generated structural models described above have been incorporated in Figure 1—figure supplement 1, panel C, with the corresponding legend provided below (line 811).

      (C) Predicted three-dimensional structures of wild-type (left) and mutant (right) Cyp19a1b proteins. Key structural features are annotated as follows: membrane helix (blue), aromatic region (red), and heme-binding loop (orange).

      The entirety of the revised Figure 1—figure supplement 1 is presented below.

      The information on the primers used for real-time PCR has been included in Supplementary file 2.

      The functional deficiency of esr2a was already addressed in the previous revision. For clarity, we have reproduced the relevant information here.

      A previous study reported that female medaka lacking esr2a fail to release eggs due to oviduct atresia (Kayo et al., 2019, Sci Rep 9:8868). Similarly, in this study, some esr2a-deficient females exhibited spawning behavior but were unable to release eggs, although the sample size was limited (Δ8 line: 2/3; Δ4 line: 1/1). In contrast, this was not observed in wild-type females (Δ8 line: 0/12; Δ4 line: 0/11). These results support the effective loss of esr2a function. To incorporate this information into the manuscript, the following text has been added to the Materials and methods (line 423): “A previous study reported that esr2a-deficient female medaka cannot release eggs due to oviduct atresia (Kayo et al., 2019). Likewise, some esr2a-deficient females generated in this study, despite the limited sample size, exhibited spawning behavior but were unable to release eggs (Δ8 line: 2/3; Δ4 line: 1/1), while such failure was not observed in wild-type females (Δ8 line: 0/12; Δ4 line: 0/11). These results support the effective loss of esr2a function.”

      Most experiments are weakly powered (low sample size).

      This comment is essentially the same as one raised in the first review (Reviewer #3’s comment 7 on weaknesses). We acknowledge the reviewer’s concern that the histological analyses were weakly powered due to the limited sample size. In our earlier revision, we responded as follows:

      Histological analyses were conducted with a relatively small sample size, as our previous experience suggested that interindividual variability in the results would not be substantial. Since significant differences were detected in many analyses, further increasing the sample size was deemed unnecessary.

      The variability of the mRNA content for the same target gene between experiments (genotype comparison vs E2 treatment comparison) raises questions about the reproducibility of the data (apparent disappearance of genotype effect).

      This comment is the same as one raised in the first review (Reviewer #3’s comment 8 on weaknesses), which we already addressed in our initial revision. For the reviewer’s convenience, we provide the response below:

      As the reviewer pointed out, the overall area of ara expression is larger in Figure 2J than in Figure 2F. However, the relative area ratios of ara expression among brain nuclei are consistent between the two figures, indicating the reproducibility of the results. Thus, this difference is unlikely to affect the conclusions of this study.

      Additionally, the differences in ara expression in pPPp and arb expression in aPPp between wild-type and cyp19a1b-deficient males appear less pronounced in Figures 2J and 2K than in Figures 2F and 2H. This is likely attributable to the smaller sample size used in the experiments for Figures 2J and 2K, resulting in less distinct differences. However, as the same genotype-dependent trends are observed in both sets of figures, the conclusion that ara and arb expression is reduced in cyp19a1b-deficient male brains remains valid.

      Conclusions:

      Overall, the claims regarding the role of estrogens originating in the brain on male sexual behavior are supported by converging evidence from multiple mutant lines. The evidence for a role of brain-derived estrogens in gene expression in the brain is weaker, as are the results in females.

      We appreciate the reviewer’s positive evaluation of our findings on male behavior. The concern regarding the role of brain-derived estrogens in gene expression has been addressed in our rebuttal, and the female data have been removed so that the analysis now focuses on males. The specific revisions for removing the female data are described in Response to reviewer #1’s comment 6 on weaknesses.

      Recommendations For The Authors:

      Reviewer #1 (Recommendations For The Authors):

      The manuscript is improved slightly. I am thankful the authors addressed some concerns, but for several concerns the referees raised, the authors acknowledged them yet did not make corresponding changes to the manuscript or disagreed that they were issues at all without explanation. All reviewers had issues with the imbalanced focus on males versus females and the male aggression assay. Yet, they did not perform additional experiments or even make changes to the framing and scope of the manuscript. If the authors had removed the female data, they may have had a more cohesive story, but then they would still be left with inadequate behavior assays in the males. If the authors don't have the time or resources to perform the additional work, then they should have said so. However, the work would be incomplete relative to the claims. That is a key point here. If they change their scope and claims, the authors avoid overstating their findings. I want to see this work published because I believe it moves the field forward. But the authors need to be realistic in their interpretations of their data. 

      In response to this and related comments, we have removed the female data and focused the manuscript on analyses in males. The specific revisions are described in Response to reviewer #1’s comment 6 on weaknesses. Additionally, we have validated that the cyp19a1b mutation in our medaka leads to loss of function (see Response to reviewer #3’s comment 1 on weaknesses), which further strengthens the reliability of our conclusions regarding male behavior.

      I agree with the reviewer who said we need to see validation of the absence of functional cyp19a1b in the brain. However, the results from staining for the protein and performing in situ could be quizzical. Indeed, there aren't antibodies that could distinguish between aromatase a and b, and it is not uncommon for expression of a mutated gene to be normal. One approach they could do is measure aromatase activity, but they are *sort of* doing that by measuring brain E2. It's not perfect, but we teleost folks are limited in these areas. At the very least, they should show the predicted protein structure of the mutated aromatase alleles. It could show clearly that the tertiary structure is utterly absent, giving more support to the fact that their aromatase gene is non-functional.

      As noted above, we have further validated the loss of cyp19a1b function by measuring cyp19a1b transcript levels in the brain and predicting the three-dimensional structure of the mutant protein. These analyses confirmed that cyp19a1b function is indeed lost, thereby increasing the reliability of our conclusions. For further details, please refer to Response to reviewer #3’s comment 1 on weaknesses.

      With all of this said, the work is important, and it is possible that with a reframing of the impact of their work in the context of their findings, I could consider the work complete. I think with a proper reframing, the work is still impactful. 

      In accordance with this feedback, and as described above, we have reframed the manuscript by removing the female data and focusing exclusively on males. This revision clarifies the scope of our study and reinforces the support for our conclusions. For further details, please refer to Response to reviewer #1’s comment 6 on weaknesses.

      (1) Clearly state in the Figure 1 legend that each data point for male aggressive behaviors represents the total # of behaviors calculated over the 4 males in each experimental tank.

      In response to this comment, we have revised the legend of Figure 1K (line 797). The original legend, “(K) Total number of each aggressive act observed among cyp19a1b<sup>+/+</sup>, cyp19a1b<sup>+/−</sup>, or cyp19a1<sup>−/−</sup> males in the tank (n = 6, 7, and 5, respectively),” has been updated to “(K) Total number of each aggressive act performed by cyp19a1b<sup>+/+</sup>, cyp19a1b<sup>+/−</sup>, and cyp19a1b<sup>−/−</sup> males. Each data point represents the sum of acts recorded for the 4 males of the same genotype in a single tank (n = 6, 7, and 5 tanks, respectively).” This clarifies that each data point reflects the total behaviors of the 4 males within each tank.
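
      The aggregation described in the revised legend can be sketched as follows. This is an illustrative Python snippet only: the function name, the nested-list layout, and the counts are hypothetical and do not come from the study. It shows that one data point corresponds to one tank and that n counts tanks, not individual fish.

```python
def tank_totals(acts_per_male_by_tank):
    """Sum the counts of one aggressive act over the males in each tank.

    acts_per_male_by_tank: list of lists; each inner list holds the counts
    for the males housed together in one tank (hypothetical layout).
    """
    return [sum(counts) for counts in acts_per_male_by_tank]


# Placeholder counts for two tanks of 4 males each (not study data).
example_counts = [[3, 0, 5, 2], [1, 1, 0, 4]]
totals = tank_totals(example_counts)  # one data point per tank
n_tanks = len(totals)                 # n is the number of tanks
```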

      (2) The authors wrote under "Response to reviewer #1's major comment "...the development of male behaviors may require moderate neuroestrogen levels that are sufficient to induce the expression of ara and arb, but not esr2b, in the underlying neural circuitry": "This may account for the lack of aggression recovery in E2-treated cyp19a1b-deficient males in this study.".

      What is meant by the latter statement? What accounts for the lack of aggression? The lack of increase in esr2b? Please clarify. 

      Line 365: In response to this comment, “This may account for the lack of aggression recovery in E2-treated cyp19a1b-deficient males in this study.” has been revised to “Considering this, the lack of aggression recovery in E2-treated cyp19a1b-deficient males in this study may be explained by the possibility that the E2 dose used was sufficient to induce not only ara and arb but also esr2b expression in aggression-relevant circuits, which potentially suppressed aggression.”

      This revision clarifies that, while moderate brain estrogen levels are sufficient to promote male behaviors via induction of ara and arb, the E2 dose used in this study may have additionally induced esr2b in circuits relevant to aggression, potentially underlying the lack of aggression recovery.

      (3) This is a continuation of my comment/concern directly above. If the induction of ara and arb aren't enough, then how can, as the authors state, androgen signaling be the primary driver of these behaviors? 

      In response to this follow-up comment, we would like to clarify that, as described above, the lack of aggression recovery in E2-treated cyp19a1b-deficient males is not due to insufficient induction of ara and arb, but instead is likely because esr2b was also induced in aggression-relevant circuits, which may have suppressed aggression. Therefore, the concern that androgen signaling cannot be the primary driver of these behaviors is not applicable.

      (4) The authors' point about sticking with the terminology for the ar genes as "ara" and "arb" is not convincing. The whole point of needing a change to match the field of neuroendocrinology as a whole (that is, across all vertebrates) is researchers, especially those with high standing like the Okubo group, adopt the new terminology. Indeed, the Okubo group is THE leader in medaka neuroendocrinology. It would go a long way if they began adopting the new terminology of "ar1" and "ar2". I understand this may be laborious to a degree, and each group can choose to use their terminology, but I'd be remiss if I didn't express my opinion that changing the terminology could help our field as a whole. 

      We sincerely appreciate the reviewer’s thoughtful comments regarding nomenclature consistency in vertebrate neuroendocrinology. We understand the motivation behind the suggestion to adopt ar1 and ar2. However, we consider the established nomenclature of ara and arb to be more appropriate for the following reasons.

      First, adopting the ar1/ar2 nomenclature would introduce a discrepancy between gene and protein symbols. According to the NCBI International Protein Nomenclature Guidelines (Section 2B. Abbreviations and symbols; https://www.ncbi.nlm.nih.gov/genbank/internatprot_nomenguide/), the ZFIN Zebrafish Nomenclature Conventions (Section 2. PROTEINS: https://zfin.atlassian.net/wiki/spaces/general/pages/1818394635/ZFIN+Zebrafish+Nomenclature+Conventions), and the author guidelines of many journals (e.g., https://academic.oup.com/molehr/pages/Gene_And_Protein_Nomenclature), gene and protein symbols should be identical (with proteins designated in non-italic font and with the first letter capitalized). Maintaining consistency between gene and protein symbols helps avoid unnecessary confusion. The ara/arb nomenclature allows this, whereas ar1/ar2 does not.

      Second, the two androgen receptor genes in teleosts are paralogs derived from the third round of whole-genome duplication that occurred early in teleost evolution. For such duplicated genes, the ZFIN Zebrafish Nomenclature Conventions (Section 1.2. Duplicated genes) recommend appending the suffixes “a” and “b” to the approved symbol of the human or mouse ortholog. This convention clearly indicates that these genes are whole-genome duplication paralogs and provides an intuitive way to represent orthologous and paralogous relationships between teleost genes and those of other vertebrates. As a result, it has been widely adopted, and we consider it logical and beneficial to apply the same principle to androgen receptors.

      In light of these considerations, we respectfully maintain that the ara/arb nomenclature is more suitable for the present manuscript than the alternative ar1/ar2 system.

      (5) In the discussion please discuss these potentially unexpected findings.

      (a) gal was unaffected in female cyp19a1 mutants, but they exhibit mating behaviors towards females. Given gal is higher in males and these females act like females, what does this mean about the function of gal/its utility in being a male-specific marker (is it one??)? 

      (b) esr2b expression is higher in female cyp19a1 mutants. This is unexpected as well, given that esr2b is required for female-typical mating, is higher in females compared to males, and E2 increases esr2b expression. Please explain what this means for our idea of what esr2b expression tells us.

      We thank the reviewer for the insightful comments. As the female data have been removed from the manuscript, discussion of these findings in female cyp19a1b mutants is no longer necessary.

      Reviewer #3 (Recommendations For The Authors):

      The authors have addressed a number of answers to the reviewer's comments, notably they provided missing methodological information and rephrased the text. However, the authors have not addressed the main issues raised by the reviewers. Notably, it is regrettable that the reduced amount of brain aromatase cannot be confirmed, this seems to be the primary step when validating a new mutant. Even if protein products of the two genes may not be discriminated (which I can understand), it should be possible to evaluate the expression of a common messenger and/or peptide and confirm that aromatase expression is reduced in the brain. Since Cyp19a1b is relatively more abundant in the brain than Cyp19a1a, this would strengthen the conclusion and provide confidence that the mutant indeed does silence aromatase expression in the brain. Although these shortcomings are acknowledged in the rebuttal letter, this is not mentioned in the discussion. Doing so would make the manuscript more transparent and clearer.

      As noted in Response to reviewer #3’s comment 1 on weaknesses, we have validated the loss of Cyp19a1b function by measuring its transcript levels in the brain and predicting the three-dimensional structure of the mutant protein. These analyses confirmed that Cyp19a1b function is indeed lost, thereby increasing the reliability of our conclusions.

      FigS1 - panels C&D please indicate in which tissue were hormones measured. Blood?

      We thank the reviewer for pointing this out. In our study, “peripheral” refers to the caudal half of the body excluding the head and visceral organs, not blood. Accordingly, we have revised the figure legend and the description in the Materials and Methods section as follows:

      Legend for Figure 1B (line 787) now reads: “Levels of E2, testosterone, and 11KT in the brain (A) and peripheral tissues (caudal half of the body) (B) of adult cyp19a1b<sup>+/+</sup>, cyp19a1b<sup>+/−</sup>, and cyp19a1b<sup>−/−</sup> males (n = 3 per genotype).”

      Materials and methods (line 431): The sentence “Total lipids were extracted from the brain and peripheral tissues (from the caudal half) of” has been revised to “Total lipids were extracted from the brain and from peripheral tissues, specifically the caudal half of the body excluding the head and visceral organs, of.”

      Additional Alterations:

      We have reformatted the text and supporting materials to comply with the journal’s Author Guidelines. The following changes have been made:

      (1) Figures and supplementary files are now provided separately from the main text.

      (2) The title page has been reformatted without any changes to its content.

      (3) In-text citations have been changed from numerical references to the author–year format.

      (4) Figure labels have been revised from “Fig. 1,” “Fig. S1,” etc., to “Figure 1,” “Figure 1—figure supplement 1,” etc.

      (5) Table labels have been revised from “Table S1,” etc., to “Supplementary file 1,” etc.

      (6) Line 324: The typo “is” has been corrected to “are”.

      (7) Line 382: The section heading “Materials and Methods” has been changed to “Materials and methods” (lowercase “m”).

      (8) Line 383: The Key Resources Table has been placed at the beginning of the Materials and methods section.

      (9) Line 389: The sentence “Sexually mature adults (2–6 months) were used for experiments, and tissues were consistently sampled 1–5 hours after lights on.” has been revised to “Sexually mature adults (2–6 months) were used for experiments and assigned randomly to experimental groups. Tissues were consistently sampled 1–5 hours after lights on.”

      (10)  Line 393: The sentence “All fish were handled in accordance with the guidelines of the Institutional Animal Care and Use Committee of the University of Tokyo.” has been removed.

      (11)  Line 589: The following sentence has been added: “No power analysis was conducted due to the lack of relevant data; sample size was estimated based on previous studies reporting inter-individual variation in behavior and neural gene expression in medaka.”

      (12)  Line 598: The reference list has been reordered from numerical sequence to alphabetical order by author.

      (13)  In the figure legends, notations such as “A and B” have been revised to “A, B.”

    1. Author response:

      We would like to thank both reviewers for taking the time to review the manuscript in detail. Your comments have been extremely useful and constructive. A revised version of the manuscript will seek to address the weaknesses raised, clarifying the reasons for the assumptions made, the impact they have and how they influence the policy implication of the work. We will clarify the language to differentiate the work from the standard sub-national tailoring which is typically conducted to support National Malaria Programmes and emphasise why our mechanistic model can provide greater information than simple summary statistics.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Reviews):

      Summary:

      Argunşah et al. describe and investigate the mechanisms underlying the differential response dynamics of barrel vs septa domains of the whisker-related primary somatosensory cortex (S1). Upon repeated stimulation, the authors report that the response ratio between multi- and single-whisker stimulation increases in layer (L) 4 neurons of the septal domain, while remaining constant in barrel L4 neurons. This difference is attributed to the short-term plasticity properties of interneurons, particularly somatostatin-expressing (SST+) neurons. This claim is supported by the increased density of SST+ neurons found in L4 of the septa compared to barrels, along with a stronger response of (L2/3) SST+ neurons to repeated multi- vs single-whisker stimulation. The role of the synaptic protein Elfn1 is then examined. Elfn1 KO mice exhibited little to no functional domain separation between barrel and septa, with no significant difference in single- versus multi-whisker response ratios across barrel and septal domains. Consistently, a decoder trained on WT data fails to generalize to Elfn1 KO responses. Finally, the authors report a relative enrichment of S2- and M1-projecting cell densities in L4 of the septal domain compared to the barrel domain.

      Strengths:

      This paper describes and aims to study a circuit underlying differential response between barrel columns and septal domains of the primary somatosensory cortex. This work supports the view that barrel and septal domains contribute differently to processing single versus multi-whisker inputs, suggesting that the barrel cortex multiplexes sensory information coming from the whiskers in different domains.

      We thank the reviewer for the very neat summary of our findings that barrel cortex multiplexes converging information in separate domains.

      Weaknesses:

      While the observed divergence in responses to repeated SWS vs MWS between the barrel and septal domains is intriguing, the presented evidence falls short of demonstrating that short-term plasticity in SST+ neurons critically underpins this difference. The absence of a mechanistic explanation for this observation limits the work’s significance. The measurement of SST neurons’ response is not specific to a particular domain, and the Elfn1 manipulation does not seem to be specific to either stimulus type or a particular domain.

      We appreciate the reviewer’s perspective. Although further research is needed to understand the circuit mechanisms underlying the observed phenomenon, we believe our data suggest that altering the short-term dynamics of excitatory inputs onto SST neurons reduces the divergent spiking dynamics in barrels versus septa during repetitive single- and multi-whisker stimulation. Future work could examine how SST neurons, whose somata reside in barrels and septa, respond to different whisker stimuli and the circuits in which they are embedded. At this time, however, the authors believe there is no alternative way to test how the short-term dynamics of excitatory inputs onto SST neurons, as a whole, contribute to the temporal aspects of barrel versus septa spiking.

      The study's reach is further constrained by the fact that results were obtained in anesthetized animals, which may not generalize to awake states.

      We appreciate the reviewer’s concern regarding the generalizability of our findings from anesthetized animals to awake states. Anesthesia was employed to ensure precise individual whisker stimulation (and multi-whisker stimulation in the same animal), which is challenging in awake rodents due to active whisking. While anesthesia may alter higher-order processing, core mechanisms, such as short- and long-term plasticity in the barrel cortex, are preserved under anesthesia (Martin-Cortecero et al., 2014; Mégevand et al., 2009).

      The statistical analysis appears inappropriate, with the use of repeated independent tests, dramatically boosting the false positive error rate.

      Thank you for your feedback on our analysis using independent rank-based tests for each time point in wild-type (WT) animals. To address concerns regarding multiple comparisons and temporal dependencies (for Figures 1F and 4D for now; we will add more in our revision), we performed a repeated measures ANOVA for WT animals (13 Barrel, 8 Septa, 20 time points), which revealed a significant main effect of Condition (F(1,19) = 16.33, p < 0.001) and a significant Condition-Time interaction (F(19,361) = 2.37, p = 0.001). Post hoc tests confirmed significant differences between Barrel and Septa at multiple time points (e.g., p < 0.0025 at times 3, 4, 6, 7, 8, 10, 11, 12, 16, 19 after Bonferroni correction), supporting a differential multi-whisker vs. single-whisker ratio response in WT animals. In contrast, a repeated measures ANOVA for knock-out (KO) animals (11 Barrel, 7 Septa, 20 time points) showed no significant main effect of Condition (F(1,14) = 0.17, p = 0.684) or Condition-Time interaction (F(19,266) = 0.73, p = 0.791), indicating that the Barrel-Septa difference observed in WT animals is absent in KO animals.
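
      For readers who want to trace the arithmetic behind the WT statistics quoted above, the short Python sketch below derives the degrees of freedom and the per-time-point Bonferroni threshold. It assumes that Condition (Barrel vs. Septa) is a two-level between-unit factor, that Time is a within-unit factor, and that the family-wise alpha is 0.05; these are assumptions made for illustration, not details stated by the authors. The function names are hypothetical.

```python
def mixed_anova_dfs(n_group_a, n_group_b, n_timepoints):
    """Degrees of freedom for a two-group x time mixed-design ANOVA.

    Assumes Condition is a between-unit factor with two levels and Time is
    a within-unit factor (an assumption about the design, for illustration).
    """
    n_units = n_group_a + n_group_b
    df_condition = 1                                   # two groups - 1
    df_error_between = n_units - 2                     # units - groups
    df_interaction = df_condition * (n_timepoints - 1)
    df_error_within = df_error_between * (n_timepoints - 1)
    return (df_condition, df_error_between), (df_interaction, df_error_within)


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


# WT group sizes as quoted in the response: 13 Barrel, 8 Septa, 20 time points.
wt_condition_df, wt_interaction_df = mixed_anova_dfs(13, 8, 20)
# Assumed family-wise alpha of 0.05 spread over the 20 time points.
per_timepoint_alpha = bonferroni_alpha(0.05, 20)
```

      Under these assumptions the WT values come out as (1, 19) for Condition and (19, 361) for the Condition-Time interaction, with a per-time-point threshold of 0.0025, in line with the figures reported for WT animals.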

      Furthermore, the manuscript suffers from imprecision; its conclusions are occasionally vague or overstated. The authors suggest a role for SST+ neurons in the observed divergence in SWS/MWS responses between barrel and septal domains. However, this remains speculative, and some findings appear inconsistent. For instance, the increased response of SST+ neurons to MWS versus SWS is not confined to a specific domain. Why, then, would preferential recruitment of SST+ neurons lead to divergent dynamics between barrel and septal regions? The higher density of SST+ neurons in septal versus barrel L4 is not a sufficient explanation, particularly since the SWS/MWS response divergence is also observed in layers 2/3, where no difference in SST+ neuron density is found.

      Moreover, SST+ neuron-mediated inhibition is not necessarily restricted to the layer in which the cell body resides. It remains unclear through which differential microcircuits (barrel vs septum) the enhanced recruitment of SST+ neurons could account for the divergent responses to repeated SWS versus MWS stimulation.

      We fully appreciate the reviewer’s comment. We currently do not provide any evidence on the contribution of SST neurons in the barrels versus septa in layer 4 to the divergence in spiking responses observed in SWS versus MWS. We only show that these neurons are differentially distributed between the two domains in this layer. It is well known that there is molecular and circuit-based diversity of SST-positive neurons in different layers of the cortex, so it is plausible that this includes cells located in the two domains of vS1, something which has not been examined so far. Our data on their distribution are one piece of evidence that SST neurons may have a differential role in inhibiting barrel stellate cells versus septal ones. Morphological reconstructions of SST neurons in L4 of the somatosensory barrel cortex have shown that their dendrites and axons project locally and may be confined to individual domains, even though this was not specifically examined (Fig. 3 of Scala F et al., 2019). The same study also showed that L4 SST cells receive excitatory input from local stellate cells, and it is known that they are also directly excited by thalamocortical fibers (Beierlein et al., 2003; Tan et al., 2008), both of which facilitate.

      As shown in our supplementary figure, the divergence is also observed in L2/3 where, as the reviewer also points out, we do not find a differential distribution of SST cells, at least based on a columnar analysis extending from L4. There are multiple scenarios that could explain this “discrepancy”, which would need to be examined in future studies. A straightforward one is that the divergence in spiking in L2/3 domains may be inherited from the L4 domains on which L4 SST cells act. Another is that, even though L2/3 SST neurons are not biased in their distribution, their input-output function is, something which one would need to examine by detailed in vitro electrophysiological and perhaps optogenetic approaches in S1. Despite the distinctive differences that have been found between the L4 circuitry in S1 and V1 (Scala F et al., 2019), recent observations indicate that small but regular patches of V1 marked by the absence of muscarinic receptor 2 (M2) have high temporal acuity (Ji et al., 2015), and selectively receive input from SST interneurons (Meier et al., 2025). Regions lacking M2 have distinct input and output connectivity patterns from those that express M2 (Meier et al., 2021; Burkhalter et al., 2023). These findings, together with ours, suggest that SST cells preferentially innervate and regulate specific domains (columns) in sensory cortices.

      Regardless of the mechanism, the Elfn1 knock-out mouse line almost exclusively affects the incoming excitability onto SST neurons (see also reply to comment below), hence what can be supported by our data is that changing the incoming short-term synaptic plasticity onto these neurons brings the spiking dynamics between barrels and septa closer together.

      The Elfn1 KO mouse model seems too unspecific to suggest the role of the short-term plasticity in SST+ neurons in the differential response to repeated SWS vs MWS stimulation across domains. Why would Elfn1-dependent short-term plasticity in SST+ neurons be specific to a pathway, or a stimulation type (SWS vs MWS)? Moreover, the authors report that Elfn1 knockout alters synapses onto VIP+ as well as SST+ neurons (Stachniak et al., 2021; previous version of this paper), so why attribute the phenotype solely to SST+ circuitry? In fact, the functional distinctions between barrel and septal domains appear largely abolished in the Elfn1 KO.

      Previous work by others and us has shown that globally removing Elfn1 selectively removes a synaptic process from the brain without altering brain anatomy or structure. This allows us to study how the temporal dynamics of inhibition shape activity, as opposed to inhibition from particular cell types. We will nevertheless update the text to discuss more global implications for SST interneuron dynamics and include a reference to VIP interneurons that contain Elfn1.

      When comparing SWS to MWS, we find that MWS replaces the neighboring excitation which would normally be preferentially removed by short-term plasticity in SST interneurons, thus providing a stable control comparison across animals and genotypes. On average, VIP interneurons failed to show modulation by MWS. We were unable to measure a substantial contribution of VIP cells to this process and also note that the Elfn1 expressing multipolar neurons comprise only ~5% of VIP neurons (Connor and Peters, 1984; Stachniak et al., 2021), a fraction that may be lost when averaging from 138 VIP cells. Moreover, the effect of Elfn1 loss on VIP neurons is quite different and marginal compared to that of SST cells, suggesting that the primary impact of Elfn1 knockout is mediated through SST+ interneuron circuitry. Therefore, even if we cannot rule out that these 5% of VIP neurons contribute to barrel domain segregation, we are of the opinion that their influence would be very limited if any.

      Reviewer #2 (Public Reviews):

      Summary:

      Argunsah and colleagues demonstrate that SST-expressing interneurons are concentrated in the mouse septa and differentially respond to repetitive multi-whisker inputs. Identifying how a specific neuronal phenotype impacts responses is an advance.

      Strengths:

      (1)  Careful physiological and imaging studies.

      (2)  Novel result showing the role of SST+ neurons in shaping responses.

      (3)  Good use of a knockout animal to further the main hypothesis.

      (4)  Clear analytical techniques.

      We thank the reviewer for their appreciation of the study.

      Weaknesses:

      No major weaknesses were identified by this reviewer. Overall, I appreciated the paper but feel it overlooked a few issues and had some recommendations on how additional clarifications could strengthen the paper. These include:

      (1) Significant work from Jerry Chen on how S1 neurons that project to M1 versus S2 respond in a variety of behavioral tasks should be included (e.g. PMID: 26098757). Similarly, work from Barry Connor’s lab on intracortical versus thalamocortical inputs to SST neurons, as well as excitatory inputs onto these neurons (e.g. PMID: 12815025) should be included.

      We thank the reviewer for these valuable resources that we overlooked. We will include Chen et al. (2015), Cruikshank et al. (2007) and Gibson et al. (1999) to contextualize S1 projections and SST+ inputs, strengthening the study’s foundation, as well as Beierlein et al. (2003), which nicely shows both local and thalamocortical facilitation of excitatory inputs onto L4 SST neurons, in contrast to PV cells. The paper also shows the gradual recruitment of SST neurons by thalamocortical inputs to provide feed-forward inhibition onto stellate cells (regular spiking) of the barrel cortex L4 in rat.

      (2) Using Layer 2/3 as a proxy to what is happening in layer 4 (~line 234). Given that layer 2/3 cells integrate information from multiple barrels, as well as receiving direct VPm thalamocortical input, and given the time window that is being looked at can receive input from other cortical locations, it is not clear that layer 2/3 is a proxy for what is happening in layer 4.

      We agree with the reviewer that what we observe in L2/3 is not necessarily what is taking place in L4 SST-positive cells. The data on L2/3 were included to show that these cells, as a population, can show divergent responses to SWS vs MWS, which is not seen in L2/3 VIP neurons. Regardless of the underlying mechanisms, our overall data support that SST-positive neurons can change their activation based on the type of whisker stimulus and that, when the excitatory input dynamics onto these neurons change due to the removal of Elfn1, the recruitment of barrel vs septa spiking changes in the temporal domain. Having said that, the data shown in Supplementary Figure 3 on the response properties of L2/3 neurons above the septa vs above the barrels (one would say in the respective columns) do show the same divergence as in L4. This suggests that a circuit motif may exist that is common to both layers, involving SST neurons that sit in L4, L5 or even L2/3. This implies that despite the differences in the distribution of SST neurons in septa vs barrels of L4, there is an unidentified input-output spatial connectivity motif that is engaged in both L2/3 and L4. Please also see our response to a similar point raised by reviewer 1.

      (3) Line 267, when discussing distinct temporal response, it is not well defined what this is referring to. Are the neurons no longer showing peaks to whisker stimulation, or are the responses lasting a longer time? It is unclear why PV+ interneurons which may not be impacted by the Elfn1 KO and receive strong thalamocortical inputs, are not constraining activity.

      We thank the reviewer for their comment and will clarify the statement.

      This convergence of response profiles was further evident in stimulus-aligned stacked images, where the emergent differences between barrels and septa under SWS were largely abolished in the KO (Figure 4B). A distinction between directly stimulated barrels and neighboring barrels persisted in the KO. In addition, the initial response continued to differ between barrel and septa and also between septa and neighbor (Figure 4B). This initial stimulus selectivity potentially represents distinct feedforward thalamocortical activity, which includes PV+ interneuron recruitment that is not directly impacted by the Elfn1 KO (Sun et al., 2006; Tan et al., 2008). PV+ cells are strongly excited by thalamocortical inputs, but these exhibit short-term depression, as does their output, contrasting with the sustained facilitation observed in SST+ neurons. These findings suggest that in WT animals, activity spillover from principal barrels is normally constrained by the progressive engagement of SST+ interneurons in septal regions, driven by Elfn1-dependent facilitation at their excitatory synapses. In the absence of Elfn1, this local inhibitory mechanism is disrupted, leading to longer responses in barrels, delayed but stronger responses in septa, and persistently stronger responses in unstimulated neighbors, resulting in a loss of distinction between the responses of barrel and septa domains that normally diverge over time (see Author response image 1 below).

      Author response image 1.

      (A) Barrel responses are longer following whisker stimulation in KO. (B) Septal responses are slightly delayed but stronger in KO. (C) Unstimulated neighbors show longer persistent responses in KO.

       

      (4) Line 585 “the earliest CSD sink was identified as layer 4…” were post-hoc measurements made to determine where the different shank leads were based on the post-hoc histology?

      Post hoc histology was performed on plane-aligned brain sections, which allowed us to detect barrels and septa and thereby confirm the insertion domain of each recorded shank. The layer specificity of each electrode could therefore not be confirmed by histology, as we did not have coronal sections in which to measure electrode depth.

      (5) For the retrograde tracing studies, how were the M1 and S2 injections targeted (stereotaxically or physiologically)? How was it determined that the injections were in the whisker region (or not)?

      During the retrograde virus injection, the location of M1 and S2 injections was determined by stereotaxic coordinates (Yamashita et al., 2018). After acquiring the light-sheet images, we were able to post hoc examine the injection site in 3D and confirm that the injections were successful in targeting the regions intended. Although it would have been informative to do so, we did not functionally determine the whisker-related M1 and whisker-related S2 region in this experiment.

      (6) Were there any baseline differences in spontaneous activity in the septa versus barrel regions, and did this change in the KO animals?

      Thank you for this interesting question. Our previous study found that there was a reduction in baseline activity in L4 barrel cortex of KO animals at postnatal day (P)12, but no differences were found at P21 (Stachniak et al., 2023).

      Reviewer #3 (Public Reviews):

      Summary:

      This study investigates the functional differences between barrel and septal columns in the mouse somatosensory cortex, focusing on how local inhibitory dynamics, particularly involving Elfn1-expressing SST⁺ interneurons, may mediate temporal integration of multiwhisker (MW) stimuli in septa. Using a combination of in vivo multi-unit recordings, calcium imaging, and anatomical tracing, the authors propose that septa integrate MW input in an Elfn1-dependent manner, enabling functional segregation from barrel columns.

      Strengths:

      The core hypothesis is interesting and potentially impactful. While barrels have been extensively characterized, septa remain less understood, especially in mice, and this study's focus on septal integration of MW stimuli offers valuable insights into this underexplored area. If septa indeed act as selective integrators of distributed sensory input, this would add a novel computational role to cortical microcircuits beyond what is currently attributed to barrels alone. The narrative of this paper is intellectually stimulating.

      We thank the reviewer for finding the study intellectually stimulating.

      Weaknesses:

      The methods used in the current study lack the spatial and cellular resolution needed to conclusively support the central claims. The main physiological findings are based on unsorted multi-unit activity (MUA) recorded via low-channel-count silicon probes. MUA inherently pools signals from multiple neurons across different distances and cell types, making it difficult to assign activity to specific columns (barrel vs. septa) or neuron classes (e.g., SST⁺ vs. excitatory).

      The recording radius (~50-100 µm or more) and the narrow width of septa (~50-100 µm or less) make it likely that MUA from "septal" electrodes includes spikes from adjacent barrel neurons.

      The authors do not provide spike sorting, unit isolation, or anatomical validation that would strengthen spatial attribution. Calcium imaging is restricted to SST⁺ and VIP⁺ interneurons in superficial layers (L2/3), while the main MUA recordings are from layer 4, creating a mismatch in laminar relevance.

      We thank the reviewer for pointing out the possibility of contamination in septal electrodes. Importantly, it may not have been highlighted, although reported in the methods, but we used an extremely high threshold (7.5 std, in methods, line 583) for spike detection in order to overcome the issue raised here, which restricts such spatial contaminations. Since the spike amplitude decays rapidly with distance, at high thresholds, only nearby neurons contribute to our analysis, potentially one or two. We believe that this approach provides a very close approximation of single unit activity (SUA) in our reported data. We will include a sentence earlier in the manuscript to make this explicit and prevent further confusion.
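
The thresholding logic described above can be illustrated with a minimal sketch. It assumes a filtered voltage trace, negative-going spikes, and a threshold expressed as a multiple of the trace's standard deviation. The function name `detect_spikes` and the toy trace are hypothetical and are not the analysis code used in the study; they only show why a high threshold keeps large (nearby) events and discards small (distant) ones.

```python
import statistics


def detect_spikes(trace, n_std=7.5):
    """Return indices where the trace first drops below mean - n_std * SD.

    Hypothetical helper for illustration; polarity and SD estimator are assumptions.
    """
    mean = statistics.fmean(trace)
    sd = statistics.pstdev(trace)
    threshold = mean - n_std * sd
    spikes = []
    below = False
    for i, value in enumerate(trace):
        if value < threshold and not below:
            spikes.append(i)  # keep only the first sample of each crossing
        below = value < threshold
    return spikes


# Toy trace: low-amplitude noise plus one large and one small deflection.
trace = [0.0, 0.1, -0.1] * 200
trace[100] = -20.0  # large event, standing in for a unit close to the electrode
trace[300] = -1.0   # small event, standing in for a more distant unit

print(detect_spikes(trace))             # strict threshold (7.5 SD)
print(detect_spikes(trace, n_std=1.0))  # permissive threshold (1 SD)
```

In this toy case the strict threshold retains only the large event, whereas the permissive threshold also admits the small one, which is the kind of spatial contamination the high threshold is meant to limit.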

      Regarding the point on calcium imaging being performed on L2/3 SST and VIP cells instead of L4, reviewers 1 and 2 raised the same issue, and we responded as follows. As shown in our supplementary figure, the divergence is also observed in L2/3 where we do not have a differential distribution of SST cells, at least based on a columnar analysis extending from L4. There are multiple scenarios that could explain this “discrepancy” that one would need to examine further in future studies. One straightforward one is that the divergence in spiking in L2/3 domains may be inherited from L4 domains, on which L4 SST cells act. Another is that even though L2/3 SST neurons are not biased in their distribution, their input-output function is, something which one would need to examine by detailed in vitro electrophysiological and perhaps optogenetic approaches in S1. Despite the distinctive differences that have been found between the L4 circuitry in S1 and V1 (Scala F et al., 2019), recent observations indicate that small but regular patches of V1 marked by the absence of muscarinic receptor 2 (M2) have high temporal acuity (Ji et al., 2015), and selectively receive input from SST interneurons (Meier et al., 2025). Regions lacking M2 have distinct input and output connectivity patterns from those that express M2 (Meier et al., 2021; Burkhalter et al., 2023). These findings, together with ours, suggest that SST cells preferentially innervate and regulate specific domains -columns- in sensory cortices.

      Furthermore, while the role of Elfn1 in mediating short-term facilitation is supported by prior studies, no new evidence is presented in this paper to confirm that this synaptic mechanism is indeed disrupted in the knockout mice used here.

      We thank Reviewer #3 for noting the absence of new evidence confirming Elfn1’s disruption of short-term facilitation in our knockout mice. We acknowledge that our study relies on strong previously published data demonstrating that Elfn1 mediates short-term synaptic facilitation of excitatory inputs onto SST+ interneurons (Sylwestrak and Ghosh, 2012; Tomioka et al., 2014; Stachniak et al., 2019, 2023). These studies consistently show that Elfn1 knockout abolishes facilitation in SST+ synapses, leading to altered temporal dynamics, which we hypothesize underlies the observed loss of barrel-septa response divergence in our Elfn1 KO mice (Figure 4). Nevertheless, to address the point raised, we will clarify in the revised manuscript (around lines 245-247 and 271-272) that our conclusions are based on these established findings, stating: “Building on prior evidence that Elfn1 knockout disrupts short-term facilitation in SST+ interneurons (Sylwestrak and Ghosh, 2012; Tomioka et al., 2014; Stachniak et al., 2019, 2023), we attribute the abolished barrel-septa divergence in Elfn1 KO mice to altered SST+ synaptic dynamics, though direct synaptic measurements were not performed here.”

      Additionally, since Elfn1 is constitutively knocked out from development, the possibility of altered circuit formation-including changes in barrel structure and interneuron distribution, cannot be excluded and is not addressed.

      We thank Reviewer #3 for raising the valid concern that constitutive Elfn1 knockout could potentially alter circuit formation, including barrel structure and interneuron distribution. To address this, we will clarify in the revised manuscript (around line ~271 and in the Discussion) that in our previous studies that included both whole-cell patch-clamp in acute brain slices ranging from postnatal day 11 to 22 (P11 - P21) and in vivo recordings from barrel cortex at P12 and P21, we saw no gross abnormalities in barrel structure, with Layer 4 barrels maintaining their characteristic size and organization, consistent with wildtype (WT) mice (Stachniak et al., 2019, 2023). While we cannot fully exclude subtle developmental changes, prior studies indicate that Elfn1 primarily modulates synaptic function rather than cortical cytoarchitecture (Tomioka et al., 2014). Elfn1 KO mice show no gross morphological or connectivity differences and the pattern and abundance of Elfn1 expressing cells (assessed by LacZ knock in) appears normal (Dolan and Mitchell, 2013).

      We will add the following to the Discussion: “Although Elfn1 is constitutively knocked out, we find here and in previous studies that barrel structure is preserved (Stachniak et al., 2019, 2023). Further, the distribution of Elfn1 expressing interneurons is not different in KO mice, suggesting minimal developmental disruption (Dolan and Mitchell, 2013).

      Nonetheless, we acknowledge that subtle circuit changes cannot be ruled out without the use of a time-dependent conditional knockout of the gene.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      (1) My biggest concern is regarding statistics. Did the authors repeatedly apply independent tests (Mann-Whitney) without any correction for multiple comparisons (Figures 1 and 4)? In that case, the chances of a spurious "significant" result rise dramatically. 

      In response to the reviewer’s comment, we now present new statistical results by utilizing ANOVA and blended these results in the manuscript between lines 172 and 192 for WT data and 282 and 298 for Elfn1 KO data. This new statistical approach shows the same differences as we had previously reported, hence consolidating the statements made. 
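
The reviewer's concern can be made concrete with a short sketch. It assumes independent tests at a fixed significance level and uses 20 comparisons as an example, matching the 20 pulses in a stimulus train. The helper names are hypothetical, and the Holm step-down procedure shown here is a generic correction included for comparison only; it is not the ANOVA-based approach adopted in the revision.

```python
def familywise_error(n_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def holm_reject(p_values, alpha=0.05):
    """Holm step-down correction: return a reject flag for each p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values also fail
    return reject


# With 20 uncorrected tests, a spurious "significant" result is more likely than not.
print(round(familywise_error(20), 4))

# Toy p-values (made up for illustration): only the smallest survives correction.
print(holm_reject([0.001, 0.04, 0.03]))
```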

      (2) The findings only hint at a mechanism involving SST+ neurons for how SWS and MWS are processed differently in the barrel vs septal domains. As a direct test of SST+ neuron involvement in the divergence of barrel and septal responses, the authors might consider SST-specific manipulations - for example, inhibitory chemo- or optogenetics during SWS and MWS stimulation.

      We thank the reviewer for this comment and agree that a direct manipulation of SST+ neurons via inhibitory chemo- or optogenetics could provide further supporting evidence for the main claims in our study. We have opted not to perform these experiments for this manuscript, as we feel they can be part of a future study. At the same time, it is conceivable that, depending on how they are performed, such manipulations may lead to larger and non-specific effects on cortical activity, since SST neurons would likely be completely shut down. So even though we certainly appreciate and value the strengths of such approaches, our experiments have addressed a more nuanced hypothesis, namely that the synaptic dynamics onto SST+ neurons matter for the response divergence of septa versus barrels, which could not have been easily and concretely addressed by manipulating SST+ cell firing activity.

      (3) In general, it is hard to comprehend what microcircuit could lead to the observed divergence in the MWS/SWS ratio in the barrel vs septal domain. The preferential recruitment of SST+ neurons during MWS is not specific to a particular domain, and the higher density of SST+ neurons specifically in L4 septa cannot per se explain the diverging MWS/SWS ratio in L4 septal neurons, since a similar ratio divergence is observed across domains in L2/3 neurons without increased SST+ neuron density in L2/3. This view would also assume that SST+ inhibition remains contained to its own layer and domain. Is this the case? Is it that different microcircuits between barrels and septa differently shape the response to repeated MWS? This is partially discussed in the paper; can the authors develop on that? What would the proposed mechanism be? Can the short-term plasticity of the thalamic inputs (VPM vs POm) be part of the picture?

      We thank the reviewer for raising this important point. We propose that the divergence in MWS/SWS ratios across barrel and septal domains arises from dynamic microcircuit interactions rather than from static anatomical features such as SST+ density, which we describe and which can only provide a hint. In L2/3, where SST+ density is uniform, divergence persists, suggesting that trans-laminar and trans-domain interactions are key. Barrel domains, primarily receiving VPM inputs, exhibit short-term depression onto excitatory cells and engage PV+ and SST+ neurons to stabilize the MWS/SWS ratio, with Elfn1-dependent facilitation of SST+ neurons gradually increasing inhibition during repetitive SWS. Septal domains, in contrast, are targeted by facilitating POm inputs, combined with higher L4 SST+ density and Elfn1-mediated facilitation, producing progressive inhibitory buildup that amplifies the MWS/SWS ratio. SST+ projections in septa may extend trans-laminarly and laterally, influencing L2/3 and neighboring barrels, thereby explaining L2/3 divergence despite uniform SST+ density in L2/3. In this regard, direct laminar-dependent manipulations will be required to confirm whether L2/3 divergence is inherited from L4 dynamics. In Elfn1 KO mice, the loss of facilitation in SST+ neurons likely flattens these dynamics, disrupting functional segregation. Future experiments using VPM/POm-specific optogenetic activation and SST+ silencing will be critical to directly test this model.

      We expanded the discussion accordingly.

      (4) Can the decoder generalize between SWS and MWS? In this condition, if the decoder accuracy is higher for barrels than septa, it would support the idea that septa are processing the two stimuli differently. 

      Our results show that septal decoding accuracy is generally higher than barrel accuracy when generalizing from multi-whisker stimulation (MWS) to single-whisker stimulation (SWS), indicating distinct information processing in septa compared to barrels.

      In wild-type (WT) mice, septal accuracy exceeds barrel accuracy across all time windows (1-50ms, 51-95ms, 1-95ms), with the largest difference in the 51-95ms window (0.9944 vs. 0.9214 at pulse 20, 10Hz stimulation). This septal advantage grows with successive pulses, reflecting robust, separable neural responses, likely driven by the posterior medial nucleus (POm)’s strong MWS integration contrasting with minimal SWS activation. Barrel responses, driven by consistent ventral posteromedial nucleus (VPM) input for both stimuli, are less distinguishable, leading to lower accuracy.

      In Elfn1 knockout (KO) mice, which disrupt excitatory drive to somatostatin-positive (SST+) interneurons, barrel accuracy is higher initially in the 1-50ms window (0.8045 vs. 0.7500 at pulse 1), suggesting reduced early septal distinctiveness. However, septal accuracy surpasses barrels in later pulses and time windows (e.g., 0.9714 vs. 0.9227 in 51-95ms at pulse 20), indicating restored septal processing. This supports the role of SST+ interneurons in shaping distinct MWS responses in septa, particularly in late-phase responses (51-95ms), where inhibitory modulation is prominent, as confirmed by calcium imaging showing stronger SST+ activation during MWS.

      These findings demonstrate that septa process SWS and MWS differently, with higher decoding accuracy reflecting structured, POm- and SST+-driven response patterns. In Elfn1 KO mice, early deficits in septal processing highlight the importance of SST+ interneurons, with later recovery suggesting compensatory mechanisms. 

      We have added Supplementary Figure 4 and included this interpretation between lines 338-353.

      We thank the reviewer for suggesting this analysis.

      (5) It is not clear to me how the authors achieve SWS. How is it that the pipette tip "placed in contact with the principal whisker" does not detach from the principal whisker or stimulate other whiskers? Please clarify the methods. 

      Targeting the specific principal whisker is performed under the stereoscope.  

      Specifically, we have added this statement in line 628:

      “We trimmed the whiskers where necessary, to avoid them touching each other and to avoid stimulating other whiskers. By putting the pipette tip very close (almost touching) to the principal whisker, the movement of the tip (limited to 1mm) would reliably move the targeted whisker. The specificity of the stimulation of the selected principal whisker was observed under the stereoscope.”

      (6) The method for calculating decoder accuracy is not clearly described: how can accuracy exceed 1? The authors should clarify this metric and provide measures of variability (e.g., confidence intervals or standard deviations across runs) to assess the significance of their comparisons. Additionally, using a consistent scale across all plots would improve interpretability.

      We thank the reviewer for raising this point. We have now changed the way accuracies are calculated and adopted a common scale among different plots (see updated Figure 5). We have also changed the methods section accordingly.
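
As a point of reference for the metric discussed above, the sketch below shows accuracy defined as the fraction of correct predictions, which is bounded between 0 and 1 by construction, together with a percentile bootstrap interval as one possible measure of variability. This is an illustration under stated assumptions, not the revised calculation used for Figure 5: the function names, the toy labels, and the choice of a percentile bootstrap are all hypothetical.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels (always in [0, 1])."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def bootstrap_ci(y_true, y_pred, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for accuracy (illustrative)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample trials with replacement
        scores.append(accuracy([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int((1 - level) / 2 * n_boot)]
    hi = scores[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi


# Toy example: 20 trials with made-up labels and two decoding errors.
y_true = ["SWS", "MWS"] * 10
y_pred = list(y_true)
y_pred[0], y_pred[1] = "MWS", "SWS"

print(accuracy(y_true, y_pred))
print(bootstrap_ci(y_true, y_pred))
```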

      (7) Figure 1: The sample size is not specified. It looks like the numbers match the description in the methods, but the sample size should be clearly stated here. 

      These are the numbers the reviewer is inquiring about. 

      WT: a 280 × 95 × 20 matrix for the stimulated barrel (14 barrels, 95 ms, 20 pulses), a 180 × 95 × 20 matrix for the septa (9 septa, 95 ms, 20 pulses), and a 360 × 95 × 20 matrix for the neighboring barrel (18 neighboring barrels, 95 ms, 20 pulses), from N=4 mice.

      KO: 11 barrel columns, 7 septal columns, and 11 unstimulated neighbors from N=4 mice.

      Panels D-F are missing axes and axis labels (firing rate, p-value). Panel D is mislabeled (left, middle, and right). I can't seem to find the yellow line. 

      Thank you for this observation. We made changes in the figures to make them easier to navigate based on the collective feedback from the reviewers.

      Why change the way of comparing the differences in the responses to repeated stimulation between SWS and MWS?

      To assess temporal accumulation of information, we compared responses to repeated single-whisker stimulation (SWS) and multi-whisker stimulation (MWS) using an accumulative decoding approach rather than simple per-pulse firing rates. This method captures domain-specific integration dynamics over successive pulses.

      The use of the term "principal whisker" is confusing, as it could refer to the whisker that corresponds to the recorded barrel. 

      When we use the term principal whisker, the intention is indeed to refer to the whisker corresponding to the recorded barrel during single whisker stimulation. The term principal whisker has been removed from the legends of Figure 1 and Figure S1C, where it may have led to ambiguity.

      Why the statement "after the start of active whisking"? Mice are under anesthesia here; it does not appear to be relevant for the figure. 

      “After the start of active whisking” refers to the state of the barrel cortex circuitry at the time of recordings. The particular reference we use comes from the habit of assessing sensory processing also from a developmental point of view. The reviewer is correct that it has nothing to do with the status of the experiment. Nevertheless, since the reviewer found that it may create confusion, we have now taken it out.

      (8) Figure 3: The y-axis label is missing for panel C. 

      This is now fixed. (dF/F).

      (9) Figure 4: Axis labels are missing.

      Added.

      Minor: 

      (10) Line 36: "progressive increase in septal spiking activity upon multi-whisker stimulation". There is no increase in septal spiking activity upon MWS; the ratio MWS/SWS increases.

      We have changed the sentence as follows: Genetic removal of Elfn1, which regulates the incoming excitatory synaptic dynamics onto SST+ interneurons, leads to the loss of the progressive increase in septal spiking ratio (MWS/SWS) upon stimulation.
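
The distinction the reviewer draws can be shown with a small numerical sketch. The firing rates below are made up for illustration and the function name `mws_sws_ratio` is hypothetical; the point is only that the MWS/SWS ratio can rise across pulses even when the MWS response itself declines, provided the SWS response declines faster.

```python
def mws_sws_ratio(mws_rates, sws_rates):
    """Per-pulse ratio of MWS-evoked to SWS-evoked firing rate."""
    return [m / s for m, s in zip(mws_rates, sws_rates)]


# Toy per-pulse firing rates (arbitrary units, not data from the study).
sws = [10.0, 6.0, 4.0, 3.0]  # adapts strongly over the train
mws = [10.0, 8.0, 7.0, 6.0]  # also decreases, but more slowly

ratios = mws_sws_ratio(mws, sws)
print(ratios)
```

Here the MWS response falls from pulse to pulse while the ratio increases, which is why the revised sentence refers to the spiking ratio rather than to spiking activity.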

      (11) Line 105: domain-specific, rather than column-specific, for consistency.

      We have changed it.

      (12) Lines 173-174: "a divergence between barrel and septa domain activity also occurred in Layer 4 from the 2nd pulse onward (Figure 1E)". The authors only show a restricted number of comparisons. Why not show the p-values as for SWS?

      The statistics are now presented in the current Figure 1E.

      (13) Lines 151-153: "Correspondingly, when a single whisker is stimulated repeatedly, the response to the first pulse is principally bottom-up thalamic-driven responses, while the later pulses in the train are expected to also gradually engage cortico-thalamo-cortical and cortico-cortical loops." Can the authors please provide a reference?

      We have now added the following references : (Kyriazi and Simons, 1993; Middleton et al., 2010; Russo et al., 2025).

      (14) Lines 184-186: "Our electrophysiological experiments show a significant divergence of responses over time upon both SWS and MWS in L4 between barrels (principal and neighboring) and adjacent septa, with minimal initial difference". The only difference between the neighboring barrel and septa is the responses to the initial pulse. Can the author clarify? 

      We have now changed the sentence as follows: Our electrophysiological experiments show a significant divergence of responses between domains upon both SWS and MWS in L4. (Line 198 now)

      (15) Line 214: "suggest these interneurons may play a role in diverging responses between barrels and septa upon SWS". Why SWS specifically?

      We have changed the sentence as follows: These results confirmed that SST+ and VIP+ interneurons have higher densities in septa compared to barrels in L4 and suggest these interneurons may play a role in diverging responses between barrels and septa. (Line 231 now).

      (16) Line 235: "This result suggests that differential activation of SST+ interneurons is more likely to be involved in the domain-specific temporal ratio differences between barrels and septa". Why? The results here are not domain-specific.

      We have now revised this statement to: This result suggested that temporal ratio differences specific to barrels and septa might involve differential activation of SST+ interneurons rather than VIP+ interneurons.

      (17) Lines 241-243: "SST+ interneurons in the cortex are known to show distinct short-term synaptic plasticity, particularly strong facilitation of excitatory inputs, which enables them to regulate the temporal dynamics of cortical circuits." Please provide a reference.

      We have now added the following references: (Grier et al., 2023; Liguz-Lecznar et al., 2016).

      (18) Lines 245-247: "A key regulator of this plasticity is the synaptic protein Elfn1, which mediates short-term synaptic facilitation of excitation on SST+ interneurons (Stachniak et al., 2021, 2019; Tomioka et al., 2014)". Is Stachniak et al., 2021 not about the role of Elfn1 in excitatory-to-VIP+ neuron synapses?

      The reviewer correctly spotted this discrepancy. This reference has now been removed from this statement.

      (19) Lines 271-272: "Building on our findings that Elfn1-dependent facilitation in SST+ interneurons is critical for maintaining barrel-septa response divergence". The authors did not show that.

      We have now changed the statement to: Building on our findings that Elfn1 is critical for maintaining barrel-septa response divergence  

      (20) Line 280: second firing peak, not "peal".

      Thank you, it is now fixed.

      (21) Lines 304-305: "These results highlight the critical role of Elfn1 in facilitating the temporal integration of sensory inputs through its effects on SST+ interneurons". This claim is also overstated.

      We have now changed the statement to: These results highlight the contribution of Elfn1 to the temporal integration of sensory inputs. (Line 362)

      (22) Line 329: Any reason why not cite Chen et al., Nature 2013?

      We have now added this reference, as also pointed out by reviewer 1.

      (23) Line 341-342: "wS1" and "wS2" instead of S1 and S2 for consistency.

      Thanks, we have now updated the terms.

      Reviewer #2 (Recommendations for the authors): 

      (1) Figure 3D - the SW conditions are labeled but not the MW conditions (two right graphs) - they should be labeled similarly (SSTMW, VIPMW). 

      The two right graphs in Figure 3D represent paired SW vs MW comparisons of the evoked responses for SST and VIP populations, respectively.

      (2) Figure 6 D and E: I think it would be better if the depth measurements were on the y-axis, which is more typical of these types of plots.

      We thank the reviewer for this comment. Although we appreciate this may be the case, we feel that the current presentation may be easier for the reader to navigate, and we have hence kept it. 

      (3) Having an operational definition of septa versus barrel would be useful. As the authors point out, this is a tough distinction in a mouse, and often you read papers that use Barrel Wall versus Barrel Hollow/Center - operationally defining how these areas were distinguished would be helpful. 

      We thank the reviewer for this comment and understand the point made.

      We have now updated the methods section in line 611: 

      DiI marks contained within the vGlut2 staining were defined as barrel recordings, while DiI marks outside vGlut2 staining were septal recordings.

      Reviewer #3 (Recommendations for the authors): 

      To support the manuscript's major claims, the authors should consider the following:

      (1) Validate the septal identity of the neurons studied, either anatomically or functionally at the single-cell level (e.g., via Ca²⁺ imaging with confirmed barrel/septa mapping). 

      We thank the reviewer for this suggestion, but we feel that these extensive experiments are beyond the scope of this study. 

      (2) Provide both anatomical and physiological evidence to assess the possibility of altered cortical development in Elfn1 KO mice, including potential changes in barrel structure or SST⁺ cell distribution. 

      To address the reviewer’s point, we have now added the following to the Discussion: “Although Elfn1 is constitutively knocked out, we find here and in previous studies that barrel structure is preserved (Stachniak et al., 2019, 2023). Further, the distribution of Elfn1-expressing interneurons is not different in KO mice, suggesting minimal developmental disruption (Dolan and Mitchell, 2013). Nonetheless, we acknowledge that subtle circuit changes cannot be ruled out without conditional knockouts.”

      (3) Examine the sensory responses of SST⁺ and VIP⁺ interneurons in deeper cortical layers, particularly layer 4, which is central to the study's main conclusions.

      We thank the reviewer for this suggestion and appreciate the value it would bring to the study. We nevertheless feel that these extensive experiments are beyond the scope of this study and have hence opted not to perform them.

      Minor Comments:

      (1)  The authors used a CLARITY-based passive clearing protocol, which is known to sometimes induce tissue swelling or distortion. This may affect anatomical precision, especially when assigning neurons to narrow domains such as septa versus barrels. Please clarify whether tissue expansion was measured, corrected, or otherwise accounted for during analysis.

      Yes, tissue expansion was accounted for during analysis when assigning laminar position. We excluded brains with severe distortion.

      (2) While the anatomical data are plotted as a function of "depth from the top of layer 4," the manuscript does not specify the precise depth ranges used to define individual cortical layers in the cleared tissue. Given the importance of laminar specificity in projection and cell type analyses, the criteria and boundaries used to delineate each layer should be explicitly stated.

      Thank you for pointing this out. We now include the criteria for delineating each layer in the manuscript. “Given that the depth of Layer 4 (L4) can be reliably measured due to its well-defined barrel boundaries, and that the relative widths of other layers have been previously characterized (El-Boustani et al., 2018), we estimated laminar boundaries proportionally. Specifically, Layer 2/3 was set to approximately 1.3–1.5 times the width of L4, Layer 5a to ~0.5 times, and Layer 5b to a similar width as L4. Assuming uniform tissue expansion across the cortical column, we extrapolated the remaining laminar thicknesses proportionally.”
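
      As an illustration only (this is not the authors' code), the proportional rule quoted above can be sketched as follows. The function names, the use of 1.4 as a midpoint of the stated 1.3–1.5 range for Layer 2/3, and the 200 µm example thickness are assumptions made for this sketch; depths are expressed relative to the top of L4, with negative values lying above it.

```python
def laminar_boundaries(l4_thickness_um, l23_ratio=1.4, l5a_ratio=0.5, l5b_ratio=1.0):
    """Estimate (top, bottom) depths in µm for each layer, relative to the top of L4.

    Ratios follow the proportions quoted in the response; l23_ratio=1.4 is an
    assumed midpoint of the stated 1.3-1.5 range.
    """
    if l4_thickness_um <= 0:
        raise ValueError("L4 thickness must be positive")
    w = float(l4_thickness_um)
    l4_bottom = w
    l5a_bottom = l4_bottom + l5a_ratio * w
    l5b_bottom = l5a_bottom + l5b_ratio * w
    return {
        "L2/3": (-l23_ratio * w, 0.0),  # above L4, hence negative depths
        "L4": (0.0, l4_bottom),
        "L5a": (l4_bottom, l5a_bottom),
        "L5b": (l5a_bottom, l5b_bottom),
    }


def assign_layer(depth_um, boundaries):
    """Return the layer whose [top, bottom) interval contains depth_um, or None."""
    for layer, (top, bottom) in boundaries.items():
        if top <= depth_um < bottom:
            return layer
    return None


# Hypothetical example: an L4 measured as 200 µm thick in the cleared tissue.
bounds = laminar_boundaries(200.0)
layer_of_cell = assign_layer(250.0, bounds)
```

      Because every boundary scales with the measured L4 thickness, uniform tissue expansion changes all boundaries by the same factor and leaves the layer assignment unchanged, which is the assumption stated in the quoted text.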

      (3) In several key comparisons (e.g., SST⁺ vs. VIP⁺ interneurons, or S2-projecting vs. M1-projecting neurons), it is unclear whether the same barrel columns were analyzed across conditions. Given the anatomical and functional heterogeneity across wS1 columns, failing to control for this may introduce significant confounds. We recommend analyzing matched columns across groups or, if not feasible, clearly acknowledging this limitation in the manuscript.

      We thank the reviewer for raising this important point. For the comparison of SST⁺ versus VIP⁺ interneurons, it would in principle have been possible to analyze the same barrel columns across groups. However, because some of the cleared brains did not reach the optimal level of clarity, our choice of columns was limited, and we were not always able to obtain sufficiently clear data from the same columns in both groups. Similarly, for the analysis of S2- versus M1-projecting neurons, variability in the position and spread of retrograde virus injections made it difficult to ensure measurements from identical barrel columns. We have now added a statement in the Discussion to acknowledge this limitation.

      (4) Figure 1C: Clarify what each point in the t-SNE plot represents (e.g., a single trial, a recording channel, or an averaged response). Also, describe the input features used for dimensionality reduction, including time windows and preprocessing steps.

      In response to the reviewer’s comment, we have now added the following in the methods: In summary, each point in the t-SNE plots represents an averaged response across 20 trials for a specific domain (barrel, septa, or neighbor) and genotype (WT or KO), with approximately 14 points per domain derived from the 280 trials in each dataset. The input features are preprocessed by averaging blocks of 20 trials into 1900-dimensional vectors (95ms × 20), which are then reduced to 2D using t-SNE with the specified parameters. This approach effectively highlights the segregation and clustering patterns of neural responses across cortical domains in both WT and KO conditions.
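
      The block-averaging step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name is hypothetical, incomplete trailing blocks are simply dropped, and the subsequent t-SNE embedding is not shown.

```python
def block_average(trials, block_size=20):
    """Average consecutive blocks of `block_size` trials element-wise.

    `trials` is a list of equal-length feature vectors (one per trial).
    Trailing trials that do not fill a complete block are dropped (assumption).
    """
    if not trials:
        return []
    n_features = len(trials[0])
    averaged = []
    for start in range(0, len(trials) - block_size + 1, block_size):
        block = trials[start:start + block_size]
        averaged.append(
            [sum(trial[i] for trial in block) / block_size for i in range(n_features)]
        )
    return averaged


# Toy data: 280 trials with 3 features each (the response describes 1900 features).
toy_trials = [[float(k)] * 3 for k in range(280)]
points = block_average(toy_trials, block_size=20)  # 14 averaged vectors
```

      With 280 trials and blocks of 20, this yields the 14 points per domain mentioned in the response; each averaged vector would then be one input row for the dimensionality reduction.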

      (5) Figures 1D, E (left panels): The y-axes lack unit labeling and scale bars. Please indicate whether values are in spikes/sec, spikes/bin, or normalized units.

      We have now clarified this. 

      (6) Figures 1D, E (right panels): The color bars lack units. Specify whether the values represent raw firing rates, z-scores, or other normalized measures. Replace the vague term "Matrix representation" with a clearer label such as "Pulse-aligned firing heatmap."

      Thank you, we have now done it.

      (7) Figure 1E (bottom panel): There appears to be no legend referring to these panels. Please define labels such as "B" and "S." 

      Thank you, we have now done it.

      (8) Figure 1E legend: If it duplicates the legend from Figure 1D, this should be made explicit or integrated accordingly. 

      We have changed the structure of this figure.

      (9) Figure 1F: Define "AUC" and explain how it was computed (e.g., area under the firing rate curve over 0-50 ms). Indicate whether the plotted values represent percentages and, if so, label the y-axis accordingly. If normalization was applied, describe the procedure. Include sample sizes (n) and specify what each data point represents (e.g., animal, recording site). 

      The following paragraph has been added in the methods section:

      The Area Under the Curve (AUC) was computed as the integral of the smoothed firing rate (spikes per millisecond) over a 50ms window following each whisker stimulation pulse, using trapezoidal integration. Firing rate data for layer 4 barrel and septal regions in wild-type (WT) and knockout (KO) mice were smoothed with a 3-point moving average and averaged across blocks of 20 trials. Plotted values represent the percentage ratio of multi-whisker (MW) to single whisker (SW) AUC with error bars showing the standard error of the mean. Each data point reflects the mean AUC ratio for a stimulation pulse across approximately 11 blocks (220 trials total). The y-axis indicates percentages.
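
      A minimal sketch of the computation described in this paragraph is given below, for illustration only. The function names, the 1 ms bin width, and the handling of the first and last samples in the moving average are assumptions not stated in the response.

```python
def moving_average_3(values):
    """3-point centered moving average; edge samples use the neighbours available."""
    n = len(values)
    smoothed = []
    for i in range(n):
        window = values[max(0, i - 1):min(n, i + 2)]
        smoothed.append(sum(window) / len(window))
    return smoothed


def trapezoid_auc(rate, dt_ms=1.0):
    """Trapezoidal integral of a firing-rate trace sampled every dt_ms."""
    return sum((a + b) / 2.0 * dt_ms for a, b in zip(rate, rate[1:]))


def auc_after_pulse(rate, pulse_onset_idx, window_ms=50, dt_ms=1.0):
    """Smooth the trace, then integrate over window_ms following the pulse onset."""
    smoothed = moving_average_3(rate)
    n_samples = int(round(window_ms / dt_ms)) + 1  # samples spanning the window
    segment = smoothed[pulse_onset_idx:pulse_onset_idx + n_samples]
    return trapezoid_auc(segment, dt_ms)


def mw_to_sw_percent(auc_mw, auc_sw):
    """Multi-whisker AUC expressed as a percentage of single-whisker AUC."""
    if auc_sw == 0:
        raise ValueError("single-whisker AUC is zero; ratio undefined")
    return 100.0 * auc_mw / auc_sw


# Toy traces: constant rates in spikes/ms, with a pulse at sample 10.
sw_trace = [2.0] * 100
mw_trace = [1.0] * 100
ratio = mw_to_sw_percent(auc_after_pulse(mw_trace, 10), auc_after_pulse(sw_trace, 10))
```

      In this toy case the multi-whisker response is half the single-whisker response, so the ratio is 50%; values below 100% would indicate weaker multi-whisker responses under this definition.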

      (10) Figure 3C: Add units to the vertical axis.

      We have added them.

      (11) Figure 3D: Specify what each line represents (e.g., average of n cells, individual responses?). 

      Each line represents an average response of a neuron.  

      (12) Figure 4C legend: "Same" with what? No legend refers to the bottom panels - please revise to clarify.

      Thank you. We have now changed the figure structure and legends and fixed the missing information issue.

      (13) Supplementary Figure 1B: Indicate the physical length of the scale bar in micrometers. 

      This has been fixed. The scale bar is 250 µm.

      (14) Indicate the catalog number or product name of the 8×8 silicon probe used for recordings.

      We have added this information. It is the A8x8-Edge-5mm-100-200-177-A64 probe.

      References

      (1) Beierlein, M., Gibson, J. R. & Connors, B. W. (2003). Two dynamically distinct inhibitory networks in layer 4 of the neocortex. J. Neurophysiol. 90, 2987–3000.

      (2) Burkhalter, A., D’Souza, R. D. & Ji, W. (2023). Integration of feedforward and feedback information streams in the modular architecture of mouse visual cortex. Annu. Rev. Neurosci. 46, 259–280.

      (3) Chen, J. L., Margolis, D. J., Stankov, A., Sumanovski, L. T., Schneider, B. L. & Helmchen, F. (2015). Pathway-specific reorganization of projection neurons in somatosensory cortex during learning. Nat. Neurosci. 18, 1101–1108.

      (4) Connor, J. R. & Peters, A. (1984). Vasoactive intestinal polypeptide-immunoreactive neurons in rat visual cortex. Neuroscience 12, 1027–1044.

      (5) Cruikshank, S. J., Lewis, T. J. & Connors, B. W. (2007). Synaptic basis for intense thalamocortical activation of feedforward inhibitory cells in neocortex. Nat. Neurosci. 10, 462–468.

      (6) Dolan, J. & Mitchell, K. J. (2013). Mutation of Elfn1 in mice causes seizures and hyperactivity. PLoS One 8, e80491.

      (7) Gibson, J. R., Beierlein, M. & Connors, B. W. (1999). Two networks of electrically coupled inhibitory neurons in neocortex. Nature 402, 75–79.

      (8) Ji, W., Gămănuţ, R., Bista, P., D’Souza, R. D., Wang, Q. & Burkhalter, A. (2015). Modularity in the organization of mouse primary visual cortex. Neuron 87, 632–643.

      (9) Martin-Cortecero, J. & Nuñez, A. (2014). Tactile response adaptation to whisker stimulation in the lemniscal somatosensory pathway of rats. Brain Res. 1591, 27–37.

      (10) Mégevand, P., Troncoso, E., Quairiaux, C., Muller, D., Michel, C. M. & Kiss, J. Z. (2009). Long-term plasticity in mouse sensorimotor circuits after rhythmic whisker stimulation. J. Neurosci. 29, 5326–5335.

      (11) Meier, A. M., Wang, Q., Ji, W., Ganachaud, J. & Burkhalter, A. (2021). Modular network between postrhinal visual cortex, amygdala, and entorhinal cortex. J. Neurosci. 41, 4809–4825.

      (12) Meier, A. M., D’Souza, R. D., Ji, W., Han, E. B. & Burkhalter, A. (2025). Interdigitating modules for visual processing during locomotion and rest in mouse V1. bioRxiv 2025.02.21.639505.

      (13) Scala, F., Kobak, D., Shan, S., Bernaerts, Y., Laturnus, S., Cadwell, C. R., Hartmanis, L., Froudarakis, E., Castro, J. R., Tan, Z. H., et al. (2019). Layer 4 of mouse neocortex differs in cell types and circuit organization between sensory areas. Nat. Commun. 10, 4174.

      (14) Stachniak, T. J., Sylwestrak, E. L., Scheiffele, P., Hall, B. J. & Ghosh, A. (2019). Elfn1-induced constitutive activation of mGluR7 determines frequency-dependent recruitment of somatostatin interneurons. J. Neurosci. 39, 4461–4475.

      (15) Stachniak, T. J., Kastli, R., Hanley, O., Argunsah, A. Ö., van der Valk, E. G. T., Kanatouris, G. & Karayannis, T. (2021). Postmitotic Prox1 expression controls the final specification of cortical VIP interneuron subtypes. J. Neurosci. 41, 8150–8166.

      (16) Stachniak, T. J., Argunsah, A. Ö., Yang, J. W., Cai, L. & Karayannis, T. (2023). Presynaptic kainate receptors onto somatostatin interneurons are recruited by activity throughout development and contribute to cortical sensory adaptation. J. Neurosci. 43, 7101–7118.

      (17) Sun, Q.-Q., Huguenard, J. R. & Prince, D. A. (2006). Barrel cortex microcircuits: Thalamocortical feedforward inhibition in spiny stellate cells is mediated by a small number of fast-spiking interneurons. J. Neurosci. 26, 1219–1230.

      (18) Sylwestrak, E. L. & Ghosh, A. (2012). Elfn1 regulates target-specific release probability at CA1-interneuron synapses. Science 338, 536–540.

      (19) Tan, Z., Hu, H., Huang, Z. J. & Agmon, A. (2008). Robust but delayed thalamocortical activation of dendritic-targeting inhibitory interneurons. Proc. Natl. Acad. Sci. USA 105, 2187–2192.

      (20) Tomioka, N. H., Yasuda, H., Miyamoto, H., Hatayama, M., Morimura, N., Matsumoto, Y., Suzuki, T., Odagawa, M., Odaka, Y. S., Iwayama, Y., et al. (2014). Elfn1 recruits presynaptic mGluR7 in trans and its loss results in seizures. Nat. Commun. 5, 4501.

      (21) Yamashita, T., Vavladeli, A., Pala, A., Galan, K., Crochet, S., Petersen, S. S. & Petersen, C. C. (2018). Diverse long-range axonal projections of excitatory layer 2/3 neurons in mouse barrel cortex. Front. Neuroanat. 12, 33.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This important manuscript provides insights into the competition between Splicing Factor 1 (SF1) and Quaking (QKI) for binding at the ACUAA branch point sequence in a model intron, regulating exon inclusion. The study employs rigorous transcriptomic, proteomic, and reporter assays, with both mammalian cell culture and yeast models. Nevertheless, while the data are convincing, broadening the analysis to additional exons and narrowing the manuscript's title to better align with the experimental scope would strengthen the work.

      Public Reviews:

      Reviewer #1 (Public review):

      In this manuscript, the authors aimed to show that SF1 and QKI compete for the intron branch point sequence ACUAA and provide evidence that QKI represses inclusion when bound to it.

      Major strengths of this manuscript include:

      (1) Identification of the ACUAA-like motif in exons regulated by QKI and SF1.

      (2) The use of the splicing reporter and mutant analysis to show that upstream and downstream ACUAAC elements in intron 10 of RAI14 are required for repressing splicing.

      (3) The use of proteomics to identify proteins in C2C12 nuclear extract that bind to the wild-type and mutant sequences.

      (4) The yeast studies showing lethality when ectopic Qki5 expression was induced, due to increased mis-splicing of transcripts that contain the ACUAA element.

      The authors conclusively show that the ACUAA sequence is bound by QKI and provide strong evidence that this leads to differences in exon inclusion and exclusion. In animal cells, and especially in humans, branchpoint sequences are degenerate but seem to be recognized by specific splicing factors. Although a subset of splicing factors shows tissue-specific expression patterns, most don't, suggesting that yet-to-be-identified mechanisms regulate splicing. This work suggests that an alternate mechanism could be related to the binding affinity of specific RNA binding factors for branchpoint sequences coupled with the level of these different splicing factors in a given cell.

      We thank the reviewer for the positive comments.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, Pereira de Castro and coworkers are studying potential competition between a more standard splicing factor SF1, and an alternative splicing factor called QK1. This is interesting because they bind to overlapping sequence motifs and could potentially have opposing effects on promoting the splicing reaction. To test this idea, the authors KD either SF1 or QK1 in mammalian cells and uncover several exons whose splicing regulation follows the predicted pattern of being promoted for splicing by SF1 and repressed by QK1. Importantly, these have introns enriched in SF1 and QK1 motifs. The authors then focus on one exon in particular with two tandem motifs to study the mechanism of this in greater detail and their results confirm the competition model. Mass spec analysis largely agrees with their proposal; however, it is complicated by the apparently quick transition of SF1-bound complexes to later splicing intermediates. An inspired experiment in yeast shows how QK1 competition could potentially have a detrimental impact on splicing in an orthogonal system. Overall, these results show how splicing regulation can be achieved by competition between a "core" and alternative splicing factor and provide additional insight into the complex process of branch site recognition. The manuscript is exceptionally clear and the figures and data are very logically presented. The work will be valuable to those in the splicing field who are interested in both mechanism and bioinformatics approaches to deconvolve any apparent "splicing code" being used by cells to regulate gene expression. Criticisms are minor and the most important of them stem from overemphasis on parts of the manuscript on the evolutionary angle when evolution itself wasn't analyzed per se.

      We thank the reviewer for the positive comments and very clear and fair critical points.

      Strengths:

      (1) The main discovery of the manuscript involving evidence for SF1/QK1 competition is quite interesting and important for this field. This evidence has been missing and may change how people think about branch site recognition.

      (2) The experiments and the rationale behind them are exceptionally clearly and logically presented. This was wonderful!

      Thank you so much. We felt the overall flow of the paper and data make for a nice “story” that conveys a relatively easy-to-understand explanation for a complex subject.

      (3) The experiments are carried out to a high standard and well-designed controls are included.

      (4) The extrapolation of the result to yeast in order to show the potentially devastating consequences of the QK1 competition was very exciting and creative.

      We agree this is a very exciting result and finding! Thanks.

      Weaknesses:

      Overall the weaknesses are relatively minor and involve cases where clarification is necessary, some additional analysis could bolster the arguments, and suggestions for focusing the manuscript on its strengths.

      (1) The title (Ancient...evolutionary outcomes), abstract, and some parts of the discussion focus heavily on the evolutionary implications of this work. However, evolutionary analysis was not performed in these studies (e.g., when did QK1 and SF1 proteins arise and/or diverge? How does this line up with branch site motifs and evolution of U2? Any insight from recent work from Scott Roy et al?). I think this aspect either needs to be bolstered with experimental work/data or this should be tamped down in the manuscript. I suggest highlighting the idea expressed in the sentence "A nuanced implication of this model is that loss-of-function...". To me, this is better supported by the data and potentially by some analysis of mutations associated with human disease.

      We have revised the title and dampened the evolutionary aspects of the previous version of the manuscript.

      (2) One paper that I didn't see cited was that by Tanackovic and Kramer (Mol Biol Cell 2005). This paper is relevant because they KD SF1 and found it nonessential for splicing in vivo. Do their results have implications for those here? How do the results of the KD compare? Could QK1 competition have influenced their findings (or does their work influence the "nuanced implication" model referenced above?)?

      This is an interesting point, and thank you for the suggestion. We have now included a brief description of this study in the Introduction of the revised manuscript and do note that the authors measured intron retention of a beta globin reporter and SF3A1, SF3A2, and SF3A3 during SF1 knockdown, but did not detect elevated unspliced RNA in these targets.

      (3) Can the authors please provide a citation for the statement "degeneracy is observed to a higher degree in organisms with more alternative splicing"? Does recent evolutionary analysis support this?

      We have removed the statement, as it did not add much to the content and I am not sure I can state the concept I was attempting to convey in a simple manner with few citations.

      (4) For the data in Figure 3, I was left wondering if NMD was confounding this analysis. Can the authors respond to this and address this concern directly?

      We have not measured whether the reporters used in Figure 3 produce protein(s). Presumably, though, all spliced reporter RNA would be degraded equally (the included/skipped isoforms’ “reading frames” are not altered from one another). This would not be the case for unspliced nuclear reporter RNA, however. Given this difference, and that our analysis cannot resolve the subcellular localization of the different reporter species, we have removed the measurement of and subsequent results describing unspliced reporter RNA from Figure 3.

      (5) To me, the idea that an engaged U2 snRNP was pulled down in Figure 4F would be stronger if the snRNA was detected. Was that able to be observed by northern or primer extension? Would SF1 be enriched if the U2 snRNA was degraded by RNaseH in the NE?

      We did not measure any co-associating RNAs in this experimental approach, but agree that this approach would strengthen the evidence for it.

      (6) I'm wondering how additive the effects of QK1 and SF1 are... In Figure 2, if QK1 and SF1 are both knocked down, is the splicing of exon 11 restored to "wt" levels?

      This is an interesting question that we were unfortunately unable to address experimentally here.

      (7) The first discussion section has two paragraphs that begin "How does competition between SF1..." and "Relatively little is known about how...". I found the discussion and speculation about localization, paraspeckles, and lncRNAs interesting but a bit detracting from the strengths of the manuscript. I would suggest shortening these two paragraphs into a single one.

      We have revised the Discussion.

      Reviewer #3 (Public review):

      Summary:

      In this manuscript, the authors were trying to establish whether competition between the RNA-binding proteins SF1 and QKI controlled splicing outcomes. These two proteins have similar binding sites and protein sequences, but SF1 lacks a dimerization motif and seems to bind a single version of the binding sequence. Importantly, these binding sequences correspond to branchpoint consensus sequences, with SF1 binding leading to productive splicing, but QKI binding leading instead to association with paraspeckle proteins. They show that in human cells SF1 generally activates exons and QKI represses, and a large group of the jointly regulated exons (43% of joint targets) are reciprocally controlled by SF1 and QKI. They focus on one of these exons RAI14 that shows this reciprocal pattern of regulation, and has 2 repeats of the binding site that make it a candidate for joint regulation, and confirm regulation within a minigene context. The authors used the assembly of proteins within nuclear extracts to explain the effect of QKI versus SF1 binding. Finally, the authors show that the expression of QKI is lethal in yeast, and causes splicing defects.

      How this fits in the field. This study is interesting and provides a conceptual advance by providing a general rule on how SF1 and QKI interact in relation to binding sites, and the relative molecular fates followed, so is very useful. Most of the analysis seems to focus on one example, although the molecular analysis and global work significantly add to the picture from the previously published paper about NUMB joint regulation by QKI and SF1 (Zong et al, cited in text as reference 50, that looked at SF1 and QKI binding in relation to a duplicated binding site/branchpoint sequence in NUMB).

      Thank you for the encouraging remarks.

      Strengths:

      The data presented are strong and clear. The ideas discussed in this paper are of wide interest, and present a simple model where two binding sites generate a potentially repressive QKI response, whereas exons that have a single upstream sequence are just regulated by SF1. The assembly of splicing complexes on RNAs derived from RAI14 in nuclear extracts, followed by mass spec gave interesting mechanistic insight into what was occurring as a result of QKI versus SF1 binding.

      Weaknesses:

      I did not think the title best summarises the take-home message and could be perhaps a bit more modest. Although the authors investigated splicing patterns in yeast and human cells, yeast do not have QKI so there is no ancient competition in that case, and the study did not really investigate physiological or evolutionary outcomes in splicing, although it provides interesting speculation on them. Also as I understood it, the important issue was less conserved branchpoints in higher eukaryotes enabling alternative splicing, rather than competition for the conserved branchpoint sequence. So despite the data being strong and properly analysed and discussed in the paper, could the authors think whether they fit best with the take-home message provided in the title? Just as a suggestion (I am sure the authors can do a better job), maybe "molecular competition between variant branchpoint sequences predict physiological and evolutionary outcomes in splicing"?

      Thank you for this point (Reviewer 2 had a similar comment) and the suggestion. We have revised the title.

      Although the authors do provide some global data, most of the detailed analysis is of RAI14. It would have been useful to examine members of the other quadrants in Figure 1C as well for potential binding sites to give a reason why these are not co-regulated in the same way as RAI14. How many of the RAI14 quadrants had single/double sites (the motif analysis seemed to pull out just one), and could one of the non-reciprocally regulated exons be moved into a different quadrant by addition or subtraction of a binding site or changing the branchpoint (using a minigene approach for example)?

      This is an interesting point that we have considered. Our intent with the focus on RAI14 was to use a naturally occurring intron bps with evidence of strong QKI binding that did not require a high degree of sequence manipulation or engineering.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Most of my recommendations are really centered on the figures. In their current state, they detract from the data shown and could be improved: I recommend the authors use a uniform font. For example, Figure 1E and F have at least three different fonts of varying sizes, making them very messy. In Figure 1C, the authors could bold the Rai14 ex11 or simply indicate that the blue is this exon in the legend, thus removing the text from this very busy graph. In Figure 4F, I would recommend having all the labels the same size and putting those genes of interest like Sf3a1 in bold. This could also be done in Figure 4E.

      Thank you for the suggestion; we have edited these (FYI the fonts in Figs 1E and 1F were from the rMAPS default output, but I agree, they give a sloppy appearance).

      (2) In Figures 4D and 4G, is there QKI binding to the downstream deletion mutant after 30 minutes? Also, in Figure 4G, are these all from the same blot? The band sizes seem to be very different between lanes. If these were not on the same blot, the original gels should be submitted.

      A small amount of Qki appears to be binding after 30 min. All lanes/blots are from the same gels/membranes; see new Supplemental Figure 4 for the original (uncropped) images of the blots.

      (3) The authors should indicate the source and concentration of the antibodies used for their WB. They should also indicate the primers used for RT-PCRs.

      We have revised the methods to include the antibody information and have uploaded a supplemental table 8 with all oligonucleotide sequences used (which I (Sam Fagg) neglected to do initially, so that’s my bad).

      Reviewer #2 (Recommendations for the authors):

      (1) This may come down to the author's preference but branch point and branch site are frequently two words, not a single compound word (branch point vs. branchpoint). In addition, the authors may want to use branchsite with the abbreviation BS more frequently since they often don't describe the specific point of branching, and bp and bps could be confused for the more frequent abbreviations for base pair(s).

      Good suggestion; we have edited the text accordingly.

      (2) In general the addition of page numbers and line numbers to the manuscript would greatly aid reviewers!

      Point taken…

      (3) Introduction; "...under normal growth conditions they are efficiently spliced". I would say MOST introns in yeast are efficiently spliced. This is definitely not universal.

      Text edited to indicate that most are efficiently spliced.

      (4) Introduction; " recognition of the bps by SF1 (mammals) (20)". The choice of reference 20 is an odd one here. I think the Robin Reed and Michael Rosbash paper was the first to show SF1 was the human homolog of BBP.

      Got it, thanks (added #14 here and kept #20 also since it shows the structure of SF1 in complex with a UACUAAC bps.)

      (5) Results; "QK1 and SF1 co-regulate.."; it may be useful for the reader if you could explain in more detail why exon inclusion and intron retention are expected outcomes for QK1 knockdown and vice versa for SF1. The exon inclusion here is more obvious than the intron retention phenotype. (In other words, if more exons are included shouldn't it follow that more introns are removed?)

      We explain the expected results for exon inclusion in the Introduction and this paragraph of the Results. Although we have observed more intron retention under QKI loss-of-function approaches before, I am uncertain where the reviewer sees that we indicate any expected result for intron retention from either QKI or SF1 knockdown. I believe the statement you refer to might be on line 162 and starts with: “Consistent with potentially opposing functions in splicing…” ?

      Also, I agree that if SF1 is a “splicing activator,” one might expect more IR in its absence (but this is not the case; there is, in fact, less), but nonetheless, the opposite outcome is observed with QKI knockdown (more IR). It is unclear why this is the case, and we did not investigate it.

      (6) Results; "QK1 and SF1 co-regulate.."; "Thus the most highly represented set.." To me, the most highly represented set is those which are not both QK1-repressed and SF1-activated. Does this indicate that other factors are involved at most sites than simple competition between these two?

      We have revised the sentence in question to include the text “by quadrant” in order to convey our meaning more precisely.

      (7) Throughout the manuscript, 5 apostrophes and 3 apostrophes are used instead of 5 prime symbols and 3 prime symbols.

      Thank you for pointing that out. We have fixed each instance of this.

      (8) Sometimes SF1 is written as Sf1. (also Tatsf1)

      This was a mouse/human gene/protein nomenclature error that we have fixed; thank you for pointing this out.

      (9) You may want to make sure that figures are labeled consistently with the manuscript text. In Figure 1B, it is RI rather than IR. In Figure 4 it is myoblast NE rather than C2C12 nuclear extract.

      We have fixed these, checked for other examples, and where relevant, edited those too.

      (10) I think Figure 1A could be improved by also including a depiction of the domain arrangements of SF1 and QK1.

      Done.

      (11) I was a bit confused with all the lines in Figure 1E and 1F. What is the difference between the log (pVal) and upregulated plots? Can these figures be simplified or explained more thoroughly?

      Based on this comment and one from Reviewer 1, we have slightly revised the wording (and font) on the output, which hopefully clarifies the plots. These are motif enrichment plots generated by rMAPS (Refs 61 and 62) analysis of rMATS (Ref 60) data for exons more included (depicted by the red lines) or more skipped (depicted by the blue lines) compared to control versus a “background” set of exons that are detectable but unchanged. The -log<sub>10</sub>(P-value) (dotted line) indicates the significance of exons more included in shRNA treatment vs control shRNA (previously read “upregulated”) compared to background exons that are detectable but unchanged; the solid lines indicate the motif score; these are described in the references indicated.

      (12) Figure 1B, it is a bit hard to conclude that there is more AltEx or "RI/IR" in one sample vs. the other from these plots since the points overlay one another. Can you include numbers here?

      Added (and deleted Suppl Fig S1, which was simply a chart showing the numbers).

      (13) How was PSI calculated in Figure 2A?

      VAST-tools (we state this in the legend in the revised version).

      You may want to include rel protein (or the lower limit of detection) for Figure 2B to be consistent with 2C. Why is KD of SF1 so poor and variable between 2C and 2D?

      We have not investigated this, but these blots show an optimized result that we were able to obtain for the knockdown in each cell type. It may be that HEK293 cells (Fig 2B) have a stronger requirement for SF1 than C2C12 cells…? I would argue that it is not necessarily “poor” in Fig 2C, as we observe ~70% depletion of the protein.

      Why are two bands present in the gel?

      Two to three isoforms of SF1 are present in most cell types.

      A good (or bad, really) example of an SF1 western blot (and knockdown of ~35% in K562 or ~45% in HepG2) can also be seen on the ENCODE project website, for reference:

      https://www.encodeproject.org/documents/6001a414-b096-4073-94ff-3af165617eb5/@@download/attachment/SF1_BGKLV28-49.pdf

      By comparison, I think ours are much more cosmetically pleasing, and our knockdown (especially in C2C12) is much more efficient.

      (14) Figure 3, The asterisk refers to a cryptic product. Can the uaAcuuuCAG be used as a branch point? Presumably the natural 3' SS is now too close so this would result in activation of a downstream 3'SS?

      We did not pursue determining the identity of this minor and likely artefactual product, but we (and others) have observed a similar phenomenon when using splicing reporter-based mutational approaches.

      (15) For the methods. The "RNA extraction, RT -PCR,..." subheading needs to be on its own line. Please add (w/v) or (v/v) to percentages where appropriate. Please convert ug to the symbol for "micro".

      Thank you, we have made these changes.

      (16) In Figure 4B, the text here and legend are microscopic. Even with reading glasses, I couldn't make anything out!

      We have increased the font sizes for the text and scale bar…when referring to “legend” does the reviewer mean the scale bar?

      (17) As a potential discussion item, it is worth noting that SF1 could also repress splicing if it could either not engage with U2AF or be properly displaced by U2 snRNP so the snRNA could pair. I was wondering if QK1 could similarly be activating if it could engage with U2AF. I'm unsure if this could be tested by domain swaps (and is beyond the scope of this paper). It just may be worth speculating about.

      Good point and suggestion…we are looking into this.

      Reviewer #3 (Recommendations for the authors):

      (1) Is the reference in the text to Figure 5F correct for actin splicing (this is just before the discussion)?

      I see references several lines up from this, but I do not see a reference just before the discussion…?

      (2) I was not sure why the minigene experiments showed such high levels of intron retention that seemed to be impacted also by deletion of the branchpoint sequences, and suggest that the two branchpoints are not equal in strength.

      Neither were we, but Reviewer 2 has suggested that degradation of the spliced products could be rapid (NMD substrates) which could complicate the interpretation of what appears to be higher levels of intron retention. Given the possibility that this could be a non-physiological artefact, we have removed the measurement of unspliced reporter and now only show the spliced products (equally subject to degradation) and report their percent inclusion.

    1. Author response:

      The following is the authors’ response to the current reviews.

      We thank the editors of eLife and the reviewers for their thorough evaluation of our study. As regards the final comments of reviewer 1, please note that all experimental replicates were first analyzed separately and were then pooled, since the observed changes were comparable between experiments. This means that statistical analyses were done on pooled biological replicates.


      The following is the authors’ response to the original reviews.

      General Statements

      We thank the reviewers for their thorough and constructive evaluation of our work. We have revised the manuscript carefully and addressed all the criticisms raised, in particular the issues mentioned by several of the reviewers (see point-by-point response below). We have also added a number of explanations in the text for the sake of clarity, while trying to keep the manuscript as concise as possible.

      In our view, the novelty of our research is two-fold. From a neurobiological point of view, we provide conclusive evidence for the existence of glycine receptors (GlyRs) at inhibitory synapses in various brain regions including the hippocampus, dentate gyrus and sub-regions of the striatum. This solves several open questions and has fundamental implications for our understanding of the organisation and function of inhibitory synapses in the telencephalon. Secondly, our study makes use of the unique sensitivity of single molecule localisation microscopy (SMLM) to identify low protein copy numbers. This is a new way to think about SMLM as it goes beyond a mere structural characterisation and towards a quantitative assessment of synaptic protein assemblies.

      Point-by-point description of the revisions

      Reviewer #1 (Evidence, reproducibility and clarity): 

      In this manuscript, the authors investigate the nanoscopic distribution of glycine receptor subunits in the hippocampus, dorsal striatum, and ventral striatum of the mouse brain using single-molecule localization microscopy (SMLM). They demonstrate that only a small number of glycine receptors are localized at hippocampal inhibitory synapses. Using dual-color SMLM, they further show that clusters of glycine receptors are predominantly localized within gephyrin-positive synapses. A comparison between the dorsal and ventral striatum reveals that the ventral striatum contains approximately eight times more glycine receptors and this finding is consistent with electrophysiological data on postsynaptic inhibitory currents. Finally, using cultured hippocampal neurons, they examine the differential synaptic localization of glycine receptor subunits (α1, α2, and β). This study is significant as it provides insights into the nanoscopic localization patterns of glycine receptors in brain regions where this protein is expressed at low levels. Additionally, the study demonstrates the different localization patterns of GlyR in distinct striatal regions and its physiological relevance using SMLM and electrophysiological experiments. However, several concerns should be addressed.

      The following are specific comments: 

      (1) Colocalization analysis in Figure 1A. The colocalization between Sylite and mEos-GlyRβ appears to be quite low. It is essential to assess whether the observed colocalization is not due to random overlap. The authors should consider quantifying colocalization using statistical methods, such as a pixel shift analysis, to determine whether colocalization frequencies remain similar after artificially displacing one of the channels. 

      Following the suggestion of reviewer 1, we re-analysed CA3 images of Glrb<sup>eos/eos</sup> hippocampal slices by applying a pixel-shift type of control, in which the Sylite channel (in far red) was horizontally flipped relative to the mEos4b-GlyRβ channel (in green, see Methods). As expected, the number of mEos4b-GlyRβ detections per gephyrin cluster was markedly reduced compared to the original analysis (revised Fig. 1B), confirming that the synaptic mEos4b detections exceed chance levels (see page 5). 
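
      For illustration, the logic of such a flip control can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the analysis code used in the study: detections and gephyrin cluster centres are assumed to be (x, y) coordinates in pixels, a detection is counted as synaptic when it lies within a fixed radius of a cluster centre, and the flip mirrors the Sylite channel about the vertical midline of the image. The function names, the radius and the toy coordinates are hypothetical.

```python
import math

def detections_per_cluster(detections, clusters, radius):
    """Mean number of detections lying within `radius` of each cluster centre."""
    if not clusters:
        return 0.0
    total = 0
    for cx, cy in clusters:
        total += sum(1 for x, y in detections
                     if math.hypot(x - cx, y - cy) <= radius)
    return total / len(clusters)

def flip_horizontally(points, image_width):
    """Mirror points about the vertical midline of an image of the given width."""
    return [(image_width - x, y) for x, y in points]

# Toy data (hypothetical): three clusters, with detections concentrated on them
# plus one detection that is not associated with any cluster.
image_width = 100.0
clusters = [(10.0, 20.0), (30.0, 70.0), (80.0, 40.0)]
detections = [(10.5, 20.2), (9.8, 19.9), (30.1, 70.3),
              (80.2, 39.7), (79.9, 40.4), (55.0, 5.0)]

# Observed association versus the association left after flipping one channel.
observed = detections_per_cluster(detections, clusters, radius=2.0)
flipped_clusters = flip_horizontally(clusters, image_width)
control = detections_per_cluster(detections, flipped_clusters, radius=2.0)
print(f"observed: {observed:.2f}, flipped control: {control:.2f}")
```

      If the detections per cluster drop sharply in the flipped configuration, as in this toy example, the original association is unlikely to reflect random overlap.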

      (2) Inconsistency between Figure 3A and 3B. While Figure 3B indicates an ~8-fold difference in the number of mEos4b-GlyRβ detections per synapse between the dorsal and ventral striatum, Figure 3A does not appear to show a pronounced difference in the localization of mEos4b-GlyRβ on Sylite puncta between these two regions. If the images presented in Figure 3A are not representative, the authors should consider replacing them with more representative examples or providing expanded images with multiple representative examples. Alternatively, if this inconsistency can be explained by differences in spot density within clusters, the authors should explain that.

      The pointillist images in Fig. 3A are essentially binary (red-black). Therefore, the density of detections at synapses cannot be easily judged by eye. For clarity, the original images in Fig. 3A have been replaced with two other examples that better reflect the different detection numbers in the dorsal and ventral striatum. 

      (3) Quantification in Figure 5. It is recommended that the authors provide quantitative data on cluster formation and colocalization with Sylite puncta in Figure 5 to support their qualitative observations. 

      This is an important point that was also raised by the other reviewers. We have performed additional experiments to increase the data volume for analysis. For quantification, we used two approaches. First, we counted the percentage of infected cells in which synaptic localisation of the recombinant receptor subunit was observed (Fig. 5C). We found that mEos4b-GlyRa1 consistently localises at synapses, indicating that all cells express endogenous GlyRb. When neurons were infected with mEos4b-GlyRb, fewer cells had synaptic clusters, meaning that indeed, GlyR alpha subunits are the limiting factor for synaptic targeting. In cultures infected with mEos4b-GlyRa2, only very few neurons displayed synaptic localisation (as judged by epifluorescence imaging). We think this shows that GlyRa2 is less capable of forming heteromeric complexes than GlyRa1, in line with our previous interpretation (see pp. 9-10, 13). 

      Secondly, we quantified the total intensity of each subunit at gephyrin-positive domains, both in infected neurons as well as non-infected control cultures (Fig. 5D). We observed that mEos4b-GlyRa1 intensity at gephyrin puncta was higher than that of the other subunits, again pointing to efficient synaptic targeting of GlyRa1. Gephyrin cluster intensities (Sylite labelling) were not significantly different in GlyRb and GlyRa2 expressing neurons compared to the uninfected control, indicating that the lentiviral expression of recombinant subunits does not fundamentally alter the size of mixed inhibitory synapses in hippocampal neurons. Interestingly, gephyrin levels were slightly higher in hippocampal neurons expressing mEos4b-GlyRa1. In our view, this comes from an enhanced expression and synaptic targeting of mEos4b-GlyRa1 heteromers with endogenous GlyRb, pointing to a structural role of GlyRa1/b in hippocampal synapses (pp. 10, 13).

      The new data and analyses have been described and illustrated in the relevant sections of the manuscript.

      (4) Potential for pseudo replication. It's not clear whether they're performing stats tests across biological replica, images, or even synapses. They often quote mean +/- SEM with n = 1000s, and so does that mean they're doing tests on those 1000s? Need to clarify. 

      All experiments were repeated at least twice to ensure reproducibility (N independent experiments). Statistical tests were performed on pooled data across the biological replicates; n denotes the number of data points used for testing (e.g., number of synaptic clusters, detections, cells, as specified in each case). We have systematically given these numbers in the revised manuscript (n, N, and other experimental parameters such as the number of animals used, coverslips, images or cells). Data are generally given as mean +/- SEM or as mean +/- SD as indicated.
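
      The distinction between n, N, SD and SEM can be made concrete with a small Python sketch. The values below are hypothetical and are not data from the study; the helper name `mean_sd_sem` is likewise illustrative. The sketch analyses each replicate separately, then pools the data points, and shows that the SEM depends on n whereas the SD does not.

```python
import math
import statistics

def mean_sd_sem(values):
    """Return (mean, sample SD, SEM) for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)      # sample SD, n - 1 in the denominator
    sem = sd / math.sqrt(n)            # shrinks as n grows; SD does not
    return mean, sd, sem

# Hypothetical detections per synapse from N = 2 independent experiments.
experiment_1 = [2, 4, 3, 5, 6, 4]
experiment_2 = [3, 5, 4, 4, 7, 1]

# Analyse each replicate separately first, then pool (n = 12 data points).
per_experiment = [mean_sd_sem(e) for e in (experiment_1, experiment_2)]
pooled = experiment_1 + experiment_2
mean, sd, sem = mean_sd_sem(pooled)
print(f"N = 2, n = {len(pooled)}, mean = {mean:.2f}, SD = {sd:.2f}, SEM = {sem:.2f}")
```

      Because the SEM scales with 1/sqrt(n), reporting n and N alongside each value lets the reader judge how much of the apparent precision comes from the number of data points rather than from the number of independent experiments.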

      (5) Does mEoS affect expression levels or function of the protein? Can't see any experiments done to confirm this. Could suggest WB on homogenate, or mass spec? 

      The Glrb<sup>eos/eos</sup> knock-in mouse line has been characterised previously and does not display any ultrastructural or functional deficits at inhibitory synapses (Maynard et al. 2021 eLife). GlyRβ expression and glycine-evoked responses were not significantly different to those of the wildtype. The synaptic localisation of mEos4b-GlyRb in KI animals demonstrates correct assembly of heteromeric GlyRs and synaptic targeting. Accordingly, the animals do not display any obvious phenotype. We have clarified this in the manuscript (p. 4). In the case of cultured neurons, long-term expression of fluorescent receptor subunits with lentivirus has proven ideal to achieve efficient synaptic targeting. The low and continuous supply of recombinant receptors ensures assembly with endogenous subunits to form heteropentameric receptor complexes (e.g. [Patrizio et al. 2017 Sci Rep]). In the present study, lentivirus infection did not induce any obvious differences in the number or size of inhibitory synapses compared to control neurons, as judged by Sylite labelling of synaptic gephyrin puncta (new Fig. 5D).

      (6) Quantification of protein numbers is challenging with SMLM. Issues include i) some of FP not correctly folded/mature, and ii) dependence of localisation rate on instrument, excitation/illumination intensities, and also the thresholds used in analysis. Can the authors compare with another protein that has known expression levels- e.g. PSD95? This is quite an ask, but if they could show copy number of something known to compare with, it would be useful. 

      We agree that absolute quantification with SMLM is challenging, since the number of detections depends on fluorophore maturation, photophysics, imaging conditions, and analysis thresholds (discussed in Patrizio & Specht 2016, Neurophotonics). For this reason, only very few datasets provide reliable copy numbers, even for well-studied proteins such as PSD-95. One notable exception is the study by Maynard et al. (eLife 2021) that quantified endogenous GlyRβ-containing receptors in spinal cord synapses using SMLM combined with correlative electron microscopy. The strength of this work was the use of a KI mouse strain, which ensures that mEos4b-GlyRβ expression follows intrinsic regional and temporal profiles. The authors reported a stereotypic density of ~2,000 GlyRs/µm² at synapses, corresponding to ~120 receptors per synapse in the dorsal horn and ~240 in the ventral horn, taking into account various parameters including receptor stoichiometry and the functionality of the fluorophore. These values are very close to our own calculations of GlyR numbers at spinal cord synapses that were obtained slightly differently in terms of sample preparation, microscope setup, imaging conditions, and data analysis, lending support to our experimental approach. Nevertheless, the obtained GlyR copy numbers at hippocampal synapses clearly have to be taken as estimates rather than precise figures, because the number of detections from a single mEos4b fluorophore can vary substantially, meaning that the fluorophores are not represented equally in pointillist images. This can affect the copy number calculation for a specific synapse, in particular when the numbers are low (e.g. in hippocampus); however, it should not alter the average number of detections (Fig. 1B) or the (median) molecule numbers of the entire population of synapses (Fig. 1C). We have discussed the limitations of our approach (p. 11).

      (7) Rationale for doing nanobody dSTORM not clear at all. They don't explain the reason for doing the dSTORM experiments. Why not just rely on PALM for coincidence measurements, rather than tagging mEoS with a nanobody, and then doing dSTORM with that? Can they explain? Is it to get extra localisations- i.e. multiple per nanobody? If so, localising same FP multiple times wouldn't improve resolution. Also, no controls for nanobody dSTORM experiments- what about non-spec nb, or use on WT sections? 

      As discussed above (point 6), the detection of fluorophores with SMLM is influenced by many parameters, not least the noise produced by emitting molecules other than the fluorophore used for labelling. Our study is exceptional in that it attempts to identify extremely low molecule numbers (down to 1). To verify that the detections obtained with PALM correspond to mEos4b, we conducted robust control experiments (including pixel-shift as suggested by the reviewer, see point 1, revised Fig. 1B). The rationale for the nanobody-based dSTORM experiments was twofold: (1) to have an independent readout of the presence of low-copy GlyRs at inhibitory synapses and (2) to analyse the nanoscale organisation of GlyRs relative to the synaptic gephyrin scaffold using dual-colour dSTORM with spectral demixing (see p. 6). The organic fluorophores used in dSTORM (AF647, CF680) ensure high photon counts, essential for reliable co-localisation and distance analysis. PALM and dSTORM cannot be combined in dual-colour mode, as they require different buffers and imaging conditions. 

      The specificity of the anti-Eos nanobody was demonstrated by immunohistochemistry in spinal cord cultures expressing mEos4b-GlyRb and wildtype control tissue (Fig. S3). In response to the reviewer's remarks, we also performed a negative control experiment in Glrb<sup>eos/eos</sup> slices (dSTORM), in which the nanobody was omitted (new Fig. S4F,G). Under these conditions, spectral demixing produced a single peak corresponding to CF680 (gephyrin) without any AF647 contribution (Fig. S4F). The number of "false" AF647 detections at synapses was significantly lower than in the slices labelled with the nanobody. We conclude that the fluorescence signal observed in our dual-colour dSTORM experiments arises from the specific detection of mEos4b-GlyRb by the nanobody, rather than from background, cross-reactivity or wrong attribution of colour during spectral demixing. We have added these data and explanations in the results (p. 7) and in the figure legend of Fig. S4F,G.

      (8) What resolutions/precisions were obtained in SMLM experiments? Should perform Fourier Ring Correlation (FRC) on SR images to state resolutions obtained (particularly useful for when they're presenting distance histograms, as this will be dependent on resolution). Likewise for precision, what was mean precision? Can they show histograms of localisation precision. 

      This is an interesting question in the context of our experiments with low-copy GlyRs, since the spatial resolution of SMLM is limited also by the density of molecules, i.e. the sampling of the structure in question (Nyquist-Shannon criterion). Accordingly, the priority of the PALM experiments was to improve the sensitivity of SMLM for the identification of mEos4b-GlyRb subunits, rather than to maximize the spatial resolution. The mean localisation precision in PALM was 33 +/- 12 nm, as calculated from the fitting parameters of each detection (Zeiss, ZEN software), which ultimately result from their signal-to-noise ratio. This is a relatively low precision for SMLM, which can be explained by the low brightness of mEos4b compared to organic fluorophores together with the elevated fluorescence background in tissue slices.

      In the case of dSTORM, the aim was to study the relative distribution of GlyRs within the synaptic scaffold, for which a higher localisation precision was required (p. 6). Therefore, detections with a precision ≥ 25 nm were filtered out during analysis with NEO software (Abbelight). The retained detections had a mean localisation precision of 12 +/- 5 nm for CF680 (Sylite) and 11 +/- 4 nm for AF647 (nanobody). These values are given in the revised manuscript (pp. 18, 22).
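
      The filtering step described above amounts to a simple threshold on the per-detection precision. The following Python sketch illustrates it under stated assumptions: each detection is represented as a dictionary with a hypothetical `precision_nm` field, the toy values are invented, and the code is not the NEO software's implementation. Only the 25 nm cutoff is taken from the text.

```python
import statistics

PRECISION_CUTOFF_NM = 25.0  # detections with precision >= cutoff are discarded

def filter_by_precision(detections, cutoff_nm=PRECISION_CUTOFF_NM):
    """Keep detections whose localisation precision is better (smaller) than the cutoff."""
    return [d for d in detections if d["precision_nm"] < cutoff_nm]

# Toy detections (hypothetical coordinates and precisions, in nm).
detections = [
    {"x_nm": 100.0, "y_nm": 210.0, "precision_nm": 8.0},
    {"x_nm": 104.0, "y_nm": 215.0, "precision_nm": 12.0},
    {"x_nm": 530.0, "y_nm": 90.0, "precision_nm": 25.0},   # exactly at cutoff: removed
    {"x_nm": 98.0, "y_nm": 205.0, "precision_nm": 16.0},
    {"x_nm": 760.0, "y_nm": 400.0, "precision_nm": 41.0},  # poorly localised: removed
]

retained = filter_by_precision(detections)
precisions = [d["precision_nm"] for d in retained]
mean_precision = statistics.fmean(precisions)
sd_precision = statistics.stdev(precisions)
print(f"{len(retained)} retained, precision = {mean_precision:.0f} +/- {sd_precision:.0f} nm")
```

      Note that the mean precision of the retained detections is necessarily better than the cutoff itself, which is why the reported values sit well below 25 nm.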

      (9) Why were DBSCAN parameters selected? How can they rule out multiple localisations per fluor? If low copy numbers (<10), then why bother with DBSCAN? Could just measure distance to each one. 

      Multiple detections of the same fluorophore are intrinsic to dSTORM imaging and have not been eliminated from the analysis. Small clusters of detections likely represent individual molecules (e.g. single receptors in the extrasynaptic regions, Fig. 2A). DBSCAN is a robust clustering method that is quite insensitive to minor changes in the choice of parameters. For dSTORM of synaptic gephyrin clusters (CF680), a relatively low length (80 nm radius) together with a high number of detections (≥ 50 neighbours) were chosen to reconstruct the postsynaptic domain with high spatial resolution (see point 8). In the case of the GlyR (nanobody-AF647), the clustering was done mostly for practical reasons, as it provided the coordinates of the centre of mass of the detections. The low stringency of this clustering (200 nm radius, ≥ 5 neighbours) effectively filters single detections that can result from background noise or incorrect demixing. An additional reference explaining the use of DBSCAN including the choice of parameters is given on p. 22 (see also R2 point 4).
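
      To make the role of the two DBSCAN parameters explicit, the following is a minimal, self-contained Python sketch of the algorithm, followed by the centre-of-mass calculation mentioned above. It is not the implementation used in the NEO software. The radius and neighbour threshold (200 nm, ≥ 5) are those quoted for the AF647 channel; counting a point as its own neighbour is an assumption, and the toy coordinates and function names are hypothetical.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster label per point (NOISE for unclustered points).

    A point is a core point if at least `min_pts` points (itself included)
    lie within distance `eps`.
    """
    def neighbours(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if math.hypot(xi - xj, yi - yj) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:     # j is a core point: keep expanding
                queue.extend(nbrs)
    return labels

def centre_of_mass(points, labels, cluster):
    """Mean x and y of all points assigned to `cluster`."""
    members = [p for p, lab in zip(points, labels) if lab == cluster]
    n = len(members)
    return (sum(x for x, _ in members) / n, sum(y for _, y in members) / n)

# Toy detections in nm (hypothetical): one small group plus an isolated detection.
points = [(1000.0, 1000.0), (1010.0, 990.0), (990.0, 1010.0),
          (1020.0, 1005.0), (980.0, 995.0), (1005.0, 1015.0),
          (3000.0, 3000.0)]
labels = dbscan(points, eps=200.0, min_pts=5)
cx, cy = centre_of_mass(points, labels, cluster=0)
print(labels, (round(cx, 1), round(cy, 1)))
```

      In this toy example the six grouped detections form one cluster, while the isolated detection is labelled as noise and excluded, which is the filtering behaviour described for the low-stringency setting.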

      (10) For microscopy experiment methods, state power densities, not % or "nominal power". 

      Done. We now report the irradiance (laser power density) instead of nominal power (pp. 18, 21). 

      (11) In general, not much data presented. Any SI file with extra images etc.? 

      The original submission included four supplementary figures with additional data and representative images that should have been available to the reviewer (Figs. S1-S4). The SI file has been updated during revision (new Fig. S4E-G). 

      (12) Clarification of the discussion on GlyR expression and synaptic localization: The discussion on GlyR expression, complex formation, and synaptic localization is sometimes unclear, and needs terminological distinctions between "expression level", "complex formation" and "synaptic localization". For example, the authors state:"What then is the reason for the low protein expression of GlyRβ? One possibility is that the assembly of mature heteropentameric GlyR complexes depends critically on the expression of endogenous GlyR α subunits." Does this mean that GlyRβ proteins that fail to form complexes with GlyRα subunits are unstable and subject to rapid degradation? If so, the authors should clarify this point. The statement "This raises the interesting possibility that synaptic GlyRs may depend specifically on the concomitant expression of both α1 and β transcripts." suggests a dependency on α1 and β transcripts. However, is the authors' focus on synaptic localization or overall protein expression levels? If this means synaptic localization, it would be beneficial to state this explicitly to avoid confusion. To improve clarity, the authors should carefully distinguish between these different aspects of GlyR biology throughout the discussion. Additionally, a schematic diagram illustrating these processes would be highly beneficial for readers. 

      We thank the reviewer for pointing this out. We are dealing with several processes: protein expression that determines subunit availability and the assembly of pentameric GlyR complexes, surface expression, membrane diffusion and accumulation of GlyRb-containing receptor complexes at inhibitory synapses. We have edited the manuscript, particularly the discussion, and tried to be as clear as possible in our wording.

      We chose not to add a schematic illustration for the time being, because any graphical representation is necessarily a simplification. Instead, we preferred to summarise the main numbers in tabular form (Table 1). We are of course open to any other suggestions.

      (13) Interpretation of GlyR localization in the context of nanodomains. The distribution of GlyR molecules on inhibitory synapses appears to be non-homogeneous, instead forming nanoclusters or nanodomains, similar to many other synaptic proteins. It is important to interpret GlyR localization in the context of nanodomain organization. 

      The dSTORM images in Fig. 2 are pointillist representations that show individual detections rather than molecules. Small clusters of detections are likely to originate from a single AF647 fluorophore (in the case of nanobody labelling) and therefore represent single GlyRb subunits. Since GlyR copy numbers are so low at hippocampal synapses (≤ 5), the notion of nanodomain is not directly applicable. Our analysis therefore focused on the integration of GlyRs within the postsynaptic scaffold, rather than attempting to define nanodomain structures (see also response to point 8 of R1). A clarification has been added in the revised manuscript (p. 6).

      Reviewer #1 (Significance): 

      The paper presents biological and technical advances. The biological insights revolve mostly on the documentation of Glycine receptors in particular synapses in forebrain, where they are typically expressed at very low levels. The authors provide compelling data indicating that the expression is of physiological significance. The authors have done a nice job of combining genetically-tagged mice with advanced microscopy methods to tackle the question of distributions of synaptic proteins. Overall these advances are more incremental than groundbreaking. 

      We thank the reviewer for acknowledging both the technical and biological advances of our study. While we recognize that our work builds upon established models, we consider that it also addresses important unresolved questions, namely that GlyRs are present and specifically anchored at inhibitory synapses in telencephalic regions, such as the hippocampus and striatum. From a methodological point of view, our study demonstrates that SMLM can be applied not only for structural analysis of highly abundant proteins, but also to reliably detect proteins present at very low copy numbers. This ability to identify and quantify sparse molecule populations adds a new dimension to SMLM applications, which we believe increases the overall impact of our study beyond the field of synaptic neuroscience.

      Reviewer #2 (Evidence, reproducibility and clarity): 

      In their manuscript "Single molecule counting detects low-copy glycine receptors in hippocampal and striatal synapses" Camuso and colleagues apply single molecule localization microscopy (SMLM) methods to visualize low copy numbers of GlyRs at inhibitory synapses in the hippocampal formation and the striatum. SMLM analysis revealed higher copy numbers in striatum compared to hippocampal inhibitory synapses. They further provide evidence that these low copy numbers are tightly linked to post-synaptic scaffolding protein gephyrin at inhibitory synapses. Their approach profits from the high sensitivity and resolution of SMLM and challenges the controversial view on the presence of GlyRs in these formations although there are reports (electrophysiology) on the presence of GlyRs in these particular brain regions. These new datasets in the current manuscript may certainly assist in understanding the complexity of fundamental building blocks of inhibitory synapses. 

      However I have some minor points that the authors may address for clarification: 

      (1) In Figure 1 the authors apply PALM imaging of mEos4b-GlyRß (knockin) and here the corresponding Sylite label seems to be recorded in widefield, it is not clearly stated in the figure legend if it is widefield or super-resolved. In Fig 1 A - is the scale bar 5 µm? Some Sylite spots appear to be sized around 1 µm, especially the brighter spots, but maybe this is due to the lower resolution of widefield imaging? Regarding the statistical comparison: what method was chosen to test for normality distribution, I think this point is missing in the methods section. 

      This is correct; the apparent size of the Sylite spots does not reflect the real size of the synaptic gephyrin domain due to the limited resolution of widefield imaging including the detection of out-of-focus light. We have clarified in the legend of Fig. 1A that Sylite labelling was imaged with classic epifluorescence microscopy. The scale bar in Fig. 1A corresponds to 5 µm. Since the data were not normally distributed, nonparametric tests (Kruskal-Wallis one-way ANOVA with Dunn’s multiple comparison test or Mann-Whitney U-test for pairwise comparisons) were used (p. 23).

      Moreover I would appreciate a clarification and/or citation that the knockin model results in no structural and physiological changes at inhibitory synapses, I believe this model has been applied in previous studies and corresponding clarification can be provided. 

      The Glrb<sup>eos/eos</sup> mouse model has been described previously and does not exhibit any structural or physiological phenotypes (Maynard et al. 2021 eLife). The issue was also raised by reviewer R1 (point 5) and has been clarified in the revised manuscript (p. 4).

      (2) In the next set of experiments the authors switch to demixing dSTORM experiments - an explanation why this is performed is missing in the text - I guess better resolution to perform more detailed distance measurements? For these experiments: which region of the hippocampus did the authors select, I cannot find this information in legend or main text. 

      Yes, the dSTORM experiments enable dual-colour structural analysis at high spatial resolution (see response to R1 point 7). An explanation has been added (p. 6).

      (3) Regarding parameters of demixing experiments: the number of frames (10.000) seems quite low and the exposure time higher than expected for Alexa 647. Can the authors explain the reason for choosing these particular parameters (low expression profile of the target - so better separation?, less fluorophores on label and shorter collection time?) or is there a reference that can be cited? The laser power is given in the methods in percentage of maximal output power, but for better comparison and reproducibility I recommend to provide the values of a power meter (kW/cm2) as lasers may change their maximum output power during their lifetime. 

      Acquisition parameters (laser power, exposure time) for dSTORM were chosen to obtain a good localisation precision (~12 nm; see R1 point 8). The number of frames is adequate to obtain well sampled gephyrin scaffolds in the CF680 channel. In the case of the GlyR (nanobody-AF647), the concept of spatial resolution does not really apply due to the low number of targets (see R1, point 13). Power density (irradiance) values have now been given (pp. 18, 21).

      (4) For analysis of subsynaptic distribution: how did the authors decide to choose the parameters in the NEO software for DBSCAN clustering - was a series of parameters tested to find optimal conditions and did the analysis start with an initial test if data is indeed clustered (K-ripley) or is there a reference in literature that can be provided? 

      DBSCAN parameters were optimised manually, by testing different values. Identification of dense and well-delimited gephyrin clusters (CF680) was achieved with a small radius and a high number of detections (80 nm, ≥ 50 neighbours), whereas filtering of low-density background in the AF647 channel (GlyRs) required less stringent parameters (200 nm, ≥ 5) due to the low number of target molecules. Similar parameters were used in a previous publication (Khayenko et al. 2022, Angewandte Chemie). The reference has been provided on p. 22 (see also R1 point 9).
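
      To make the effect of the two parameter sets concrete, the following is a minimal DBSCAN sketch on synthetic 2D coordinates (in nm). It is not the implementation used in the NEO software; the counting convention (neighbours exclude the point itself) and all coordinates are assumptions made for illustration only.

```python
import math


def dbscan(points, radius, min_neighbours):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise.

    A point is treated as a core point when at least `min_neighbours` other
    points lie within `radius` of it (assumed convention).
    """
    n = len(points)

    def neighbours(i):
        xi, yi = points[i]
        return [j for j in range(n) if j != i and
                math.hypot(points[j][0] - xi, points[j][1] - yi) <= radius]

    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_neighbours:
            labels[i] = -1  # noise for now; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_neighbours:
                queue.extend(nb)  # j is a core point, so keep expanding
    return labels


# Synthetic detections: a dense scaffold-like patch, a sparse group of
# 8 detections (e.g. one blinking fluorophore), and isolated background.
dense = [(5.0 * i, 5.0 * j) for i in range(15) for j in range(15)]
sparse_group = [(1000.0 + 10 * k, 1000.0) for k in range(8)]
isolated = [(2000.0, 2000.0), (3000.0, 500.0), (500.0, 3000.0)]
points = dense + sparse_group + isolated

strict = dbscan(points, radius=80, min_neighbours=50)   # dense-cluster settings
lenient = dbscan(points, radius=200, min_neighbours=5)  # sparse-channel settings
print(sorted(set(strict)), sorted(set(lenient)))
```

      With the stringent settings only the dense patch is kept, whereas the lenient settings also retain the sparse group while still rejecting isolated detections, which mirrors the rationale given above for using different parameters in the two channels.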

      (5) A conclusion/discussion of the results presented in Figure 5 is missing in the text/discussion. 

      This part of the manuscript has been completely overhauled. It includes new experimental data, quantification of the data (new Fig.5), as well as the discussion and interpretation of our findings (see also R1, point 3). In agreement with our earlier interpretation, the data confirm that low availability of GlyRa1 subunits limits the expression and synaptic targeting of GlyRa1/b heteropentamers. The observation that GlyRa1 overexpression with lentivirus increases the size of the postsynaptic gephyrin domain further points to a structural role, whereby GlyRs can enhance the stability (and size) of inhibitory synapses in hippocampal neurons, even at low copy numbers (pp. 13-14). 

      (6) In line 552 "suspension" is misleading, better use "solution" 

      Done.

      Reviewer #2 (Significance): 

      Significance: The manuscript provides new insights into the presence of low-copy-number receptors by visualizing them via SMLM. This is the first report that visualizes GlyRs optically in the brain using the knock-in model of mEos4b-tagged GlyRβ and quantifies their copy number, comparing the distribution and amount of GlyRs in the hippocampus and striatum. Imaging data correspond well to the electrophysiological measurements in the manuscript. 

      Field of expertise: Super-Resolution Imaging and corresponding analysis 

      Reviewer #4 (Evidence, reproducibility and clarity): 

      In this study, Camuso et al., make use of a knock-in mouse model expressing endogenously mEos4b-tagged GlyRβ to detect endogenous glycine receptors using single-molecule localization microscopy. The main conclusion from this study is that in the hippocampus GlyRβ molecules are barely detected, while inhibitory synapses in the ventral striatum seem to express functionally relevant GlyR numbers. 

      I have a few points that I hope help to improve the strength of this study. 

      - In the hippocampus, this study finds that the numbers of detections are very low. The authors perform adequate controls to indicate that these localizations are above noise level. Nevertheless, it remains questionable that these reflect proper GlyRs. The suggestion that in hippocampal synapses the low numbers of GlyRβ molecules "are important in assembly or maintenance of inhibitory synaptic structures in the brain" is in itself interesting, but is not at all supported. It is also difficult to envision how such low numbers could support the structure of a synapse. A functional experiment showing that knockdown of GlyRs affects inhibitory synapse structure in hippocampal neurons would be a minimal test of this. 

      It is not clear what the reviewer means by “it remains questionable that these reflect proper GlyRs”. The PALM experiments include a series of stringent controls (see R1, point 1) demonstrating the existence of low-copy GlyRs at inhibitory synapses in the hippocampus (Fig. 1) and in the striatum (Fig. 3), and are backed up by dSTORM experiments (Fig. 2). We have no reason to doubt that these receptors are fully functional (as demonstrated for the ventral striatum; Fig. 4). However, due to their low number, their role in inhibitory synaptic transmission is clearly limited, at least in the hippocampus and dorsal striatum. 

      We therefore propose a structural role, where the GlyRs could be required to stabilise the postsynaptic gephyrin domain in hippocampal neurons. This is based on the idea that the GlyR-gephyrin affinity is much higher than that of the GABAAR-gephyrin interaction (reviewed in Kasaragod & Schindelin 2018 Front Mol Neurosci). Accordingly, there is a close relationship between GlyRs and gephyrin numbers, sub-synaptic distribution, and dynamics in spinal cord synapses that are mostly glycinergic (Specht et al. 2013 Neuron; Maynard et al. 2021 eLife; Chapdelaine et al. 2021 Biophys J). It is reasonable to assume that low-copy GlyRs could play a similar structural role at hippocampal synapses. A knockdown experiment targeting these few receptors is technically very challenging and beyond the scope of this study. However, in response to the reviewer's question we have conducted new experiments in cultured hippocampal neurons (new Fig. 5). They demonstrate that overexpression of GlyRa1/b heteropentamers increases the size of the postsynaptic domain in these neurons, supporting our interpretation of a structural role of low-copy GlyRs (p. 14).

      - The endogenous tagging strategy is a very strong aspect of this study and provides confidence in the labeling of GlyRβ molecules. One caveat however, is that this labeling strategy does not discriminate whether GlyRβ molecules are on the cell membrane or in internal compartments. Can the authors provide an estimate of the ratio of surface to internal GlyRβ molecules? 

      Gephyrin is known to form a two-dimensional scaffold below the synaptic membrane to which inhibitory GlyRs and GABAARs attach (reviewed in Alvarez 2017 Brain Res). The majority of the synaptic receptors are therefore thought to be located in the synaptic membrane, which is supported by the close relationship between the sub-synaptic distribution of GlyRs and gephyrin in spinal cord neurons (e.g. Maynard et al. 2021 eLife). To demonstrate the surface expression of GlyRs at hippocampal synapses we labelled cultured hippocampal neurons expressing mEos4b-GlyRa1 with anti-Eos nanobody in non-permeabilised neurons (see Author response image 1). The close correspondence between the nanobody (AF647) and the mEos4b signal confirms that the majority of the GlyRs are indeed located in the synaptic membrane.

      Author response image 1.

      Left: Lentivirus expression of mEos4b-GlyRa1 in fixed and non-permeabilised hippocampal neurons (mEos4b signal). Right: Surface labelling of the recombinant subunit with anti-Eos nanobody (AF647). 

      - “We also estimated the absolute number of GlyRs per synapse in the hippocampus. The number of mEos4b detections was converted into copy numbers by dividing the detections at synapses by the average number of detections of individual mEos4b-GlyRβ containing receptor complexes”. In essence this is a correct method to estimate copy numbers, and the authors discuss some of the pitfalls associated with this approach (i.e., maturation of fluorophore and detection limit). Nevertheless, the authors did not subtract the number of background localizations determined in the two negative control groups. This is critical, particularly at these low-number estimations. 

      We fully agree that background subtraction can be useful with low detection numbers. In the revised manuscript, copy numbers are now reported as background-corrected values. Specifically, the mean number of detections measured in wildtype slices was used to calculate an equivalent receptor number, which was then subtracted from the copy number estimates across hippocampus, spinal cord and striatum. This procedure is described in the methods (p. 20) and results (p. 5, 8), and mentioned in the figure legends of Fig. 1C, 3C. The background corrected values are given in the text and Table 1.
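
      The arithmetic of this correction can be sketched as follows. This is a minimal illustration of the procedure as described in this response; the function names and all numerical values are hypothetical and do not correspond to measurements in the manuscript.

```python
def copy_number(detections, detections_per_receptor):
    """Convert a detection count into an equivalent number of receptor complexes."""
    return detections / detections_per_receptor


def background_corrected_copy_number(synaptic_detections,
                                     mean_wildtype_detections,
                                     detections_per_receptor):
    """Subtract the receptor-equivalent background measured in wildtype tissue."""
    raw = copy_number(synaptic_detections, detections_per_receptor)
    background = copy_number(mean_wildtype_detections, detections_per_receptor)
    return raw - background


# Hypothetical example: 30 detections at a synapse, 6 detections on average
# in wildtype controls, 12 detections per single receptor complex.
print(background_corrected_copy_number(30, 6, 12))
```

      Because both terms are divided by the same average number of detections per receptor complex, the correction is equivalent to subtracting the mean background detections before the conversion.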

      - Furthermore, the authors state that "The advantage of this estimation is that it is independent of the stoichiometry of heteropentameric GlyRs". However, if the stoichiometry is unknown, the number of counted GlyRβ subunits cannot simply be reported as the number of GlyRs. This should be discussed in more detail, and more carefully reported throughout the manuscript. 

      The reviewer is right to point this out. There is still some debate about the stoichiometry of heteropentameric GlyRs. Configurations with 2a:3b, 3a:2b and 4a:1b subunits have been advanced (e.g. Grudzinska et al. 2005 Neuron; Durisic et al. 2012 J Neurosci; Patrizio et al. 2017 Sci Rep; Zhu & Gouaux 2021 Nature). We have therefore chosen a quantification that is independent of the underlying stoichiometry. Since our quantification is based on very sparse clusters of mEos4b detections that likely originate from a single receptor complex (irrespective of its stoichiometry), the reported values actually reflect the number of GlyRs (and not GlyRb subunits). We have clarified this in the results (p. 5) and throughout the manuscript (Table 1). 

      - The dual-color imaging provides insights into the subsynaptic distribution of GlyRβ molecules in hippocampal synapses. Why are similar studies not performed on synapses in the ventral striatum, where functionally relevant numbers of GlyRβ molecules are found? Here, insights into the subsynaptic receptor distribution would be of much more interest, as they can be tied to function. 

      This is an interesting suggestion. However, the primary aim of our study was to identify the existence of GlyRs in hippocampal regions. At low copy numbers, the concept of sub-synaptic domains (SSDs, e.g. Yang et al. 2021 EMBO Rep) becomes irrelevant (see R1 point 13). It should be pointed out that the dSTORM pointillist images (Fig. 2A) represent individual GlyR detections rather than clusters of molecules. In the striatum, our specific purpose was to solve an open question about the presence of GlyRs in different subregions (putamen, nucleus accumbens).

      - It is unclear how the experiments in Figure 5 add to this study. These results are valid, but do not seem to directly test the hypothesis that "the expression of α subunits may be limiting factor controlling the number of synaptic GlyRs". These experiments simply test if overexpressed α subunits can be detected. If the α subunits are limiting, measuring the effect of α subunit overexpression on GlyRβ surface expression would be a more direct test. 

      Both R1 and R2 have also commented on the data in Fig. 5 and their interpretation. We have substantially revised this section as described before (see R1 point 3) including additional experiments and quantification of the data (new Fig. 5). The findings lend support to our earlier hypothesis that GlyR alpha subunits (in particular GlyRa1) are the limiting factor for the expression of heteropentameric GlyRa/b in hippocampal neurons (pp. 13-14). Since the GlyRa1 subunit itself does not bind to gephyrin (Patrizio et al. 2017 Sci Rep), the synaptic localisation of the recombinant mEos4b-GlyRa1 subunits is proof that they have formed heteropentamers with endogenous GlyRb subunits and driven their membrane trafficking, which the GlyRb subunits are incapable of doing on their own.

      Reviewer #4 (Significance): 

      These results are based on carefully performed single-molecule localization experiments, and are well-presented and described. The knock-in mouse with endogenously tagged GlyRβ molecules is a very strong aspect of this study and provides confidence in the labeling. The combination with single-molecule localization microscopy is also very strong, as it provides high sensitivity and spatial resolution. 

      The conceptual innovation however seems relatively modest, these results confirm previous studies but do not seem to add novel insights. This study is entirely descriptive and does not bring new mechanistic insights. 

      This study could be of interest to a specialized audience interested in glycine receptor biology, inhibitory synapse biology and super-resolution microscopy. 

      My expertise is in super-resolution microscopy, synaptic transmission and plasticity 

      As we have stated before, the novelty of our study lies in the use of SMLM for the identification of very small numbers of molecules, which requires careful control experiments. This is something that has not been done before and that can be of interest to a wider readership, as it opens up SMLM for ultrasensitive detection of rare molecular events. Using this approach, we solve two open scientific questions: (1) the demonstration that low-copy GlyRs are present at inhibitory synapses in the hippocampus, (2) the sub-region specific expression and functional role of GlyRs in the ventral versus dorsal striatum.

      The following review was provided later under the name “Reviewer #4”. To avoid confusion with the last reviewer from above we will refer to this review as R4-2.

      Reviewer #4-2 (Evidence, reproducibility and clarity):  

      Summary:

      Provide a short summary of the findings and key conclusions (including methodology and model system(s) where appropriate).

      The authors investigate the presence of synaptic glycine receptors in the telencephalon, whose presence and function is poorly understood. 

      Using a transgenically labeled glycine receptor beta subunit (Glrb-mEos4b) mouse model together with super-resolution microscopy (SMLM, dSTORM), they demonstrate the presence of a low but detectable amount of synaptically localized GLRB in the hippocampus. While they do not perform a functional analysis of these receptors, they do demonstrate that these subunits are integrated into the inhibitory postsynaptic density (iPSD) as labeled by the scaffold protein gephyrin. These findings demonstrate that a low level of synaptically localized glycine receptor subunits exists in the hippocampal formation, although whether or not they have a functional relevance remains unknown.

      They then proceed to quantify synaptic glycine receptors in the striatum, demonstrating that the ventral striatum has a significantly higher amount of GLRB co-localized with gephyrin than the dorsal striatum or the hippocampus. They then recorded pharmacologically isolated glycinergic miniature inhibitory postsynaptic currents (mIPSCs) from striatal neurons. In line with their structural observations, these recordings confirmed the presence of synaptic glycinergic signaling in the ventral striatum, and an almost complete absence in the dorsal striatum. Together, these findings demonstrate that synaptic glycine receptors in the ventral striatum are present and functional, while an important contribution to dorsal striatal activity is less likely.

      Lastly, the authors use existing mRNA and protein datasets to show that the expression level of GLRA1 across the brain positively correlates with the presence of synaptic GLRB.

      The authors use lentiviral expression of mEos4b-tagged glycine receptor alpha1, alpha2, and beta subunits (GLRA1, GLRA2, GLRB) in cultured hippocampal neurons to investigate the ability of these subunits to cause the synaptic localization of glycine receptors. They suggest that the alpha1 subunit has a higher propensity to localize at the inhibitory postsynapse (labeled via gephyrin) than the alpha2 or beta subunits, and may therefore contribute to the distribution of functional synaptic glycine receptors across the brain.

      Major comments:

      - Are the key conclusions convincing?

      The authors are generally precise in the formulation of their conclusions.

      (1) They demonstrate a very low, but detectable, amount of a synaptically localized glycine receptor subunit in a transgenic (GlrB-mEos4b) mouse model. They demonstrate that the GLRB-mEos4b fusion protein is integrated into the iPSD as determined by gephyrin labelling. The authors do not perform functional tests of these receptors and do not state any such conclusions.

      (2) The authors show that GLRB-mEos4b is clearly detectable in the striatum and integrated into gephyrin clusters at a significantly higher rate in the ventral striatum compared to the dorsal striatum, which is in line with previous studies.

      (3) Adding to their quantification of GLRB-mEos4b in the striatum, the authors demonstrate the presence of glycinergic miniature IPSCs in the ventral striatum, and an almost complete absence of mIPSCs in the dorsal striatum. These currents support the observation that GLRB-mEos4b is more synaptically integrated in the ventral striatum compared to the dorsal striatum.

      (4) The authors show that lentiviral expression of GLRA1-mEos4b leads to a visually higher number of GLR clusters in cultured hippocampal neurons, and a co-localization of some clusters with gephyrin. The authors claim that this supports the idea that GLRA1 may be an important driver of synaptic glycine receptor localization. However, no quantification or statistical analysis of the number of puncta or their colocalization with gephyrin is provided for any of the expressed subunits. Such a claim should be supported by quantification and statistics.

      A thorough analysis and quantification of the data in Fig.5 has been carried out as requested by all the other reviewers (e.g. R1, point 3). The new data and results have been described in the revised manuscript (pp. 9-10, 13-14).

      - Should the authors qualify some of their claims as preliminary or speculative, or remove them altogether?

      One unaddressed caveat is the fact that a GLRB-mEos4b fusion protein may behave differently in terms of localization and synaptic integration than wild-type GLRB. While unlikely, it is possible that mEos4b interacts either with itself or synaptic proteins in a way that changes the fused GLRB subunit’s localization. Such an effect would be unlikely to affect synaptic function in a measurable way, but might be detected at a structural level by highly sensitive methods such as SMLM and STORM in regions with very low molecule numbers (such as the hippocampus). Since reliable antibodies against GLRB in brain tissue sections are not available, this would be difficult to test. Considering that no functional measures of the hippocampal detections exist, we would suggest that this possible caveat be mentioned for this particular experiment.

      This question has also been raised before (R1, point 5). According to an earlier study the mEos4b-GlyRb knock-in does not cause any obvious phenotypes, with the possible exception of minor loss of glycine potency (Maynard et al. 2021 eLife). The fact that the synaptic levels in the spinal cord in heterozygous animals are precisely half of those of homozygous animals argues against differences in receptor expression, heteropentameric assembly, forward trafficking to the plasma membrane and integration into the synaptic membrane as confirmed using quantitative super-resolution CLEM (Maynard et al. 2021 eLife). Accordingly, we did not observe any behavioural deficits in these animals, making it a powerful experimental model. We have added this information in the revised manuscript (p. 4). 

      In addition, without any quantification or statistical analysis, the authors’ claims regarding the necessity of GLRA1 expression for the synaptic localization of glycine receptors in cultured hippocampal neurons should probably be described as preliminary (Fig. 5).

      As mentioned before, we have substantially revised this part (R1, point 3). The quantification and analysis in the new Fig. 5 support our earlier interpretation.

      - Would additional experiments be essential to support the claims of the paper? Request additional experiments only where necessary for the paper as it is, and do not ask authors to open new lines of experimentation.

      The authors show that there is colocalization of gephyrin with the mEos4b-GlyRβ subunit using the Dual-colour SMLM. This is a powerful approach that allows for a claim to be made on the synaptic location of the glycine receptors. The images presented in Figure 1, together with the distance analysis in Figure 2, display the co-localization of the fluorophores. The co-localization images in all the selected regions, hippocampus and striatum, also show detections outside of the gephyrin clusters, which the authors refer to as extrasynaptic. These punctated small clusters seem to have the same size as the ones detected and assigned as part of the synapse. It would be informative if the authors analysed the distribution, density and size of these nonsynaptic clusters and presented the data in the manuscript and also compared it against the synaptic ones. Validating this extrasynaptic signal by staining for a dendritic marker, such as MAP-2 or maybe a somatic marker and assessing the co-localization with the non-synaptic clusters would also add even more credibility to them being extrasynaptic. 

      The existence of extrasynaptic GlyRs is well attested in spinal cord neurons (e.g. Specht et al. 2013 Neuron; this study see Fig. S2). These receptors appear as small clusters of detections in SMLM recordings because a single fluorophore can be detected several times in consecutive image frames and because of blinking. Therefore, small clusters of detections likely represent single GlyRs (that can be counted), and not assemblies of several receptor complexes. Due to their diffusion in the neuronal membrane, they are seen as diffuse signals throughout the somatodendritic compartment in epifluorescence images (e.g. Fig. 5A). SMLM recordings of the same cells resolve this diffuse signal into discrete nanoclusters representing individual receptors (Fig. 5B). It is not clear what information co-localisation experiments with specific markers could provide, especially in hippocampal neurons, in which the copy numbers (and density) of GlyRs are next to zero.

      In addition we would encourage the authors to quantify the clustering and co-localization of virally expressed GLRA1, GLRA2, and GLRB with gephyrin in order to support the associated claims (Fig. 5). Preferably, the density of GLR and gephyrin clusters (at least on the somatic surface, the proximal dendrites, or both) as well as their co-localization probability should be quantified if a causal claim about subunit-specific requirements for synaptic localization is to be made.

      Quantification of the data has been carried out (new Fig. 5C,D). The results have been described before (R1, point 3) and support our earlier interpretation of the data (pp. 13-14).

      Lastly, even though it may be outside of the scope of such a study analysing other parts of the hippocampal area could provide additional important information. If one looks at the Allen Institute’s ISH of the beta subunit the strongest signal comes from the stratum oriens in the CA1 for example, suggesting that interneurons residing there would more likely have a higher expression of the glycine receptors. This could also be assessed by looking more carefully at the single cell transcriptomics, to see which cell types in the hippocampus show the highest mRNA levels. If the authors think that this is too much additional work, then perhaps a mention of this in the discussion would be good. 

      We have added the requested information from the ISH database of the Allen Institute in the discussion as suggested by the reviewer (p. 12). However, in combination with the transcriptomic data (Fig. S1), our findings strongly suggest that the expression of synaptic GlyRs depends on the availability of alpha subunits rather than on the presence of the GlyRb transcript. This is obvious when one compares the mRNA levels in the hippocampus with those in the basal ganglia (striatum) and medulla. While the transcript concentrations of GlyRb are elevated in all three regions and essentially the same, our data show that the GlyRb copy numbers at synapses differ by more than 2 orders of magnitude (Fig. 1B, Table 1). 

      - Are the suggested experiments realistic in terms of time and resources? It would help if you could add an estimated cost and time investment for substantial experiments.

      Since the labeling and some imaging has been performed already, the requested experiment would be a matter of deploying a method of quantification. In principle, it should not require any additional wet-lab experiments, although it may require additional imaging of existing samples.

      - Are the data and the methods presented in such a way that they can be reproduced?

      Yes, for the most part.

      - Are the experiments adequately replicated and statistical analysis adequate?

      Yes

      Minor comments:

      - Specific experimental issues that are easily addressable.

      N/A

      - Are prior studies referenced appropriately?

      Yes

      - Are the text and figures clear and accurate?

      Yes, although quantification in figure 5 is currently not present.

      A quantification has been added (see R1, point 3).

      - Do you have suggestions that would help the authors improve the presentation of their data and conclusions?

      This paper presents a method that could be used to localize receptors and perhaps other proteins that are in low abundance or for which a detailed quantification is necessary. I would therefore suggest that Figure S4 is included into Figure 2 as the first panel, showcasing the demixing, followed by the results. 

      We agree in principle with this suggestion. However, the revised Fig. S4 is more complex and we think that it would distract from the data shown in Fig. 2. Given that Fig. S4 is mostly methodological and not essential to understand the text, we have kept it in the supplement for the time being. We leave the final decision on this point to the editor.

      Reviewer #4-2 (Significance): 

      [This review was supplied later]

      - Describe the nature and significance of the advance (e.g. conceptual, technical, clinical) for the field.

      Using a novel and high-resolution method, the authors have provided strong evidence for the presence of glycine receptors in the murine hippocampus and in the dorsal striatum. The number of receptors calculated is small compared to the numbers found in the ventral striatum. This is the first study to quantify receptor numbers in these regions. In addition, it also lays out a roadmap for future studies addressing similar questions. 

      - Place the work in the context of the existing literature (provide references, where appropriate).

      This is done well by the authors in the curation of the literature. As stated above, the authors have filled a gap in the presence of glycine receptors in different brain regions, a subject of importance in understanding the role they play in brain activity and function. 

      - State what audience might be interested in and influenced by the reported findings.

      Neuroscientists working at the synaptic level, on inhibitory neurotransmission and on fundamental mechanisms of expression of genes at low levels and their relationship to the presence of the protein would be interested. Furthermore, researchers in neuroscience and cell biology may benefit from and be inspired by the approach used in this manuscript, to potentially apply it to address their own aims. 

      We thank the reviewer for the positive assessment of the technical and biological implications of our work, as well as the interest of our findings to a wide readership of neuroscientists and cell biologists. 

      - Define your field of expertise with a few keywords to help the authors contextualize your point of view. Indicate if there are any parts of the paper that you do not have sufficient expertise to evaluate.

      Synaptic transmission, inhibitory cells and GABAergic synapses functionally and structurally, cortex and cortical circuits. No strong expertise in super-resolution imaging methods.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      This very thorough anatomical study addresses the innervation of the Drosophila male reproductive tract. Two distinct glutamatergic neuron types were classified: serotonergic (SGNs) and octopaminergic (OGNs). By expansion microscopy, it was established that glutamate and serotonin/octopamine are co-released. The expression of different receptors for 5-HT and OA in muscles and epithelial cells of the innervation target organs was characterized. The pattern of neurotransmitter receptor expression in the target organs suggests that seminal fluid and sperm transport and emission are subject to complex regulation. While silencing of abdominal SGNs leads to male infertility and prevents sperm from entering the ejaculatory duct, silencing of OGNs does not render males infertile. 

      Strengths: 

      The studied neurons were analysed with different transgenes and methods, as well as antibodies against neurotransmitter synthesis enzymes, building a consistent picture of their neurotransmitter identity. The careful anatomical description of innervation patterns together with receptor expression patterns of the target organs provides a solid basis for advancing the understanding of how seminal fluid and sperm transport and emission are subjected to complex regulation. The functional data showing that SGNs are required for male fertility and for the release of sperm from the seminal vesicle into the ejaculatory duct is convincing. 

      Weaknesses: 

      The functional analysis of the characterized neurons is not as comprehensive as the anatomical description, and phenotypic characterization was limited to simple fertility assays. It is understandable that a full functional dissection is beyond the scope of the present work. The paper contains experiments showing neuron-independent peristaltic waves in the reproductive tract muscles, which are thematically not very well integrated into the paper. Although very interesting, one wonders if these experiments would not fit better into a future work that also explores these peristaltic waves and their interrelation with neuromodulation mechanistically. 

      Reviewer #2 (Public review): 

      Summary: 

      Cheverra et al. present a comprehensive anatomical and functional analysis of the motor neurons innervating the male reproductive tract in Drosophila melanogaster, addressing a gap in our understanding of the peripheral circuits underlying ejaculation and male fertility. They identify two classes of multi-transmitter motor neurons, OGNs (octopamine/glutamate) and SGNs (serotonin/glutamate), with distinct innervation patterns across reproductive organs. The authors further characterize the differential expression of glutamate, octopamine, and serotonin receptors in both epithelial and muscular tissues of these organs. Behavioral assays reveal that SGNs are essential for male fertility, whereas OGNs and glutamatergic transmission are dispensable. This work provides a high-resolution map linking neuromodulatory identity to organ-specific motor control, offering a valuable framework to explore the neural basis of male reproductive function. 

      Strengths: 

      Through the use of an extensive set of GAL4 drivers and antibodies, this work successfully and precisely defines the neurons that innervate the male reproductive tract, identifying the specific organs they target and the nature of the neurotransmitters they release. It also characterizes the expression patterns and localization of the corresponding neurotransmitter receptors across different tissues. The authors describe two distinct groups of dual-identity neurons innervating the male reproductive tract: OGNs, which co-express octopamine and glutamate, and SGNs, which co-express serotonin and glutamate. They further demonstrate that the various organs within the male reproductive system differentially express receptors for these neurotransmitters. Based on these findings, the authors propose that a single neuron capable of co-releasing a fast-acting neurotransmitter alongside a slower-acting one may more effectively synchronize and stagger events that require precise timing. This, together with the differential expression of ionotropic glutamate receptors and metabotropic aminergic receptors in postsynaptic muscle tissue, adds an additional layer of complexity to the coordinated regulation of fluid secretion, organ contractility, and directional sperm movement, all contributing to the optimization of male fertility. 

      Weaknesses: 

      The main weakness of the manuscript is the lack of detail in the presentation of the results. Specifically, all microscopy image figures are missing information about the number of samples (N), and in the case of colocalization experiments, quantitative analyses are not provided. Additionally, in the first behavioral section, it would be beneficial to complement the data table with figures similar to those presented later in the manuscript for consistency and clarity. 

      Wider context: 

      This study delivers the first detailed anatomical map connecting multi-transmitter motor neurons with specific male reproductive structures. It highlights a previously unrecognized functional specialization between serotonergic and octopaminergic pathways and lays the groundwork for exploring fundamental neural mechanisms that regulate ejaculation and fertility in males. The principles uncovered here may help explain how males of Drosophila and other organisms adjust reproductive behaviors in response to environmental changes. Furthermore, by shedding light on how multi-transmitter systems operate in reproductive control, this model could provide insights into therapeutic targets for conditions such as male infertility and prostate cancer, where similar neuronal populations are involved in humans. Ultimately, this genetically accessible system serves as a powerful tool for uncovering how multi-transmitter neurons orchestrate coordinated physiological actions necessary for the functioning of complex organs. 

      Reviewer #3 (Public review): 

      Summary: 

      This work provides an overview of the motor neuron landscape in the male reproductive system. Some work had been done to elucidate the circuits of ejaculation in the spine, as well as the cord, but this work fills a gap in knowledge at the level of the reproductive organs. Using complementary approaches, the authors show that there are two types of motor neurons that are mutually exclusive: neurons that co-express octopamine and glutamate and neurons that co-express serotonin and glutamate. They also show evidence that both types of neurons express large dense core vesicles, indicating that neuropeptides play a role in male fertility. This paper provides a thorough characterization of the expression of the different glutamate, octopamine, and serotonin receptors in the different organs and tissues of the male reproductive system. The differential expression in different tissues and organs allows building initial theories on the control of emission and expulsion. Additionally, the authors characterize the expression of synaptic proteins and the neuromuscular junction sites. On a mechanistic level, the authors show that neither octopamine/glutamate neuron transmission nor glutamate transmission in serotonin/glutamate neurons is required for male fertility. This final result is quite surprising and opens up many questions on how ejaculation is coordinated. 

      Strengths: 

      This work fills an important gap in the characterization of innervation of the male reproductive system by providing an extensive characterization of the motor neurons and the potential receptors of motor neuron release. The authors show convincing evidence of glutamate/monoamine co-release and of mutual exclusivity of serotonin/glutamate and octopamine/glutamate neurons. 

      Weaknesses: 

      (1) Often, it is mentioned that the expression is higher or lower or regional without quantification or an indication of the number of samples analysed. 

      (2) The experiment aimed at tracking sperm in the male reproductive system is difficult to interpret when it is not assessed whether ejaculation has occurred. 

      (3) The experiment looking at peristaltic waves in the male organs is missing labeling of the different regions and quantification of the observed waves. 

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      (1) While the peripheral innervations are very carefully described, it is not clear to which SGNs and OGNs (i.e., cell bodies in the central nervous system) these innervations belong. Are SV, AG, and ED innervated by branches of one neuron or by separate neurons? Multi-color flip-out experiments could provide an answer to this. 

      We agree this is important and are planning these experiments for a follow-up study.

      (2) In contrast, for the analysis of the VT19028 split line (Figure 9), only vnc and cell body images are shown. How do the arborisations of these split combinations look in the periphery? Are the same reproductive organs innervated as shown in Figure 2?

      Figure 9S3 was inadvertently omitted from the initial submission.  That figure is now included and shows that the VT019028 split broadly innervates the SV, AG, and ED.

      (3) In the discussion, I think it would be helpful to offer some potential explanations for the role of octopaminergic and glutamatergic signaling. If not required for basic fertility, they probably have some other role.

      Thank you, we have included speculation in the Discussion section "Potential for adaptation to environment".

      (4) Line 543: Figure 8S4 E, (not 8E). 

      Correction made.

      Reviewer #2 (Recommendations for the authors): 

      (1) Line 213-217 

      Comment:

      The use of "significantly less expression" may be misleading, as no quantification or statistical analysis is provided to support this comparison. 

      Suggestion:

      Consider using a more neutral term, such as "markedly less" or "noticeably less," unless quantitative data and statistical analysis are included to substantiate the claim.

      Good recommendation. This suggestion has been incorporated.

      (2) Line 264-267 

      Comment:

      The observation regarding the distinct morphology of SGNs and OGNs is interesting and could strengthen the argument regarding functional differences. 

      Suggestion: 

      Consider including a quantification of morphological complexity (e.g., branching) to support the claim. A method such as Sholl analysis (Sholl, 1953), as adapted in Fernández et al., 2008, could be applied. 

      This is a good suggestion, and we will consider it as part of a follow-up study.
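
      For reference, the core of a Sholl analysis is a count of how many neurite segments cross each of a series of concentric circles centered on a reference point. The following is a minimal sketch only, not the procedure of either cited paper: it assumes a 2D tracing stored as straight line segments (a hypothetical input format), and it counts a segment as crossing a circle when its two endpoints lie on opposite sides of it, which is a reasonable approximation for short segments.

```python
import math


def sholl_counts(segments, center, radii):
    """Count crossings of each concentric circle by traced neurite segments.

    segments: iterable of ((x1, y1), (x2, y2)) line segments (hypothetical tracing format)
    center:   (x, y) origin of the concentric circles
    radii:    iterable of circle radii, in the same units as the coordinates
    """
    cx, cy = center
    counts = {}
    for r in radii:
        n = 0
        for (x1, y1), (x2, y2) in segments:
            d1 = math.hypot(x1 - cx, y1 - cy)
            d2 = math.hypot(x2 - cx, y2 - cy)
            # A segment crosses the circle when its endpoints straddle the radius.
            if min(d1, d2) < r <= max(d1, d2):
                n += 1
        counts[r] = n
    return counts


# Toy arbor (made-up coordinates): one trunk that splits into two branches.
toy_arbor = [((0, 0), (10, 0)), ((10, 0), (20, 5)), ((10, 0), (20, -5))]
profile = sholl_counts(toy_arbor, center=(0, 0), radii=[5, 15, 25])
```

      In the toy arbor, the inner circle is crossed once (by the trunk) and the middle circle twice (by the two branches). Comparing such profiles between SGN and OGN arbors would be one way to quantify branching complexity.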

      (3) Line 269-271 

      Comment:

      The anatomical context of the observation is not explicitly stated. 

      Suggestion:

      Add "in the ED" for clarity: "With the TRH-GAL4 experiment in the ED, vGlut-40XMYC (Figure 5S1, A and E) and 6XV5-vMAT (Figure 5S1, B and F) were both present with a highly overlapping distribution (Figure 5S1, I)." 

      Suggestion has been incorporated.

      (4) Line 275-276 

      Comment:

      The claim about the reduced ability to distinguish SGNs and OGNs in the ED would benefit from quantitative support. 

      Suggestion:

      Include a morphological comparison or quantification between SGNs and OGNs in the ED and SV to reinforce this point.

      Some information on the morphological comparison can be inferred from the images themselves, and we will include quantitation in a follow-up study.

      (5) Line 277-279 

      Comment:

      As with line 269, the anatomical site could be specified more clearly. 

      Suggestion: 

      Rephrase as: "With the Tdc2-GAL4 experiment in the ED, vGlut-40XMYC (Figure 5S1, M and Q) and 6XV5-vMAT (Figure 5S1, N and R) were both observed in a highly overlapping distribution (Figure 5S1, U)." 

      Suggestion has been incorporated.

      (6) Line 348-350 

      Comment:

      The phrase "significantly higher density" implies a statistical comparison that is not shown. 

      Suggestion:

      If no quantification is provided, replace with a qualitative term such as "visibly higher" or "notably more dense." Alternatively, add a quantitative analysis with statistical testing to justify the use of "significantly." 

      Suggestion has been incorporated.

      (7) Lines 415-458 (Section comment) 

      Comment:

      There appears to be differential localization of neurotransmitter receptor expression (glutamate in muscle vs. 5-HT in epithelium or neurons), which could have functional implications. 

      Suggestion:

      Expand this section to briefly discuss the differential localization patterns of these receptors and potential implications for signal transduction in male reproductive tissues. 

      (8) Lines 638-682 (Section comment) 

      Comment:

      The table summarizing fertility phenotypes would be more informative with additional detail on experimental outcomes. 

      Suggestion:

      Add a column showing the number of fertile males over the total tested (e.g., "n fertile / n total"). Also, clarify whether the fertility assays are identical to those reported in Figure 10S2, and whether similar analyses were conducted for females. Consider including a figure summarizing fertility results for all genotypes listed in the table, similar to Figure 10S2. 

      The fertility tests reported in Table 1 were separate from those reported in Figure 10S2.  For these tests, the results were clear-cut with 100% of males and females reported as infertile exhibiting the infertile phenotype.  For the males and females reported as fertile, it was also clear-cut with nearly 100% showing fertility at a high level.  In subsequent figures we attempted to assess degrees of fertility.

      (9) Line 724-727 

      Comment:

      There seems to be a mistake in the identification of the driver lines used to silence OA neurons. Also, figure references might be incorrect. 

      Suggestion:

      The OA neuron driver line should be corrected to "Tdc2-GAL4-DBD ∩ AbdB-AD" instead of TRH-GAL4. Additionally, the figure references should be verified; specifically, the letter "B" (in "Figure 10B, D" and "10B, E") appears to be unnecessary or misplaced.

      Thanks for catching this, the corrections have been made.

      (10) Line 872-877 

      Comment:

      The discussion on the co-release of fast-acting glutamate and slower aminergic neurotransmitters is interesting and well-articulated. However, it remains somewhat disconnected from the behavioral findings. 

      Suggestion:

      Consider linking this proposed mechanism to the results observed in the mating duration assays. For instance, the sequential action of neurotransmitters described here could potentially underlie the prolonged mating observed when specific neuromodulators are active, helping to functionally integrate molecular and behavioral data. 

      (11) Line 926-928 

      Comment:

      The interpretation of 5-HT7 receptor expression in the sphincter is compelling, suggesting a role in regulating its function. However, this anatomical observation could be further contextualized with the functional data. 

      Suggestion:

      It may strengthen the interpretation to explicitly connect this finding with the fertility assays, where SGNs - presumably acting via serotonergic signaling - are shown to be necessary for male fertility. This would support a functional role for 5-HT7 in reproductive success via sphincter regulation.

      This has been added. 

      (12) Figure 1 

      Comment:

      The figure legend is generally clear, but could benefit from more consistency and precision in the color-coded labeling. Additionally, the naming of some structures could be more explicit. 

      Suggestion: 

      Revise the figure and the legend as follows:

      Figure 1. The Drosophila male reproductive system. A) Schematic diagram showing paired testes (colour), SVs (green), AGs (purple), Sph (red), ED (gray), and EB (colour). B) Actual male reproductive system. Te - testes, SV - seminal vesicle, AG - accessory gland, Sph - singular sphincter, ED - ejaculatory duct, EB - ejaculatory bulb. Scale bar: 200 µm.

      This suggestion has been incorporated.

      (13) Figure 3S2 

      Comment:

      There appears to be a typographical error in the description of the genotypes, which may lead to confusion. 

      Suggestion:

      Correct the legend to reflect the appropriate genotypes:

      Figure 3S2. Expression of vGlut-LexA and Tdc2-GAL4 in the Drosophila male reproductive system. A, D, G, J, M, P) vGlut-LexA, LexAop-6XmCherry; B, E, H, K, N, Q) Tdc2-GAL4, UAS-6XGFP; C, F, I, L, O, R) Overlay. Scale bars: O - 50 µm; R - 10 µm.

      The corrections have been made.

      (14) Figure 3S3

      Comment:

      The genotypes for panels D and E appear to be incomplete; the DBD component of the split-GAL4 drivers is missing. 

      Suggestion:

      Update the figure legend to: 

      Figure 3S3. Fruitless and Doublesex expression in the Drosophila male reproductive system. A) fru-GAL4, UAS-6XGFP; B) vGlut-LexA, LexAop-6XmCherry; C) Overlay; D) Tdc2-AD ∩ dsx-GAL4-DBD; E) TRH-AD ∩ dsx-GAL4-DBD. Scale bar: 200 µm.

      The corrections have been made.

      (15) Figure 4S4 

      Comment: 

      There is a repeated segment in the figure legend, which makes it unclear and redundant. 

      Suggestion:

      Edit the legend to remove the duplicated lines: 

      Figure 4S4. Expression of vGlut, TβH-GFP, and 5-HT at the junction of the SV and AGs with the ED of the Drosophila male reproductive system. A) vGlut-40XV5; B) TβH-GFP; C) 5-HT; D) vGlut-40XV5, TβH-GFP overlay; E) vGlut-40XV5, 5-HT overlay; F) TβH-GFP, 5-HT overlay. Scale bar: 50 µm.

      The correction has been made.

      (16) Figure 6S5 

      Comment:

      Within this figure, the orientation and/or scale of the tissue varies noticeably between individual panels, making it difficult to directly compare the different experimental conditions. 

      Suggestion:

      For improved clarity and interpretability, consider standardizing the orientation and size of the tissue shown across all panels within the figure. Consistent presentation will facilitate direct comparisons between treatments or genotypes. 

      There is often variation in the size of the male reproductive organs. They were all acquired at the same magnification. The only point of this figure is there is no vGAT or vAChT at these NMJs and the result is unambiguously negative. 

      (17) Figure 10 

      Comment:

      Panel A appears redundant, as it shows the same information as the other panels but without indicating statistical significance. 

      Suggestion:

      Consider removing panel A and keeping only the remaining four graphs, which include relevant statistical comparisons and clearly show significant differences.

      We realize there is some redundancy of panel A with the other panels, but we feel there is value in having all the genotypes in a single panel for comparison.

      Reviewer #3 (Recommendations for the authors): 

      Here are some suggestions to improve the manuscript: 

      (1) Prot B GFP experiment: the authors should explain better the time chosen to look at the sperm content of the male reproductive system. At 10 minutes, it is expected that the male has already ejaculated, and therefore, a failure to ejaculate would result in more sperm in the reproductive system, not less. Since we are not certain when the male ejaculates, it would be important to do the analysis at different time points.

      In the Prot-GFP experiments, the 10-minute time point was chosen because we nearly always observe sperm in the ejaculatory duct of control males. In the experimental males, we never observed sperm in the ejaculatory duct at this time point. Also, no Prot-GFP sperm were observed in the reproductive tract of females mated to experimental males even when mating was allowed to go to completion, while abundant sperm were found in females mated to Prot-GFP controls. Figure 10S1 has been updated to include images of these female reproductive systems. The absence of Prot-GFP sperm in the reproductive tract of females mated to experimental males indicates that sperm transfer in these males is not occurring earlier during copulation than in control males, and that we did not miss it by examining only the ejaculatory duct.

      (2) Discuss what may be the role of the octopamine/glutamate neurons and glutamate transmission in serotonin/glutamate neurons in the male reproductive system, given that they are not required for fertility (at least under the context in which it was tested). It is quite a striking result that deserves some attention. 

      We agree it is a surprising result and have included speculation on the role of glutamate and octopamine in male reproduction in the Discussion section "Potential for adaptation to environment".

      (3) Very important: 

      (a) Figure 3 is present in the Word document but not the PDF. 

      (b) Figure 9S3 is not present 

      (c) In Figure 5 X), the legend does not correspond to the panel.

      All of these corrections have been made. 

      (4) Other suggestions:

      (a) A summary schematic (or several) of the findings would make it an easier read.

      (b) Explain why the ejaculatory bulb was left out of the analysis.

      (c) Explain in the main text some of the tools, such as BONT-C and the conditional vGlut mutation.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review): 

      Summary: 

      In this paper, the authors developed a chemical labeling reagent for P2X7 receptors, called X7-uP. This labeling reagent selectively labels endogenous P2X7 receptors with biotin based on ligand-directed NASA chemistry (Ref. 41). After labeling the endogenous P2X7 receptor with biotin, the receptor can be fluorescently labeled with streptavidin-AlexaFluor647. The authors carefully examined the binding properties and labeling selectivity of X7-uP to P2X7, characterized the labeling site of P2X7 receptors, and demonstrated fluorescence imaging of P2X7 receptors. The data obtained by SDS-PAGE, Western blot, and fluorescence microscopy clearly show that X7-uP labels the P2X7 receptor. Finally, the authors fluorescently labeled the endogenous P2X7 in BV2 cells, which are a murine microglia model, and used dSTORM to reveal a nanoscale P2X7 redistribution mechanism under inflammatory conditions at high resolution. 

      Strengths: 

      X7-uP selectively labels endogenous P2X7 receptors with biotin. Streptavidin-AlexaFluor647 binds to the biotin labeled to the P2X7 receptor, allowing visualization of endogenous P2X7 receptors. 

      We thank the reviewer for their positive comment.

      Weaknesses: 

      Weaknesses & Comments 

      (1) The P2X7 receptor exists in a trimeric form. If it is not a monomer under the conditions of the pull-down assay in Figure 2C, the quantitative values may not be accurate. 

      We thank the reviewer for this comment. As shown in Figure 2C, the band observed on the denaturing SDS-PAGE corresponds to the monomeric form of the P2X7 receptor. While we cannot exclude the presence of non-monomeric species under native conditions, no such higher-order forms are visible in the gel. This observation supports the conclusion that the quantitative values presented are based on the monomeric form and are therefore reliable.

      (2) In Figure 3, GFP fluorescence was observed in the cell. Are all types of P2X receptors really expressed on the cell surface ? 

      We thank the reviewer for this excellent comment, which was also raised by reviewer 2. To address this concern, we performed a commercial cell-surface protein biotinylation assay to assess whether GFP-tagged P2X receptors reach the plasma membrane. As expected, all P2X subtypes except P2X6 were detected at the cell surface in HEK293T cells, thereby validating our confocal fluorescence microscopy assay. These new data are now included in Figure 3 — figure supplement 1.

      (3) The reviewer was not convinced of the advantages of the approach taken in this paper, because the endogenous receptor labeling in this study could also be done using conventional antibody-based labeling methods. 

      We thank the reviewer for raising this important point and would like to highlight several advantages of our approach compared to conventional antibody-based labeling.

      First, commercially available P2X7 antibodies often suffer from poor specificity and are generally not suitable for reliably detecting endogenous P2X7 receptors, as documented in previous studies (e.g., PMID: 16564580 and PMID: 15254086). While recent advances have been made using nanobodies with improved specificity for P2X7 (e.g., PMID: 30074479 and PMID: 38953020), our strategy is distinct and complementary to nanobody-based approaches.

      Second, antibodies rely on non-covalent interactions with the receptor, which can result in dissociation over time. In contrast, our X7-uP probe covalently biotinylates lysine residues on the P2X7 receptor through stable amide bond formation. This covalent labeling ensures that the biotin moiety remains permanently attached, an advantage not afforded by reversible binding strategies.

      Third, by selectively biotinylating P2X7 receptors, our method provides a versatile platform for the chemical attachment of a wide range of probes or functional moieties. Although we did not demonstrate this application in the current study, we believe this modularity represents an additional advantage of our approach.

      We have now revised the discussion to highlight these key advantages, allowing the reader to form their own opinion. We hope this addresses the reviewer’s concerns and clarifies the benefits of our approach.

      (4) Although P2X7 was successfully labeled in this paper, the chemistry itself is not new. A more compelling functional evaluation is needed, such as live trafficking analysis of endogenous P2X7.

      We agree with the reviewer that the underlying chemistry is not novel per se. However, to our knowledge, it has not previously been applied to the P2X7 receptor, and thus constitutes a novel application with specific relevance for studying native P2X7 biology.

      We also appreciate the reviewer’s suggestion regarding live trafficking analysis of endogenous P2X7. While this is indeed a valuable and interesting direction, we believe it lies beyond the scope of the present study, as it would first require demonstrating that the labeling itself does not affect P2X7 function (see below). This important step would necessitate additional experiments, which we consider more appropriate for a follow-up investigation.

      (5) The reviewer has concerns that the use of the large-size streptavidin to label the P2X7 receptor may perturb the dynamics of the receptor.

      We thank the reviewer for raising this important point. Although we did not directly measure receptor dynamics, it is indeed possible that tetrameric streptavidin (tStrept-A 647) could promote P2X7 clustering by cross-linking nearby receptors due to its tetravalency (see also point 7 raised by the reviewer). To address this concern, we performed additional dSTORM experiments using a monomeric form of streptavidin-Alexa 647 (mSA) (see PMID: 26979420). Owing to its reduced size and lack of tetravalency, mSA has been shown to minimize artificial crosslinking of synaptic receptors (PMID: 26979420). A drawback of using mSA, however, is that the monomeric form carries only two fluorophores (estimated degree of labeling, DOL ≈ 2, PMID: 26979420), whereas the tetrameric form, according to the manufacturer’s certificate of analysis (Invitrogen S21374), has an average DOL of three fluorophores per monomer, resulting in a total of ~12 fluorophores per streptavidin.

      We tested three conditions with mSA incubation: (i) control BV2 cells (without X7-uP), (ii) untreated X7-uP-labeled BV2 cells, and (iii) X7-uP-labeled BV2 cells treated with LPS and ATP (using the same concentrations and incubation times described in the manuscript). As shown in Author response image 1, only LPS+ATP treatment induced a clear increase in the mean cluster density compared to quiescent (untreated) BV2 cells. This effect closely matches the results obtained with tStrept-A 647, supporting the conclusion that tetrameric streptavidin does not artificially promote P2X7 clustering. It is also possible that the cellular environment of BV2 microglia differs from the confined architecture of synapses, which may further explain why cross-linking effects are less pronounced in our system.

      The overall fluorescence signal with mSA was about tenfold lower than with tStrept-A 647, as expected from the fluorophore stoichiometry. This lower signal may explain why the values for the untreated condition appeared slightly higher than for the control, although the difference was not statistically significant (P = 0.1455).
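
      The stoichiometry argument can be made explicit with a short calculation. This is a minimal sketch that uses only the degree-of-labeling (DOL) values quoted in point 5. It assumes one streptavidin molecule per biotin and equal binding efficiency for both conjugates, neither of which is stated in the response, and the names are illustrative.

```python
# Degree-of-labeling (DOL) values as quoted in the response to point 5.
DOL_PER_TETRAMER_SUBUNIT = 3  # tStrept-A 647: fluorophores per monomer subunit
SUBUNITS_PER_TETRAMER = 4     # streptavidin is a homotetramer
DOL_MSA = 2                   # monomeric streptavidin (mSA): fluorophores per molecule


def fluorophores_per_probe(dol_per_subunit, subunits):
    """Total fluorophores carried by one streptavidin probe."""
    return dol_per_subunit * subunits


tetramer = fluorophores_per_probe(DOL_PER_TETRAMER_SUBUNIT, SUBUNITS_PER_TETRAMER)
monomer = fluorophores_per_probe(DOL_MSA, 1)

# Brightness ratio expected from label count alone (assumption: 1 probe per biotin).
expected_ratio = tetramer / monomer
```

      Under these assumptions, label count alone predicts roughly a sixfold brighter signal per bound tetramer (12 versus 2 fluorophores), which is of the same order as the approximately tenfold difference reported.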

      We hope these additional experiments adequately address the reviewer’s concerns.

      Author response image 1.

      BV2 labeling with monomeric streptavidin–Alexa 647 (mSA). (A) Bright-field and dSTORM images of BV2 cells labeled with mSA in the presence (untreated and LPS+ATP) or absence (control) of 1 µM X7-uP. Treatment: LPS (1 µg/mL for 24 hours) and ATP (1 mM for 30 minutes). Scale bars, 10 µm. Insets: Magnified dSTORM images. Scale bars, 1 µm. (B) Quantification of the number of localizations (n = 2 independent experiments). Bars represent mean ± s.e.m. One-way ANOVA with Tukey’s multiple comparisons (P values are indicated above the graph).

      (6) It is better to directly label Alexa647 to the P2X7 receptor to avoid functional perturbation of P2X7. 

      Direct labeling of the P2X7 receptor with Alexa647 would require the design and synthesis of a novel probe, which is currently not available. Implementing such a strategy would involve substantial new experimental work that lies beyond the scope of the present study.

      (7) In all imaging experiments, the addition of streptavidin, which acts as a cross-linking agent, may induce P2X7 receptor clustering. This concern would be dispelled if the receptors were labeled with a fluorescent dye instead of biotin and observed. 

      We refer the reviewer to our response in point 5, where we addressed this concern by comparing tetrameric and monomeric streptavidin conjugates. As noted above (see also point 6), directly labeling the receptor with a fluorescent dye would require the development of a new probe, which is outside the scope of the present study.

      (8) There are several mentions of microglia in this paper, even though they are not used. This can lead to misunderstanding for the reader. The author conducted functional analysis of the P2X7 receptor in BV-2 cells, which are a model cell line but not microglia themselves. The text should be reviewed again and corrected to remove the misleading parts that could lead to misunderstanding. e.g. P8. lines 361-364

      First, it combines N-cyanomethyl NASA chemistry with the high-affinity AZ10606120 ligand, enabling rapid labeling in microglia (within 10 min)

      P8. lines 372-373 

      Our results not only confirm P2X7 expression in microglia, as previously reported (6, 26-33), but also reveal its nanoscale localization at the cell surface using dSTORM. 

      We agree with the reviewer’s comment. We have now modified the text, including the title.

      Reviewer #2 (Public review): 

      Summary: 

      In this manuscript, Arnould et al. develop an unbiased, affinity-guided reagent to label P2X7 receptor and use super-resolution imaging to monitor P2X7 redistribution in response to inflammatory signaling.

      Strengths: 

      I think the X7-uP probe that they developed is very useful for visualizing localization of P2X7 receptor. They convincingly show that under inflammatory conditions, there is a reorganization of P2X7 localization into receptor clusters. Moreover, I think they have shown a very clever way to specifically label any receptor of interest. This has broad appeal.

      We thank the reviewer for their positive comment.

      Weaknesses: 

      Overall, the manuscript is novel and interesting. However, I do have some suggestions for improvement. 

      (1) While the authors state that chemical modification of AZ10606120 to produce the X7-UP reagent has "minimal impact" on the inhibition of P2X7, we can see from Figure 2A and 2B that it does not antagonize P2X7 as effectively as the original antagonist. For the sake of completeness and quantitation, I think it would be great if the authors could determine the IC50 for X7-uP and compare it to the IC50 of AZ10606120. 

      We thank the reviewer for this insightful comment. Unfortunately, due to the limited availability of X7-uP, we were not able to establish a complete concentration–response curve to determine its IC50, which would require testing at concentrations >1 µM. Nevertheless, to estimate the effect of the modification, we assessed current inhibition at 300 µM X7-uP and compared it with the reported IC50 of AZ10606120 (10 nM). Under these conditions, both compounds produced a similar level of inhibition, indicating that while the chemical modification reduces potency relative to AZ10606120, X7-uP still functions as an effective probe for P2X7. We have now included these data in Figure 2 and revised the text accordingly.
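
      A single-concentration measurement can be related to an IC50 only under an assumed inhibition model. The following is a minimal sketch, assuming a simple Hill equation with a Hill coefficient of 1 by default. The model choice, function names, and numerical values are illustrative assumptions and are not taken from the manuscript.

```python
import math


def fractional_inhibition(conc, ic50, hill=1.0):
    """Fraction of current inhibited at antagonist concentration `conc` (Hill model)."""
    return conc**hill / (conc**hill + ic50**hill)


def ic50_from_single_point(conc, inhibition, hill=1.0):
    """Invert the Hill model: estimate IC50 from one (concentration, inhibition) pair.

    Valid only for 0 < inhibition < 1 and only if the assumed Hill coefficient holds.
    """
    if not 0.0 < inhibition < 1.0:
        raise ValueError("inhibition must be strictly between 0 and 1")
    return conc * ((1.0 - inhibition) / inhibition) ** (1.0 / hill)


# Illustrative values only (arbitrary concentration units, not experimental data).
half_block = fractional_inhibition(conc=10.0, ic50=10.0)
estimate = ic50_from_single_point(conc=30.0, inhibition=0.75)
```

      By definition, a compound applied at its IC50 blocks half of the current. A single-point estimate such as `estimate` depends strongly on the assumed Hill coefficient, which is why a full concentration-response curve remains the more reliable way to compare potencies.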

      (2) Do the authors know whether modification of the lysines with biotin affects the receptor's affinity for ATP (or ability to be activated by ATP)? What about P2X7 that has been modified with biotin and then labeled with Alexa 647? For the sake of completeness and quantitation, I think it would be great if the authors could determine the EC50 of biotinylated P2X7 for ATP as well as biotinylated and then Alexa 647 labeled P2X7 for ATP and compare these values to the affinity of unmodified WT P2X7 for ATP.

      We thank the reviewer for raising this important point. At present, we have not determined whether modification of lysine residues with biotin, or subsequent labeling with Alexa647, affects the ATP sensitivity or functional properties of P2X7. However, we believe this does not impact the conclusions of the current study, as all functional assays were conducted prior to X7-uP labeling. The labeling is used here as a terminal "snapshot" to visualize the endogenous receptor without interfering with the functional characterization.

      We fully agree that assessing the functional integrity of P2X7 following biotinylation and fluorophore labeling, for example by determining the EC50 for ATP, would be essential for studies involving dynamic or post-labeling functional analyses, such as live trafficking. However, as noted earlier in our response to Reviewer 1 (point 4), these experiments lie beyond the scope of the current study.

      (3) It is a little misleading to color the fluorescence signal from mScarlet green (for example, in Figure 3 and Figure 4). The fluorescence is not at the same wavelength as GFP. In fact, the wavelength (570 nm - 610 nm) for emission is closer to orange/red than to green. I think this color should be changed to differentiate the signal of mScarlet from the GFP signal used for each of the other P2X receptor subtypes. 

      As suggested, we changed the mScarlet color to orange for all relevant figures.

      (4) It is my understanding that P2X6 does not form homotrimers. Thus, I was a little surprised to see that the density and distribution of P2X6-GFP in Figure 3 looks very similar to the density and distribution of the other P2X subtypes. Do the authors have an explanation for this? Are they looking at P2X6 protomers inserted into the plasma membrane? Does the cell line have endogenous P2X receptor subtypes? Is Figure 3 showing heterotrimers with P2X6 receptor? A little explanation might be helpful.

      We thank the reviewer for raising this important point. Indeed, it is well established that P2X6 does not form functional channels, which supports the conclusion that it does not form homotrimeric complexes. Although previous studies have shown that P2X6–GFP expression is generally lower, more diffuse, and not efficiently targeted to the cell surface compared with other P2X subtypes (see PMID: 12077178), the similar fluorescence distribution and density observed in our Figure 3 do not imply that P2X6 forms homotrimers.

      We did not directly assess the presence of endogenous P2X6 in our HEK293T cells; however, according to the Human Protein Atlas, there is no detectable P2X6 RNA expression in HEK293 cells (nTPM = 0), indicating that endogenous P2X6 is not expressed in this cell line. To further investigate surface expression (see also point 2 of reviewer 1), we performed a commercial cell-surface protein biotinylation assay to assess whether GFP-tagged P2X6 reaches the plasma membrane. As expected, P2X6 was not detected at the cell surface in HEK293T cells, whereas GFP-tagged P2X1 to P2X5 were readily detected. These results further support the conclusion that P2X6 does not insert into the plasma membrane as a homotrimer, thereby validating our confocal fluorescence microscopy assay. These new data are now included in Figure 3 — figure supplement 1.

      (5) It is easy to overlook the fact that the antagonist leaves the binding pocket once the biotin has been attached to the lysines. It might be helpful if the authors made this a little more apparent in Figure 1 or in the text describing the NASA chemistry reaction.

      We thank the reviewer for this insightful suggestion. To address this, we have modified Figure 1A and updated the legend.

      Reviewer #3 (Public review): 

      Summary: 

      This manuscript describes the development of a covalent labeling probe (X7-uP) that selectively targets and tags native P2X7 receptors at the plasma membrane of BV2 microglial cells. Using super-resolution imaging (dSTORM), the authors demonstrate that P2X7 receptors form nanoscale clusters upon microglial activation by lipopolysaccharide (LPS) and ATP, correlating with synergistic IL-1β release. These findings advance understanding of P2X7 reorganization during inflammation and provide a generalizable labeling strategy for monitoring endogenous P2X7 in immune cells. 

      Strengths: 

      (1) The authors designed X7-uP by coupling a high-affinity, P2X7-specific antagonist (AZ10606120) with N-cyanomethyl NASA chemistry to achieve site-directed biotinylation. This approach offers high specificity, minimal off-target reactivity, and a straightforward pull-down/imaging readout. 

      (2) The results connect P2X7's nanoscale clustering directly with IL-1β secretion in microglia, reinforcing the role of P2X7 in inflammation. By localizing endogenous P2X7 at single-molecule resolution, the authors reveal how LPS priming and ATP stimulation synergistically reorganize the receptor. 

      (3) The authors systematically validate their method in recombinant systems (HEK293 cells) and in BV2 cells, showing selective inhibition, mutational confirmation of the binding site, and Western blot pulldown experiments.

      We thank the reviewer for their positive comment.

      Weaknesses: 

      (1) While the data strongly indicate that P2X7 clustering contributes to IL-1β release, the manuscript would benefit from additional experiments (if feasible) or discussion on how receptor clustering interfaces with downstream inflammasome assembly. Clarification of whether the P2X7 clusters physically colocalize with known inflammasome proteins would solidify the mechanism. 

      We thank the reviewer for this valuable suggestion. Determining the physical colocalization of P2X7 clusters with known inflammasome components would provide important insight into the molecular partners involved in inflammasome activation. However, we believe that such an investigation would constitute a substantial study on its own and therefore lies beyond the scope of the present work.

      Nevertheless, in response to the reviewer’s suggestion, we have added a short paragraph at the end of the Discussion section addressing potential mechanisms by which P2X7 clustering may contribute to downstream inflammasome activation. We also revised the text to tone down the hypothesis of physical colocalization.

      (2) The authors might expand on the scope of X7-uP in other native cells that endogenously express P2X7 (e.g., macrophages, dendritic cells). Although they mention the possibility, demonstrating the probe's applicability in at least one other primary immune cell type would strengthen its general utility. 

      We thank the reviewer for this valuable suggestion. Again, we believe that such an investigation would constitute a substantial study on its own and therefore lies beyond the scope of the present work.

      (3) The authors do include appropriate negative controls, yet providing additional details (e.g., average single-molecule on-time or blinking characteristics) in supplementary materials could help readers assess cluster calculations. 

      As suggested, we have included additional data showing single-molecule blinking events in untreated and LPS+ATP-treated BV2 cells, along with the corresponding movies. The data are now presented in Figure 5—supplement figure 3A and B and Figure 5—Videos 1 and 2.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors): 

      (1) On line 96, the authors refer to the "ballast" domain of P2X7 receptor but do not cite the original article from which this nomenclature originated (McCarthy et al., 2019, Cell). This article should be cited to give appropriate credit. 

      Done.

      (2) On line 602, the authors state that they use models from PDB 1MK5 and 6U9W to generate the cartoons in Figure 6. The manuscripts from which these PDB files were generated need to be appropriately cited. 

      Done.

      (3) On line 319, the authors say "300 mM BzATP" but I think they mean 300 uM.

      Done. Thank you for catching the typo.

      Reviewer #3 (Recommendations for the authors): 

      Overall, excellent data quality. The paper would benefit from a discussion of the physiological implications of clustering. It would also be helpful to elaborate on the potential mechanisms for clustering: diffusion and/or insertion. Finally, the authors should comment on work by the MacKinnon (PMID: 39739811) and Santana (PMID: 31371391) labs on two distinct models for clustering of proteins.

      As suggested by the reviewer, we have revised the discussion to incorporate their comments. First, we have added the following text:

      “Upon BV2 activation, we observed significant nanoscale reorganization of P2X7. Both LPS and ATP (or BzATP) trigger P2X7 upregulation and clustering, increasing the overall number of surface receptors and the number of receptors per cluster, from one to three (Figure 6). By labeling BV2 cells with X7-uP shortly after IL-1β release, we were able to correlate the nanoscale distribution of P2X7 with the functional state of BV2 cells, consistent with the two-signal, synergistic model for IL-1β secretion observed in microglia and other cell types (Ferrari et al, 1996; Perregaux et al, 2000; Ferrari et al, 2006; Di Virgilio et al, 2017; He et al, 2017; Swanson et al, 2019). In this model, LPS priming leads to intracellular accumulation of pro-IL-1β, while ATP stimulation activates P2X7, triggering NLRP3 inflammasome activation and the subsequent release of mature IL-1β.

      What is the mechanism underlying P2X7 upregulation that leads to an overall increase in surface receptors: does it result from the lateral diffusion of previously masked receptors already present at the plasma membrane, or from the insertion of newly synthesized receptors from intracellular pools in response to LPS and ATP? Although our current data do not distinguish between these possibilities, a recent study suggests that the α1 subunit of the Na<sup>+</sup>/K<sup>+</sup>-ATPase (NKAα1) forms a complex with P2X7 in microglia, including BV2 cells, and that LPS+ATP induces NKAα1 internalization (Huang et al, 2024). This internalization appears to release P2X7 from NKAα1, allowing P2X7 to exist in its free form. We speculate that the internalization of NKAα1 induced by both LPS and ATP exposes previously masked P2X7 sites, including the allosteric AZ10606120 sites, thus making them accessible for X7-uP labeling.”

      Second, we have added a short paragraph at the end of the Discussion section addressing potential mechanisms by which P2X7 clustering may contribute to downstream inflammasome activation:

      “What mechanisms underlie P2X7 clustering in response to inflammatory signals? Several models have been proposed to explain membrane protein clustering, including recruitment to structural scaffolds (Feng & Zhang, 2009), partitioning into membrane domains enriched in specific chemical components such as lipid rafts (Simons & Ikonen, 1997), and self-assembly mechanisms (Sieber et al, 2007). These self-assembly mechanisms include an irreversible stochastic model (Sato et al, 2019) and a more recent reversible self-oligomerization model which gives rise to higher-order transient structures (HOTS) (Zhang et al, 2025). Supported by cryogenic optical localization microscopy with very high resolution (~5 nm), the HOTS model has been observed in various membrane proteins, including ion channels and receptors (Zhang et al, 2025). Furthermore, HOTS are suggested to be dynamically modulated and to play a functional role in cell signaling, potentially influencing both physiological and pathological processes (Zhang & MacKinnon, 2025). While this hypothesis is compelling, our current dSTORM data lack sufficient spatial resolution to confirm whether P2X7 trimers form HOTS via self-oligomerization. Further biophysical and ultra-high-resolution imaging studies are required to test this model in the context of P2X7 clustering.”

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review):

      Summary:

      This manuscript by Pournejati et al. investigates how BK (big potassium) channels and CaV1.3 (a subtype of voltage-gated calcium channels) become functionally coupled by exploring whether their ensembles form early, during synthesis and intracellular trafficking, rather than only after insertion into the plasma membrane. To this end, the authors use the proximity ligation assay (PLA) to assess the formation of ion channel associations in the different compartments (ER, Golgi or PM), single-molecule RNA in situ hybridization (RNAscope), and super-resolution microscopy.

      Strengths:

      The manuscript is well written and addresses an interesting question, combining a range of imaging techniques. The findings are generally well-presented and offer important insights into the spatial organization of ion channel complexes, both in heterologous and endogenous systems.

      Weaknesses:

      The authors have improved their manuscript after revisions, and some previous concerns have been addressed.

      Still, the main concern about this work is that the current experiments do not quantitatively or mechanistically link the ensembles observed intracellularly (in the endoplasmic reticulum (ER) or Golgi) to those found at the plasma membrane (PM). As a result, it is difficult to fully integrate the findings into a coherent model of trafficking. Specifically, the manuscript does not address what proportion of ensembles detected at the PM originated in the ER. Without data on the turnover or half-life of these ensembles at the PM, it remains unclear how many persist through trafficking versus forming de novo at the membrane. The authors report the percentage of PLA-positive ensembles localized to various compartments, but this only reflects the distribution of pre-formed ensembles. What remains unknown is the proportion of total BK and Ca<sub>V</sub>1.3 channels (not just those in ensembles) that are engaged in these complexes within each compartment. Without this, it is difficult to determine whether ensembles form in the ER and are then trafficked to the PM, or if independent ensemble formation also occurs at the membrane. To support the model of intracellular assembly followed by coordinated trafficking, it would be important to quantify the fraction of the total channel population that exists as ensembles in each compartment. A comparable ensemble-to-total ratio across ER and PM would strengthen the argument for directed trafficking of pre-assembled channel complexes.

      We appreciate the reviewer’s thoughtful comment and agree that quantitatively linking intracellular hetero-clusters to those at the plasma membrane is an important and unresolved question. Our current study does not determine what proportion of ensembles at the plasma membrane originated during trafficking. It also does not quantify the fraction of total BK and Ca<sub>V</sub>1.3 channels engaged in these complexes within each compartment. Addressing this requires simultaneous measurement of multiple parameters—total BK channels, total Ca<sub>V</sub>1.3 channels, hetero-cluster formation (via PLA), and compartment identity—in the same cell. This is technically challenging. The antibodies used for channel detection are also required for the proximity ligation assay, which makes these measurements incompatible within a single experiment.

      To overcome these limitations, we are developing new genetically encoded tools to enable real-time tracking of BK and Ca<sub>V</sub>1.3 dynamics in live cells. These approaches will enable us to monitor channel trafficking and the formation of hetero-clusters, as detected by colocalization. Experiments of this kind will provide insight into the origin and turnover of hetero-clusters. While these experiments are beyond the scope of the current study, the findings in our current manuscript provide the first direct evidence that BK and CaV channels can form hetero-clusters intracellularly prior to reaching the plasma membrane. This mechanistic insight reveals a previously unrecognized step in channel organization and lays the foundation for future work aimed at quantifying ensemble-to-total ratios and determining whether coordinated trafficking of pre-assembled complexes occurs.

      This limitation is acknowledged in the discussion section, page 23. It reads: “Our findings highlight the intracellular assembly of BK-Ca<sub>V</sub>1.3 hetero-clusters, though limitations in resolution and organelle-specific analysis prevent precise quantification of the proportion of intracellular complexes that ultimately persist on the cell surface.”

      Reviewer #2 (Public review):

      Summary:

      The co-localization of large conductance calcium- and voltage activated potassium (BK) channels with voltage-gated calcium channels (CaV) at the plasma membrane is important for the functional role of these channels in controlling cell excitability and physiology in a variety of systems.

      An important question in the field is where and how BK and CaV channels assemble as 'ensembles' to allow this coordinated regulation: is this through preassembly early in the biosynthetic pathway, during trafficking to the cell surface, or once channels are integrated into the plasma membrane? These questions also have broader implications for the assembly of other ion channel complexes.

      Using an imaging-based approach, this paper addresses the spatial distribution of BK-CaV ensembles using both overexpression strategies in tsa201 and INS-1 cells and analysis of endogenous channels in INS-1 cells using proximity ligation and super-resolution approaches. In addition, the authors analyse the spatial distribution of mRNAs encoding BK and Cav1.3.

      The key conclusion of the paper, that BK and Ca<sub>V</sub>1.3 are co-localised as ensembles intracellularly in the ER and Golgi, is well supported by the evidence. However, whether they are preferentially co-translated at the ER requires further work. Moreover, whether intracellular pre-assembly of BK-Ca<sub>V</sub>1.3 complexes is the major mechanism for functional complexes at the plasma membrane in these models requires more definitive evidence including both refinement of analysis of current data as well as potentially additional experiments.

      The reviewer raises the question of whether BK and Ca<sub>V</sub>1.3 channels are preferentially co-translated. In fact, we would like to propose that co-translation has not yet been clearly defined for this type of interaction between ion channels. In our current work, we 1) observed the colocalization between BK and Ca<sub>V</sub>1.3 mRNAs and 2) determined that 70% of BK mRNA in active translation also colocalizes with Ca<sub>V</sub>1.3 mRNA. We think these results favor the idea of translational complexes that can underlie the process of co-translation. However, and in total agreement with the Reviewer, the conclusion that the mRNAs for the two ion channels are co-translated would require further experimentation. For instance, mRNA coregulation is one aspect that could help to define co-translation.

      To avoid overinterpretation, we have revised the manuscript to remove references to “co-translation” in the Results section and included the word “potential” when referring to co-translation in the Discussion section. We also clarified the limitations of our evidence in the Discussion that can be found on page 25: “It is important to note that while our data suggest mRNA coordination, additional experiments are required to directly assess co-translation.”

      Strengths & Weaknesses

      (1) Using proximity ligation assays of overexpressed BK and CaV1.3 in tsa201 and INS-1 cells, the authors provide strong evidence that BK and CaV can exist as ensembles (i.e. channels within 40 nm) at both the plasma membrane and intracellular membranes, including ER and Golgi. They also provide evidence for endogenous ensemble assembly at the Golgi in INS-1 cells, and it would have been useful to determine if endogenous complexes are also observed in the ER of INS-1 cells. There are some useful controls, but the specificity of ensemble formation would be better determined using other transmembrane proteins rather than peripheral proteins (e.g. Golgi 58K).

      We thank the reviewer for their thoughtful feedback and for recognizing the strength of our proximity ligation assay data supporting BK–Ca<sub>V</sub>1.3 hetero-cluster formation at both the plasma membrane and intracellular compartments. As for specificity controls, we appreciate the suggestion to use transmembrane markers. To strengthen our conclusion, we have performed an additional experiment comparing the number of PLA puncta formed by the interaction of Ca<sub>V</sub>1.3 and BK channels with the number of PLA puncta formed by the interaction of Ca<sub>V</sub>1.3 channels and ryanodine receptors in INS-1 cells. As shown in the figure below, the number of interactions between Ca<sub>V</sub>1.3 and BK channels is significantly higher than that between Ca<sub>V</sub>1.3 and RyR<sub>2</sub>. Of note, RyR<sub>2</sub> is an ER-resident protein. These results provide additional evidence of endogenous complex formation in INS-1 cells. We have added this figure as a supplement.

      (2) Ensemble assembly was also analysed using super-resolution (dSTORM) imaging in INS-1 cells. In these cells only 7.5% of BK and CaV particles (endogenous?) co-localise, which was only marginally above chance based on scrambled images. More detailed quantification and validation of potential 'ensembles' needs to be made, for example by exploring nearest neighbour characteristics (but see point 4 below) to define the proportion of ensembles versus clusters of BK or Cav1.3 channels alone etc. For example, it is mentioned that a distribution of distances between BK and Cav is seen but data are not shown.

      We thank the reviewer for this comment. To address the request for more detailed quantification and validation of ensembles, we performed additional analyses:

      Proportion of ensembles vs isolated clusters: We quantified clusters within 200 nm and found that 37 ± 3% of BK clusters are near one or more CaV1.3 clusters, whereas 15 ± 2% of CaV1.3 clusters are near BK clusters (Figure 8–Supplementary 1A).

      Distance distribution: As shown in Figure 8–Supplementary 1B, the nearest-neighbor distance distribution for BK-to-CaV1.3 in INS-1 cells (magenta) is shifted toward shorter distances compared to randomized controls (gray), supporting preferential localization of BK–CaV1.3 hetero-clusters.

      Together, these analyses confirm that BK–CaV1.3 ensembles occur more frequently than expected by chance and exhibit an asymmetric organization favoring BK proximity to CaV1.3 in INS-1 cells. We have included these data and figures in the revised manuscript, as well as a description in the Results section.
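
      The proximity analysis described above can be illustrated with a minimal sketch. It assumes that cluster centroids are available as (x, y) coordinates in nm; the function names and the toy coordinates are hypothetical and invented purely for illustration, and this is not the analysis pipeline used in the study. The sketch shows why the fraction of "near" clusters must be computed in both directions and how a spatially randomized control can be generated for comparison.

```python
import math
import random

def nearest_neighbor_distances(sources, targets):
    """For each source point, return the distance to its closest target point."""
    return [min(math.dist(s, t) for t in targets) for s in sources]

def fraction_within(sources, targets, radius):
    """Fraction of source points that have at least one target within `radius`."""
    distances = nearest_neighbor_distances(sources, targets)
    return sum(d <= radius for d in distances) / len(distances)

def randomized_control(n_points, width, height, rng):
    """Place n_points uniformly at random in a width x height field (a simple chance model)."""
    return [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n_points)]

# Toy centroids in nm (hypothetical, NOT data from the study).
bk = [(100.0, 100.0), (500.0, 500.0), (900.0, 100.0), (1500.0, 1500.0)]
cav = [(150.0, 100.0), (500.0, 650.0), (1900.0, 1900.0)]

RADIUS_NM = 200.0  # proximity threshold quoted in the response

# The two directions can differ because the two populations differ in number and layout.
frac_bk_near_cav = fraction_within(bk, cav, RADIUS_NM)
frac_cav_near_bk = fraction_within(cav, bk, RADIUS_NM)

# Chance-level comparison: randomize BK positions and recompute the same fraction.
rng = random.Random(0)
bk_random = randomized_control(len(bk), 2000.0, 2000.0, rng)
frac_random = fraction_within(bk_random, cav, RADIUS_NM)

print(frac_bk_near_cav, frac_cav_near_bk, frac_random)
```

      In practice the randomization would be repeated many times, and possibly constrained to the cell footprint, to obtain a distribution of chance-level values rather than a single estimate.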

      (3) The evidence that the intracellular ensemble formation is in large part driven by co-translation, based on co-localisation of mRNAs using RNAscope, requires additional critical controls and analysis. The authors now include data of co-localised BK protein that is suggestive but does not show co-translation. Secondly, while they have improved the description of some controls, mRNA co-localisation needs to be measured in both directions (e.g. BK to SCN9A as well as SCN9A to BK), especially if the mRNAs are expressed at very different levels. The relative expression levels need to be clearly defined in the paper. The authors also use a randomized image of BK mRNA to show specificity of co-localisation with Cav1.3 mRNA; however, the mRNA distribution would not be expected to be random across the cell but constrained by ER morphology if co-translated, so using ER labelling as a mask would be useful?

      We thank the reviewer for these constructive suggestions. We measured mRNA colocalization in both directions as recommended. As shown in the figure below, colocalization between KCNMA1 and SCN9A transcripts was comparable in both directions, with no statistically significant difference, supporting the specificity of the observed associations. We decided not to add this to the original figure to keep the figure simple. 

      We agree that co-localization of BK protein with BK mRNA is not conclusive evidence of co-translation, and we do not intend to mislead readers in our conclusion. Consequently, we were careful to avoid the term co-translation in the Results section and added the word “potential” when referring to co-translation in the Discussion section. We added a sentence in the Discussion to caution our interpretation: “It is important to note that while our data suggest mRNA coordination, additional experiments are required to directly assess co-translation.”

      Author response image 1.

      (4) The authors attempt to define if plasma membrane assemblies of BK and CaV occur soon after synthesis. However, because the expression of BK and CaV occur at different times after transient transfection of plasmids more definitive experiments are required. For example, using inducible constructs to allow precise and synchronised timing of transcription. This would also provide critical evidence that co-assembly occurs very early in synthesis pathways - ie detecting complexes at ER before any complexes 

      We appreciate the reviewer’s insightful suggestion regarding the use of inducible constructs to synchronize transcription timing. This is an excellent approach and would allow direct testing of whether co-assembly occurs early in the synthesis pathway, including detection of complexes at the ER prior to plasma membrane localization. These experiments are beyond the scope of the present work but represent an important direction for future studies.

      We have added the following sentence to the Discussion section (page 24) to highlight this idea. “Future experiments using inducible constructs to precisely control transcription timing will enable more precise quantification of heterocluster formation in the ER compartment prior to plasma membrane insertion and reduce the variability introduced by differences in expression timing after plasmid transfection.” 

      (5) While the authors have improved the definition of hetero-clusters etc., it is still not clear, in the super-resolution analysis, how they separate a BK tetramer from a cluster of BK tetramers with the monoclonal antibody employed, i.e. each BK channel will have 4 binding sites (4 subunits in a tetramer) whereas Cav1.3 has one binding site per channel. Thus, how do the authors discriminate between a single BK tetramer (molecular cluster) with potentially 4 antibodies bound and a cluster of 4 independent BK channels?

      We appreciate the reviewer’s thoughtful comment regarding the interpretation of super-resolution data. We agree that distinguishing a single BK tetramer from a cluster of multiple BK channels is challenging when using an antibody that can bind up to four sites per channel. To clarify, our analysis does not attempt to resolve individual subunits within a tetramer; rather, it focuses on the nanoscale spatial proximity of BK and Ca<sub>V</sub>1.3 signals.

      We want to note that this limitation applies only to the super-resolution maps in Figures 8C and 9D and does not affect Airyscan-based analyses or measurements of BK–Ca<sub>V</sub>1.3 proximity.

      To address how we might distinguish between a single BK tetramer and a cluster of multiple BK channels, we considered two contrasting scenarios. In the first case, we assume that all four α-subunits within a tetramer are labeled. Based on cryo-EM structures, a BK tetramer measures approximately 13 nm × 13 nm (≈169 nm²). Adding two antibody layers (primary and secondary) would increase the footprint by ~14 nm in each direction, resulting in an estimated area of ~41 nm × 41 nm (≈1681 nm²). Under this assumption, particles smaller than ~1681 nm² would likely represent individual tetramers, whereas larger particles would correspond to clusters of multiple tetramers.
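
      The footprint estimate in this first scenario can be reproduced with a short worked calculation. The sketch below uses only the dimensions stated in the paragraph above; the helper names are hypothetical and the square-footprint geometry is the simplifying assumption already made in the text.

```python
# Dimensions taken from the text above.
TETRAMER_SIDE_NM = 13.0   # approximate side length of a BK tetramer
ANTIBODY_LAYER_NM = 14.0  # footprint added on each side by primary + secondary antibodies

def labeled_footprint(side_nm, layer_nm):
    """Return (side length, area) of a square footprint padded by layer_nm on both sides."""
    labeled_side = side_nm + 2 * layer_nm
    return labeled_side, labeled_side ** 2

def is_single_tetramer_candidate(particle_area_nm2, threshold_nm2):
    """Hypothetical helper: particles below the threshold are compatible with one labeled tetramer."""
    return particle_area_nm2 < threshold_nm2

bare_area = TETRAMER_SIDE_NM ** 2
side, area = labeled_footprint(TETRAMER_SIDE_NM, ANTIBODY_LAYER_NM)
print(bare_area, side, area)  # 169.0 41.0 1681.0
```

      Under these assumptions the threshold follows directly from 13 + 2 × 14 = 41 nm per side, so any change in the assumed antibody layer thickness shifts the cutoff quadratically.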

      In the second scenario, we propose that steric constraints at the S9–S10 segment, where the antibody binds, limit labeling to a single antibody per tetramer. If true, the localization precision would approximate 14 nm × 14 nm—the combined size of the antibody complex and the channel—close to the resolution limit of the microscope. To test this, we performed a control experiment using two antibodies targeting the BK C-terminal domain, raised in different species and labeled with distinct fluorophores. Super-resolution imaging revealed that only ~12% of particles were colocalized, suggesting that most channels bind a single antibody.

      If multiple antibodies could bind each tetramer, we would expect much greater colocalization.

      Although these data are not included in the manuscript, we have added the following clarification to the Results section (page 19): “It is important to note that this technique does not allow us to distinguish between labeling of four BK α-subunits within a tetramer and labeling of multiple BK channel clusters. Hence, particles smaller than ~1680 nm² may represent either a single tetramer or a cluster. This limitation applies to Figures 8C and 9D and does not affect measurements of BK–Ca<sub>V</sub>1.3 proximity.”

      Author response image 2.

      (6) The post-hoc tests used for one-way ANOVA and the ANOVA statistics need to be defined throughout.

      We thank the reviewer for highlighting the need for clarity regarding our statistical analyses. We have now specified the post-hoc tests used for all one-way ANOVA and ANOVA comparisons throughout the manuscript, and updated figure legends.

      Reviewer #3 (Public review):

      Summary:

      The authors present a clearly written and beautifully presented piece of work demonstrating clear evidence to support the idea that BK channels and Cav1.3 channels can co-assemble prior to their insertion in the plasma membrane.

      Strengths:

      The experimental records shown back up their hypotheses and the authors are to be congratulated for the large number of control experiments shown in the ms.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The authors have sufficiently addressed the specific points previously raised and the manuscript has improved clarity in those aspects. My main concern, which still remains, is stated in the public review.

      Reviewer #3 (Recommendations for the authors):

      I am content that the authors have attempted to fully address my previous criticisms.

      I have only three suggestions

      (1) I think the word Homo-clusters at the bottom right of Figure 1 is erroneously included.

      We thank the reviewer for bringing this to our attention. The figure has been corrected accordingly.

      (2) The authors should, for completeness, refer to the beta, gamma and LINGO subunit families in the Introduction and include appropriate references:

      Knaus, H. G., Folander, K., Garcia-Calvo, M., Garcia, M. L., Kaczorowski, G. J., Smith, M., & Swanson, R. (1994). Primary sequence and immunological characterization of beta-subunit of high conductance Ca2+-activated K+ channel from smooth muscle. The Journal of Biological Chemistry, 269(25), 17274-17278.

      Brenner, R., Jegla, T. J., Wickenden, A., Liu, Y., & Aldrich, R. W. (2000a). Cloning and functional characterization of novel large conductance calcium-activated potassium channel beta subunits, hKCNMB3 and hKCNMB4. The Journal of Biological Chemistry, 275(9), 6453-6461.

      Yan, J & R.W. Aldrich. (2010) LRRC26 auxiliary protein allows BK channel activation at resting voltage without calcium. Nature. 466(7305):513-516

      Yan, J & R.W. Aldrich. (2012) BK potassium channel modulation by leucine-rich repeat-containing proteins. Proceedings of the National Academy of Sciences 109(20):7917-22

      Dudem, S, Large RJ, Kulkarni S, McClafferty H, Tikhonova IG, Sergeant, GP, Thornbury, KD, Shipston, MJ, Perrino BA & Hollywood MA (2020). LINGO1 is a novel regulatory subunit of large conductance, Ca2+-activated potassium channels. Proceedings of the National Academy of Sciences 117 (4) 2194-2200

      Dudem, S., Boon, P. X., Mullins, N., McClafferty, H., Shipston, M. J., Wilkinson, R. D. A., Lobb, I., Sergeant, G. P., Thornbury, K. D., Tikhonova, I. G., & Hollywood, M. A. (2023). Oxidation modulates LINGO2-induced inactivation of large conductance, Ca2+-activated potassium channels. The Journal of Biological Chemistry, 299 (3) 102975.

      We agree with the reviewer’s suggestion and have revised the Introduction to include references to the beta, gamma, and LINGO subunit families. Appropriate citations have been added to ensure completeness and contextual relevance.

      Additionally, BK channels are modulated by auxiliary subunits, which fine-tune BK channel gating properties to adapt to different physiological conditions. The β, γ, and LINGO1 subunits each contribute distinct structural and regulatory features: β-subunits modulate Ca²⁺ sensitivity and can induce inactivation; γ-subunits shift voltage-dependent activation to more negative potentials; and LINGO1 reduces surface expression and promotes rapid inactivation (18-24). These interactions ensure precise control over channel activity, allowing BK channels to integrate voltage and calcium signals dynamically in various cell types.

      (3) I think it may be more appropriate to include the sentence "The probes against the mRNAs of interest and tested in this work were designed by Advanced Cell Diagnostics." (P16, right hand column, L12-14) in the appropriate section of the Methods, rather than in Results.

      We thank the reviewer for this helpful suggestion. In response, we have relocated the sentence to the appropriate section of the Methods, where it now appears with relevant context.

    1. Author response:

      The following is the authors’ response to the current reviews.

      We would like to thank the reviewers for their efforts and feedback on our preprint. We have elected to rework the manuscript for publication in a different journal. In this process we will alter many of the approaches and re-evaluate the conclusions. As a result, many of the points raised by the reviewers will no longer be relevant and therefore do not require a response. Again, we thank the reviewers for their time and helpful feedback.


      The following is the authors’ response to the original reviews.

      eLife Assessment:

      The authors present a potentially useful approach of broad interest arguing that anterior cingulate cortex (ACC) tracks option values in decisions involving delayed rewards. The authors introduce the idea of a resource-based cognitive effort signal in ACC ensembles and link ACC theta oscillations to a resistance-based strategy. The evidence supporting these new ideas is incomplete and would benefit from additional detail and more rigorous analyses and computational methods.

      We are extremely grateful for the several excellent comments of the reviewers. To address these concerns, we have completely reworked the manuscript, adding more rigorous approaches in each phase of the analysis and computational model. We realize that it has taken some time to prepare the revision. However, given the comments of the reviewers, we felt it necessary to thoroughly rework the paper based on their input. Here is a (non-exhaustive) overview of the major changes we made:

      We have developed a way to more adequately capture the heterogeneity in the behavior.

      We have completely reworked the RL model.

      We have added additional approaches and rigor to the analysis of the value-tracking signal.

      Reviewer #1 (Public Review):

      Summary:

      Young (2.5 mo [adolescent]) rats were tasked to either press one lever for immediate reward or another for delayed reward. 

      Please note that the rats were > 4 months old at the time of training and testing.

      The task had a complex structure in which (1) the number of pellets provided on the immediate reward lever changed as a function of the decisions made, (2) rats were prevented from pressing the same lever three times in a row. Importantly, this task is very different from most intertemporal choice tasks which adjust delay (to the delayed lever), whereas this task held the delay constant and adjusted the number of 20 mg sucrose pellets provided on the immediate value lever.

      Several studies parametrically vary the immediate lever (PMID: 39119916, 31654652, 28000083, 26779747, 12270518, 19389183). While most versions of the task will yield qualitatively similar estimates of discounting, the adjusting-amount version is preferred as it provides the most consistent estimates (PMID: 22445576). More specifically, this version of the task avoids the contrast effects that result from changing the delay during the session (PMID: 23963529, 24780379, 19730365, 35661751), which complicate value estimates.
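
      The task structure described above (an immediate payout that falls after an immediate press and rises otherwise, plus a block on pressing the same lever three times in a row) can be sketched as a toy replay. This is only an illustration: the step size, payout bounds, starting payout, and the rule that a blocked press becomes a forced press on the other lever are assumptions, not details taken from the manuscript.

```python
# Toy replay of an adjusting-amount session. Step size, bounds, starting
# payout, and the forced-trial rule are assumptions made for illustration.

def next_ival(ival, chose_immediate, step=1, min_pellets=0, max_pellets=6):
    """Lower the immediate payout after an immediate press, raise it otherwise."""
    change = -step if chose_immediate else step
    return max(min_pellets, min(max_pellets, ival + change))


def blocked_lever(history, max_repeats=2):
    """Return the lever that may not be pressed next, or None if both are free."""
    recent = history[-max_repeats:]
    if len(recent) == max_repeats and len(set(recent)) == 1:
        return recent[0]
    return None


def replay_session(intended, start_ival=3):
    """Replay intended choices ('I' or 'D'). A blocked press becomes a forced
    press on the other lever (assumed rule). Returns the levers pressed and
    the immediate payout after each trial."""
    ival, levers, ivals = start_ival, [], []
    for wanted in intended:
        lever = wanted
        if wanted == blocked_lever(levers):
            lever = "D" if wanted == "I" else "I"
        ival = next_ival(ival, chose_immediate=(lever == "I"))
        levers.append(lever)
        ivals.append(ival)
    return levers, ivals


levers, ivals = replay_session(["I", "I", "I", "D"])
print(levers, ivals)
```

      In this toy run the third intended immediate press is blocked and replaced by a delayed press, so the immediate payout starts to climb again.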

      Analyses are based on separating sessions into groups, but group membership includes arbitrary requirements and many sessions have been dropped from the analyses. 

      We have updated this approach and now provide a more comprehensive assessment of the behavior. The updated approach applies a hierarchical clustering model to the behavior in each session. This was applied at each delay to separate animals that prefer the immediate option more/less. This results in 4 statistically dissociable groups (4LO, 4HI, 8LO, 8HI) and includes all sessions. Please see Figure 1. 
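
      As a rough illustration of how sessions at each delay could be split into a low and a high immediate-preference group, the sketch below runs a small agglomerative (average-linkage) clustering on one number per session. The choice of feature (fraction of immediate choices), the linkage rule, and the toy values are all assumptions; the response does not specify which hierarchical clustering model or features were actually used.

```python
# Toy sketch: split sessions at each delay into "LO" and "HI" groups by
# agglomerative clustering. Feature, linkage, and values are hypothetical.

def average_linkage(a, b):
    """Mean absolute distance over all pairs drawn from clusters a and b."""
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))


def agglomerate(values, n_clusters=2):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[v] for v in values]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # Return clusters ordered from lowest to highest mean value.
    return sorted(clusters, key=lambda c: sum(c) / len(c))


def label_sessions(frac_immediate_by_delay):
    """Map each delay to a LO and a HI immediate-preference group."""
    groups = {}
    for delay, fractions in frac_immediate_by_delay.items():
        lo, hi = agglomerate(fractions, n_clusters=2)
        groups[f"{delay}LO"] = sorted(lo)
        groups[f"{delay}HI"] = sorted(hi)
    return groups


# Hypothetical per-session fractions of immediate choices at 4 s and 8 s.
toy = {4: [0.10, 0.15, 0.20, 0.55, 0.60], 8: [0.30, 0.35, 0.70, 0.75, 0.80]}
groups = label_sessions(toy)
print(groups)
```

      Because every session receives a label, nothing is dropped, which is the property the revised grouping is said to have.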

      Computational modeling is based on an overly simple reinforcement learning model, as evidenced by fit parameters pegging to the extremes. 

      We have completely reworked the simulations in the revision. In the updated RL model we carefully add parameters to determine which are necessary to explain the experimental data. We feel that it is simplified yet more descriptive. Please see Figure 2 and associated text. 

      The neural analysis is overly complex and does not contain the necessary statistics to assess the validity of their claims.

      We have dramatically streamlined the spike train analysis approach and added several statistical tests to ensure the rigor of our results. Please see Figures 4,5,6 and associated text. 

      Strengths:

      The task is interesting.

      Thank you for the positive comment.

      Weaknesses:

      Behavior:

      The basic behavioral results from this task are not presented. For example, "each recording session consisted of 40 choice trials or 45 minutes". What was the distribution of choices over sessions? Did that change between rats? Did that change between delays? Were there any sequence effects? (I recommend looking at reaction times.) Were there any effects of pressing a lever twice vs after a forced trial? 

      Please see the updated statistics and panels in Figures 1 and 2. We believe these address this valid concern.  

      This task has a very complicated sequential structure that I think I would be hard pressed to follow if I were performing this task. 

      Human tasks implement a similar task structure (PMID: 26779747). Please note the response above that outlines the benefits of using this task.

      Before diving into the complex analyses assuming reinforcement learning paradigms or cognitive control, I would have liked to have understood the basic behaviors the rats were taking. For example, what was the typical rate of lever pressing? If the rats are pressing 40 times in 45 minutes, does waiting 8s make a large difference?

      Thank you for this suggestion. Our additions to Figure 1 are intended to better explain and quantify the behavior of the animals. Note that this task is designed to hold the rate of reinforcement constant no matter the choices of the animals. Our analysis supports the long-held view in the literature that rats do not like waiting for rewards, even at small delays. Going from the 4 to the 8 sec delay results in significantly more immediate choices, indicating that the rats will forgo waiting 8 sec for a larger reinforcer and take a smaller reinforcer at 4 sec.

      For that matter, the reaction time from lever appearance to lever pressing would be very interesting (and important). Are they making a choice as soon as the levers appear? Are they leaning towards the delay side, but then give in and choose the immediate lever? What are the reaction time hazard distributions?

      This is an excellent suggestion, and we have added a brief analysis of reaction times (please see the section entitled “4 behavioral groups are observed across all sessions” in the Results). Please note that an analysis of the reaction times has been presented in a prior analysis of this data set (White et al., 2024). In addition, an analysis of reaction times in this task was performed in Linsenbardt et al. (2017). In short, animals tend to choose within 1 second of the lever appearing. In addition, our prior work shows that responses on the immediate lever tend to be slower, which we viewed as evidence of increased deliberation requirements (possibly required to integrate value signals).

      It is not clear that the animals on this task were actually using cognitive control strategies on this task. One cannot assume from the task that cognitive control is key. The authors only consider a very limited number of potential behaviors (an overly simple RL model). On this task, there are a lot of potential behavioral strategies: "win-stay/lose-shift", "perseveration", "alternation", even "random choices" should be considered.

      The strategies the Reviewer mentioned are descriptors of the actual choices the rats made. For example, perseveration means the rat is choosing one of the levers at an excessively high rate whereas alternation means it is choosing the two levers more or less equally, independent of payouts. But the question we are interested in is why? We are arguing that the type of cognitive control determines the choice behavior, but cognitive control is an internal variable that guides behavior, rather than simply a descriptor of the behavior. For example, the animal opts to perseverate on the delayed lever because the cognitive control required to track ival is too high. We then searched the neural data for signatures of the two types of cognitive control.

      The delay lever was assigned to the "non-preferred side". How did side bias affect the decisions made?

      The side bias clearly does not impact performance as the animals prefer the delay lever at shorter delays, which works against this bias.  

      The analyses based on "group" are unjustified. The authors compare the proportion of delayed to immediate lever press choices on the non-forced trials and then did k-means clustering on this distribution. But the distribution itself was not shown, so it is unclear whether the "groups" were actually different. They used k=3, but do not describe how this arbitrary number was chosen. (Is 3 the optimal number of clusters to describe this distribution?) Moreover, they removed three group 1 sessions with an 8s delay and two group 2 sessions with a 4s delay, making all the group 1 sessions 4s delay sessions and all group 2 sessions 8s delay sessions. They then ignore group 3 completely. These analyses seem arbitrary and unnecessarily complex. I think they need to analyze the data by delay. (How do rats handle 4s delay sessions? How do rats handle 6s delay sessions? How do rats handle 8s delay sessions?). If they decide to analyze the data by strategy, then they should identify specific strategies, model those strategies, and do model comparison to identify the best explanatory strategy. Importantly, the groups were session-based, not rat based, suggesting that rats used different strategies based on the delay to the delayed lever.

      We have completely reworked our approach for capturing the heterogeneity in behavior. We have taken care to show more of the behavioral statistics that have gone into identifying each of the groups. All sessions are included in this analysis. As the reviewer suggests, we used the statistics from each of the behavioral groups to inform the RL model that explores neural signals that underlie decisions in this task. We strongly disagree that groups should be rat-based rather than session-based, as the behavior of the animal can, and does, change from day to day. This is important to consider when analyzing the neural data, as rat-based groupings would ignore this potential source of variance.

      The reinforcement learning model used was overly simple. In particular, the RL model assumes that the subjects understand the task structure, but we know that even humans have trouble following complex task structures. Moreover, we know that rodent decision-making depends on much more complex strategies (model-based decisions, multi-state decisions, rate-based decisions, etc). There are lots of other ways to encode these decision variables, such as softmax with an inverse temperature rather than epsilon-greedy. The RL model was stated as a given and not justified. As one critical example, the RL model fit to the data assumed a constant exponential discounting function, but it is well-established that all animals, including rodents, use hyperbolic discounting in intertemporal choice tasks. Presumably this changes dramatically the effect of 4s and 8s. As evidence that the RL model is incomplete, the parameters found for the two groups were extreme. (Alpha=1 implies no history and only reacting to the most recent event. Epsilon=0.4 in an epsilon-greedy algorithm is a 40% chance of responding randomly.)

      While we agree that the approach was not fully justified, we do not agree that it was invalid. Simply stated, a softmax approach gives the best fit to the choice behavior, whereas our epsilon-greedy approach attempted to reproduce the choice behavior using a naïve agent that progressively learns the values of the two levers on a choice-by-choice basis. Nevertheless, we certainly appreciate that important insights can be gained by fitting a model to the data as suggested. We feel that the new modeling approach we have now implemented is optimal for the present purposes and it replaces the one used in the original manuscript.
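
      For readers unfamiliar with the terms in this exchange, the sketch below contrasts exponential with hyperbolic discounting and an epsilon-greedy with a softmax choice rule. These are textbook forms only, not the authors' original or revised model; the delayed amount (6 pellets), the discount rate k, and the inverse temperature beta are hypothetical values chosen for illustration.

```python
import math

# Textbook forms only; the amount, k, and beta used below are hypothetical.

def exponential_value(amount, delay_s, k):
    """Exponential discounting: V = A * exp(-k * D)."""
    return amount * math.exp(-k * delay_s)


def hyperbolic_value(amount, delay_s, k):
    """Hyperbolic discounting: V = A / (1 + k * D)."""
    return amount / (1.0 + k * delay_s)


def epsilon_greedy_probs(values, epsilon):
    """Choose uniformly at random with probability epsilon, else pick the best."""
    n = len(values)
    best = max(range(n), key=lambda i: values[i])
    return [epsilon / n + (1.0 - epsilon if i == best else 0.0) for i in range(n)]


def softmax_probs(values, beta):
    """Choice probabilities proportional to exp(beta * value)."""
    top = max(values)  # subtract the maximum for numerical stability
    weights = [math.exp(beta * (v - top)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


for delay_s in (4, 8):
    print(delay_s,
          exponential_value(6, delay_s, 0.25),
          hyperbolic_value(6, delay_s, 0.25))
print(epsilon_greedy_probs([3.0, 2.0], 0.4))
print(softmax_probs([3.0, 2.0], 1.0))
```

      With two options, epsilon = 0.4 leaves the higher-valued lever chosen on 80% of trials regardless of how large the value difference is, whereas softmax choice probabilities scale with that difference.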

      The authors do add a "dbias" (which is a preference for the delayed lever) term to the RL model, but note that it has to be maximal in the 4s condition to reproduce group 2 behavior, which means they are not doing reinforcement learning anymore, just choosing the delayed lever.

      The dbias term was dropped in the new model implementation.

      Neurophysiology:

      The neurophysiology figures are unclear and mostly uninterpretable; they do not show variability, statistics or conclusive results.

      While the reviewer is justified in criticizing the clarity of the figures, the statement that “they do not show variability, statistics or conclusive results” is not correct. Each of the figures presented in the first draft of the manuscript, except Figure 3, is accompanied by statistics and measures of variability. Nonetheless, we have updated each of the neurophysiology analyses. We hope that the reviewer will find our updates more rigorous and thorough.

      As with the behavior, I would have liked to have seen more traditional neurophysiological analyses first. What do the cells respond to? How do the manifolds change aligned to the lever presses? Are those different between lever presses?

      We have added several figures that plot the mean +/- SEM of the neural activity (see Figures 4 and 5). Hopefully this provides a more intuitive picture of the changes in neural activity throughout the task.  

      Are there changes in cellular information (both at the individual and ensemble level) over time in the session? 

      We provide several analyses of how firing rate changes in relation to ival over time and across trials in the session. In addition, we describe how these signals change in each of the behavioral groups.

      How do cellular responses differ during that delay while both levers are out, but the rats are not choosing the immediate lever?

      We were somewhat unclear about this suggestion, as the delay follows the lever press. In addition, there is no delay after immediate presses.

      Figure 3, for example, claims that some of the principal components tracked the number of pellets on the immediate lever ("ival"), but they are just two curves. No statistics, controls, or justification for this is shown. BTW, on Figure 3, what is the event at 200s?

      This comment is no longer relevant based on the changes we’ve made to the manuscript. 

      I'm confused. On Figure 4, the number of trials seems to go up to 50, but in the methods, they say that rats received 40 trials or 45 minutes of experience.

      This comment is no longer relevant based on the changes we’ve made to the manuscript. 

      At the end of page 14, the authors state that the strength of the correlation did not differ by group and that this was "predicted" by the RL modeling, but this statement is nonsensical, given that the RL modeling did not fit the data well, depended on extreme values. Moreover, this claim is dependent on "not statistically detectable", which is, of course, not interpretable as "not different".

      This comment is no longer relevant based on the changes we’ve made to the manuscript. 

      There is an interesting result on page 16 that the increases in theta power were observed before a delayed lever press but not an immediate lever press, and then that the theta power declined after an immediate lever press. 

      Thank you for the positive comment. 

      These data are separated by session group (again group 1 is a subset of the 4s sessions, group 2 is a subset of the 8s sessions, and group 3 is ignored). I would much rather see these data analyzed by delay itself or by some sort of strategy fit across delays.

      Thank you for the excellent suggestion. Our new group assignments take delay into account. 

      That being said, I don't see how this description shows up in Figure 6. What does Figure 6 look like if you just separate the sessions by delay?

      We are unclear what the reviewer means by “this description”.  

      Discussion:

      Finally, it is unclear to what extent this task actually gets at the questions originally laid out in the goals and returned to in the discussion. The idea of cognitive effort is interesting, but there is no data presented that this task is cognitive at all. The idea of a resourced cognitive effort and a resistance cognitive effort is interesting, but presumably the way one overcomes resistance is through resource-limited components, so it is unclear that these two cognitive effort strategies are different.

      The basis for the reviewer’s assertion that “the way one overcomes resistance is through resource-limited components” is not clear. In the revised version, we have taken greater care to outline how each type of effort signal facilitates performance of the task and to articulate these possibilities in our stochastic and RL models. We view the strong evidence for ival tracking presented herein as a critical component of resource-based cognitive effort.

      The authors state that "ival-tracking" (neurons and ensembles that presumably track the number of pellets being delivered on the immediate lever - a fancy name for "expectations") "taps into a resourced-based form of cognitive effort", but no evidence is actually provided that keeping track of the expectation of reward on the immediate lever depends on attention or mnemonic resources. They also state that a "dLP-biased strategy" (waiting out the delay) is a "resistance-based form of cognitive effort" but no evidence is made that going to the delayed side takes effort.

      We challenge the reviewer’s assertion that ival tracking is a “fancy name for expectations”. We make no claim about the prospective or retrospective nature of the signal. Clearly, expectations should be prospective and therefore different from ival tracking. Regarding the resistance signal: First, animals avoid the delay lever more often at the 8 sec delay (Figure 1). We have shown that increasing the delay systematically biases responses AWAY from the delay (Linsenbardt et al., 2017). This is consistent with a well-developed literature showing that rats and mice do not like waiting for delayed reinforcers. We contend that enduring something you don’t like takes effort.

      The authors talk about theta synchrony, but never actually measure theta synchrony, particularly across structures such as amygdala or ventral hippocampus. The authors try to connect this to "the unpleasantness of the delay", but provide no measures of pleasantness or unpleasantness. They have no evidence that waiting out an 8s delay is unpleasant.

      We have added spike-field coherence to better connect with the literature on synchrony. Note that we never refer to our results as “synchrony”. However, we would be remiss not to address the growing literature on theta synchrony in effort allocation. There is a well-developed literature showing that rats and mice do not like waiting for delayed reinforcers. If waiting out the delay were not unpleasant, then why would the animals forgo larger rewards to avoid it?

      The authors hypothesize that the "ival-tracking signal" (the expectation of number of pellets on the immediate lever) "could simply reflect the emotional or autonomic response". Aside from the fact that no evidence for this is provided, if this were to be true, then, in what sense would any of these signals be related to cognitive control?

      This is proposed as an alternative explanation for the ival signal in the discussion. It was added as our due diligence. Emotional state could provide feedback to the currently implemented control mechanism. If waiting for reinforcement is too unpleasant, this could drive the animals to ival tracking and to choosing the immediate option more frequently. We provide this option only as a possibility, not a conclusion. We have clarified this in the revised text. Nevertheless, based on our review of the literature, autonomic tracking in some form seems to be the most likely function of ACC (Seamans & Floresco 2022). While the reviewer may disagree with this, we feel it is at least as valid as all the complex, cognitively-based interpretations that commonly appear in the literature.

      Reviewer #2 (Public Review):

      Summary:

      This manuscript explores the neuronal signals that underlie resistance vs resource-based models of cognitive effort. The authors use a delayed discounting task and computational models to explore these ideas. The authors find that the ACC strongly tracks value and time, which is consistent with prior work. Novel contributions include quantification of a resource-based control signal among ACC ensembles, and linking ACC theta oscillations to a resistance-based strategy.

      Strengths:

      The experiments and analyses are well done and have the potential to generate an elegant explanatory framework for ACC neuronal activity. The inclusion of local-field potential / spike-field analyses is particularly important because these can be measured in humans.

      Thank you for the endorsement of our work.

      Weaknesses:

      I had questions that might help me understand the task and details of neuronal analyses.

      (1) The abstract, discussion, and introduction set up an opposition between resource and resistance-based forms of cognitive effort. It's clear that the authors find evidence for each (ACC ensembles = resource, theta=resistance?) but I'm not sure where the data fall on this dichotomy.

      (a) An overall very simple schematic early in the paper (prior to the MCML model? or even the behavior) may help illustrate the main point.

      (b) In the intro, results, and discussion, it may help to relate each point to this dichotomy.

      (c) What would resource-based signals look like? What would resistance based signals look like? Is the main point that resistance-based strategies dominate when delays are short, but resource-based strategies dominate when delays are long?

      (d) I wonder if these strategies can be illustrated? Could these two measures (dLP vs ival tracking) be plotted on separate axes or extremes, and behavior, neuronal data, LFP, and spectral relationships be shown on these axes? I think Figure 2 is working towards this. Could these be shown for each delay length? This way, as the evidence from behavior, model, single neurons, ensembles, and theta is presented, it can be related to this framework, and the reader can organize the findings.

      These are excellent suggestions, and we have implemented them, where possible. 

      (2) The task is not clear to me.

      (a) I wonder if a task schematic and a flow chart of training would help readers.

      Yes, excellent idea, we have now included this in Figure 1. 

      (b) This task appears to be relatively new. Has it been used before in rats (Oberlin and Grahame is a mouse study)? Some history / context might help orient readers.

      Indeed, this task has been used in rats in several prior studies. Please see the following references (PMID: 39119916, 31654652, 28000083, 26779747, 12270518, 19389183).

      (c) How many total sessions were completed with ascending delays? Was there criteria for surgeries? How many total recording sessions per animal (of the 54?)

      Please note that the delay does not change within a session. There were no criteria for surgery. 

      (d) How many trials completed per session (40 trials OR 45 minutes)? Where are there errors? These details are important for interpreting Figure 1.

      Every animal in this data set completed 40 trials, and we have updated the task description to clarify this issue. There are no errors in this task; rather, the task is designed to measure the tendency to make an impulsive choice (smaller reward now).

      (3) Figure 1 is unclear to me.

      (a) Delayed vs immediate lever presses are being plotted - but I am not sure what is red, and what is blue. I might suggest plotting each animal.

      We have updated Figure 1 considerably for clarity. 

      (b) How many animals and sessions go into each data point?

      We hope this is clarified now with our new group assignments as all sessions were included in the analysis. 

      (c) Table 1 (which might be better referenced in the paper) refers to rats by session. Is it true that some rats (2 and 8) were not analyzed for the bulk of the paper? Some rats appear to switch strategies, and some stay in one strategy. How many neurons come from each rat?

      We have updated Table 1 based on our new groupings. The rats that contribute the most sessions also tend to be represented across the behavioral groups therefore it is unlikely that effort allocation strategies across groupings are an esoteric feature of an animal. 

      (d) Task basics - RT, choice, accuracy, video stills - might help readers understand what is going into these plots

      (e) Does the animal move differently (i.e., RTs) in G1 vs. G2?

      Excellent suggestion. We have added more analysis of the task variables in the revision (e.g., RT and choice comparisons across delays).

      (4) I wasn't sure how clustered G1 vs. G2 vs G3 are. To make this argument, the raw data (or some axis of it) might help.

      (a) This is particularly important because G3 appears to be a mix of G1 and G2, although upon inspection, I'm not sure how different they really are

      (b) Was there some objective clustering criteria that defined the clusters?

      (c) Why discuss G3 at all? Can these sessions be removed from analysis?

      Based on our updates to the behavioral analysis these comments are no longer relevant. 

      (5) The same applies to neuronal analyses in Fig 3 and 4

      (a) What does a single neuron peri-event raster look like? I would include several of these.

      (b) What does PC1, 2 and 3 look like for G1, G2, and G3?

      (c) Certain PCs are selected, but I'm not sure how they were selected - was there a criteria used? How was the correlation between PCA and ival selected? What about PCs that don't correlate with ival?

      (d) If the authors are using PCA, then scree plots and PETHs might be useful, as well as comparisons to PCs from time-shuffled / randomized data.

      We hope that our reworking of the neural data analysis has clarified these issues. We now include several firing rate examples and aggregate data.   

      (6) I had questions about the spectral analysis

      (a) Theta has many definitions - why did the authors use 6-12 Hz? Does it come from the hippocampal literature, and is this the best definition of theta? What about other bands (delta - 1-4 Hz), theta (4-7 Hz); and beta - 13- 30 Hz? These bands are of particular importance because they have been associated with errors, dopamine, and are abnormal in schizophrenia and Parkinson's disease.

      This designation comes mainly from the hippocampal and ACC literature in rodents. In addition, this range best captured the peak in the power spectrum in our data. Note that we focus our analysis on theta given the literature regarding theta in the ACC as a correlate of cognitive control (references in manuscript). We did interrogate other bands as a sanity check, and the results were mostly limited to theta. Given the scope of our manuscript and the concerns raised regarding complexity, we are concerned that adding frequency analyses beyond theta would obfuscate the take-home message.

      However, the spectrograms in Figure 3 show a range of frequencies and highlight the ones in the theta band as the most dynamic prior to the choice. 

      (b) Power spectra and time-frequency analyses may justify the authors' focus. I would show these (y-axis - frequency, x-axis - time, z-axis, power).

      Thank you for the suggestion. We have added this to Figure 3.    

      (7) PC3 as an autocorrelation doesn't seem to be the right way to infer theta entrainment or spike-field relationships, as PCA can be vulnerable to phantom oscillations, and coherence can be transient. It is also difficult to compare to traditional measures of phase-locking. Why not simply use spike-field coherence? This is particularly important with reference to the human literature, which the authors invoke.

      Excellent suggestion. Note that PCA provided a way to classify neurons that exhibited peaks in the autocorrelation at theta frequencies. We have added spike-field coherence, and this analysis confirms the differences in theta entrainment of the spike trains across the behavioral groups. Please see Figure 6D.   
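
      To make the notion of theta entrainment of spike trains concrete, the sketch below computes a mean resultant length from the theta phase at which each spike occurs. Note that this is a simpler phase-locking measure swapped in for illustration, not the spike-field coherence analysis reported in Figure 6D, and it assumes that a theta phase (in radians) has already been extracted for every spike; the phase values shown are hypothetical.

```python
import cmath
import math

# Illustrative phase-locking measure (mean resultant length). This is not
# spike-field coherence; the spike phases below are hypothetical inputs.

def mean_resultant_length(spike_phases):
    """Length of the mean unit vector of spike phases.

    Values near 0 mean spikes are spread evenly across the theta cycle;
    values near 1 mean spikes cluster at one preferred phase.
    """
    if not spike_phases:
        raise ValueError("need at least one spike phase")
    mean_vector = sum(cmath.exp(1j * p) for p in spike_phases) / len(spike_phases)
    return abs(mean_vector)


locked = [0.1, 0.1, 0.1, 0.1]                        # all spikes at one phase
spread = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]  # evenly spread spikes
print(mean_resultant_length(locked), mean_resultant_length(spread))
```

      The mean resultant length is biased upward for small spike counts, so comparisons across groups with different numbers of spikes would need a correction or a matched subsample.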

      Reviewer #3 (Public Review):

      Summary:

      The study investigated decision making in rats choosing between small immediate rewards and larger delayed rewards, in a task design where the size of the immediate rewards decreased when this option was chosen and increased when it was not chosen. The authors conceptualise this task as involving two different types of cognitive effort; 'resistance-based' effort putatively needed to resist the smaller immediate reward, and 'resource-based' effort needed to track the changing value of the immediate reward option. They argue based on analyses of the behaviour, and computational modelling, that rats use different strategies in different sessions, with one strategy in which they consistently choose the delayed reward option irrespective of the current immediate reward size, and another strategy in which they preferentially choose the immediate reward option when the immediate reward size is large, and the delayed reward option when the immediate reward size is small. The authors recorded neural activity in anterior cingulate cortex (ACC) and argue that ACC neurons track the value of the immediate reward option irrespective of the strategy the rats are using. They further argue that the strategy the rats are using modulates their estimated value of the immediate reward option, and that oscillatory activity in the 6-12Hz theta band occurs when subjects use the 'resistance-based' strategy of choosing the delayed option irrespective of the current value of the immediate reward option. If solid, these findings will be of interest to researchers working on cognitive control and ACC's involvement in decision making. However, there are some issues with the experiment design, reporting, modelling and analysis which currently preclude high confidence in the validity of the conclusions.

      Strengths:

      The behavioural task used is interesting and the recording methods should enable the collection of good quality single unit and LFP electrophysiology data. The authors recorded from a sizable sample of subjects for this type of study. The approach of splitting the data into sessions where subjects used different strategies and then examining the neural correlates of each is in principle interesting, though I have some reservations about the strength of evidence for the existence of multiple strategies.

      Thank you for the positive comments. 

      Weaknesses:

      The dataset is very unbalanced in terms of both the number of sessions contributed by each subject, and their distribution across the different putative behavioural strategies (see table 1), with some subjects contributing 9 or 10 sessions and others only one session, and it is not clear from the text why this is the case. Further, only 3 subjects contribute any sessions to one of the behavioural strategies, while 7 contribute data to the other such that apparent differences in brain activity between the two strategies could in fact reflect differences between subjects, which could arise due to e.g. differences in electrode placement. To firm up the conclusion that neural activity is different in sessions where different strategies are thought to be employed, it would be important to account for potential cross-subject variation in the data. The current statistical methods don't do this as they all assume fixed effects (e.g. using trials or neurons as the experimental unit and ignoring which subject the neuron/trial came from).

      In the revised manuscript we have updated the group assignments and improved our description of the logic and methods behind these groupings. With this new approach, all sessions are now included in the analysis. The group assignments are made purely on the behavioral statistics of an animal in each session. We feel this approach is preferable to eliminating neurons or sessions with the goal of balancing them, which may introduce bias. Further, the rats that contribute the most sessions also tend to be represented across the behavioral groups; it is therefore unlikely that effort allocation strategies across groupings are an idiosyncratic feature of an individual animal. As neurons are randomly sampled from each animal on a given session, we feel that we are justified in treating these as fixed effects.

      It is not obvious that the differences in behaviour between the sessions characterised as using the 'G1' and 'G2' strategies actually imply the use of different strategies, because the behavioural task was different in these sessions, with a shorter wait (4 seconds vs 8 seconds) for the delayed reward in the G1 strategy sessions where the subjects consistently preferred the delayed reward irrespective of the current immediate reward size. Therefore the differences in behaviour could be driven by difference in the task (i.e. external world) rather than a difference in strategy (internal to the subject). It seems plausible that the higher value of the delayed reward option when the delay is shorter could account for the high probability of choosing this option irrespective of the current value of the immediate reward option, without appealing to the subjects using a different strategy.

      Further, even if the differences in behaviour do reflect different behavioural strategies, it is not obvious that these correspond to allocation of different types of cognitive effort. For example, subjects' failure to modify their choice probabilities to track the changing value of the immediate reward option might be due simply to valuing the delayed reward option higher, rather than not allocating cognitive effort to tracking immediate option value (indeed this is suggested by the neural data). Conversely, if the rats assign higher value to the delayed reward option in the G1 sessions, it is not obvious that choosing it requires overcoming 'resistance' through cognitive effort.

      The RL modelling used to characterise the subject's behavioural strategies made some unusual and arguably implausible assumptions:

      Thank you for the feedback. Based on these comments (and those above), we have completely reworked the RL model. In addition, we have taken care to separate out the variables that correspond to a resistance- versus a resource-based signal.

      There were also some issues with the analyses of neural data which preclude strong confidence in their conclusions:

      Figure 4I makes the striking claim that ACC neurons track the value of the immediately rewarding option equally accurately in sessions where two putative behavioural strategies were used, despite the behaviour being insensitive to this variable in the G1 strategy sessions. The analysis quantifies the strength of correlation between a component of the activity extracted using a decoding analysis and the value of the immediate reward option. However, as far as I could see this analysis was not done in a cross-validated manner (i.e. evaluating the correlation strength on test data that was not used for either training the MCML model or selecting which component to use for the correlation). As such, the chance level correlation will certainly be greater than 0, and it is not clear whether the observed correlations are greater than expected by chance.

      We have added more rigorous methods to assess the ival tracking signal (Figure 4 and 5). In addition, we’ve dropped the claim that ival tracking is the same across the behavioral groups. We suspect that this was an artifact of a suboptimal group assignment approach in the previous version. 
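The cross-validation concern raised above can be illustrated with a small simulation. The sketch below is not the MCML analysis from the manuscript; it uses pure noise, picks whichever of several hypothetical "components" correlates best with a target on one half of the trials, and then evaluates the same component on the held-out half. All names, sizes, and the choice of a split-half design are assumptions made for illustration.

```python
import numpy as np

def selection_bias_demo(n_trials=100, n_components=50, n_repeats=200, seed=0):
    """Mean |r| of a noise component chosen in-sample vs. evaluated on held-out trials."""
    rng = np.random.default_rng(seed)
    half = n_trials // 2
    in_sample, held_out = [], []
    for _ in range(n_repeats):
        target = rng.standard_normal(n_trials)                 # stand-in for option value
        comps = rng.standard_normal((n_trials, n_components))  # noise-only components
        # Correlate every component with the target on the training half only.
        r_train = np.array([
            np.corrcoef(comps[:half, k], target[:half])[0, 1]
            for k in range(n_components)
        ])
        best = int(np.argmax(np.abs(r_train)))                 # selection step
        # Evaluate the selected component on trials it was not selected on.
        r_test = np.corrcoef(comps[half:, best], target[half:])[0, 1]
        in_sample.append(abs(r_train[best]))
        held_out.append(abs(r_test))
    return float(np.mean(in_sample)), float(np.mean(held_out))

in_sample_r, held_out_r = selection_bias_demo()
print(f"selected in-sample |r|: {in_sample_r:.2f}, held-out |r|: {held_out_r:.2f}")
```

Because the data contain no true relationship, any gap between the two numbers reflects selection alone, which is why a chance level estimated on the selection data is expected to sit above zero.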

      An additional caveat with the claim that ACC is tracking the value of the immediate reward option is that this value likely correlates with other behavioural variables, notably the current choice and recent choice history, that may be encoded in ACC. Encoding analyses (e.g. using linear regression to predict neural activity from behavioural variables) could allow quantification of the variance in ACC activity uniquely explained by option values after controlling for possible influence of other variables such as choice history (e.g. using a coefficient of partial determination).

      We agree that the ival tracking signal may be influenced by other variables, especially ones that are not cognitive but are instead generated by the autonomic system. We have included a discussion of this possibility in the Discussion section. Our previous work has explored the role of choice history on neural activity; please see White et al. (2024).
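The encoding analysis suggested by the reviewer can be sketched as follows. This is a minimal example on synthetic data, assuming a firing rate that depends on both choice history and option value, with the two predictors correlated. The coefficient of partial determination is computed as the fractional reduction in residual sum of squares when option value is added to a model that already contains choice history. Variable names and effect sizes are hypothetical.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares of an ordinary least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return float(resid @ resid)

def partial_r2(X_full, X_reduced, y):
    """Coefficient of partial determination for the predictors missing from X_reduced."""
    sse_full = sse(X_full, y)
    sse_reduced = sse(X_reduced, y)
    return (sse_reduced - sse_full) / sse_reduced

rng = np.random.default_rng(1)
n = 5000
choice_history = rng.standard_normal(n)
# Option value is deliberately correlated with choice history.
value = 0.8 * choice_history + 0.6 * rng.standard_normal(n)
# Synthetic firing rate driven by both variables plus noise.
rate = 1.0 * choice_history + 0.5 * value + rng.standard_normal(n)

X_reduced = choice_history[:, None]
X_full = np.column_stack([choice_history, value])
cpd_value = partial_r2(X_full, X_reduced, rate)
print(f"variance uniquely explained by value: {cpd_value:.3f}")
```

In this construction only the part of `value` that is independent of `choice_history` can contribute, so the result is much smaller than the raw squared correlation between `value` and `rate`.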

      Figure 5 argues that there are systematic differences in how ACC neurons represent the value of the immediate option (ival) in the G1 and G2 strategy sessions. This is interesting if true, but it appears possible that the effect is an artefact of the different distribution of option values between the two session types. Specifically, due to the way that ival is updated based on the subjects' choices, in G1 sessions where the subjects are mostly choosing the delayed option, ival will on average be higher than in G2 sessions where they are choosing the immediate option more often. The relative number of high, medium and low ival trials in the G1 and G2 sessions will therefore be different, which could drive systematic differences in the regression fit in the absence of real differences in the activity-value relationship. I have created an ipython notebook illustrating this, available at: https://notebooksharing.space/view/a3c4504aebe7ad3f075aafaabaf93102f2a28f8c189ab9176d4807cf1565f4e3. To verify that this is not driving the effect it would be important to balance the number of trials at each ival level across sessions (e.g. by subsampling trials) before running the regression.

      This is an excellent point and led us to abandon the linear regression-based approach to quantify differences in ival coding across behavioral groups.
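The trial-balancing control proposed by the reviewer (subsampling so that each value level contributes the same number of trials in both session types) could look like the sketch below. The six value levels and their sampling probabilities are hypothetical and only serve to produce two differently skewed distributions.

```python
import numpy as np

def balance_trials(values_a, values_b, rng):
    """Return trial indices into a and b with equal counts at every shared value level."""
    keep_a, keep_b = [], []
    for level in np.intersect1d(values_a, values_b):
        idx_a = np.flatnonzero(values_a == level)
        idx_b = np.flatnonzero(values_b == level)
        n = min(len(idx_a), len(idx_b))  # keep only as many as the sparser session type has
        keep_a.append(rng.choice(idx_a, size=n, replace=False))
        keep_b.append(rng.choice(idx_b, size=n, replace=False))
    return np.sort(np.concatenate(keep_a)), np.sort(np.concatenate(keep_b))

rng = np.random.default_rng(0)
levels = np.arange(1, 7)  # hypothetical immediate-reward sizes
# Skewed toward high values in one session type and low values in the other.
g1_values = rng.choice(levels, size=300, p=[0.05, 0.05, 0.1, 0.2, 0.3, 0.3])
g2_values = rng.choice(levels, size=300, p=[0.3, 0.3, 0.2, 0.1, 0.05, 0.05])

idx_g1, idx_g2 = balance_trials(g1_values, g2_values, rng)
print(np.bincount(g1_values[idx_g1], minlength=7))
print(np.bincount(g2_values[idx_g2], minlength=7))
```

Any regression run on `idx_g1` and `idx_g2` then sees identical value distributions, so a remaining difference in fit cannot be attributed to how often each value level occurred.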

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      This paper was extremely hard to read. In addition to the issues raised in the public review (overly complex and incomplete analyses), one of the hardest things to deal with was the writing.

      Thank you for the feedback. Hopefully we have addressed this with our thorough rewrite. 

      The presentation was extremely hard to follow. I had to read through it several times to figure out what the task was. It wasn't until I got to the RL model Figure 2A that I realized what was really going on with the task. I strongly recommend having an initial figure that lays out the actual task (without any RL or modeling assumptions) and identifies the multiple different kinds of sessions. What is the actual data you have to start with? That was very unclear.

      Excellent idea. We have implemented this in Figure 1.  

      Labeling session by "group" is very confusing. I think most readers take "group" as the group of subjects, but that's not what you mean at all. You mean some sessions were one way and some were another. (And, as I noted in the public review, you ignore many of the sessions, which I think is not OK.) I think a major rewrite would help a lot. Also, I don't think the group analysis is necessary at all. In the public review, I recommend doing the analyses very differently and more classically.

      We have updated the group assignments in a manner that is more intuitive, reflects the delays, and includes all sessions.  

      The paper is full of arbitrary abbreviations that are completely unnecessary. Every time I came to "ival", I had to translate that into "number of pellets delivered on the immediate lever" and every time I came to dLP, I had to translate that into "delayed lever press". Making the text shorter does not make the text easier to read. In general, I was taught that unless the abbreviation is the common term (such as "DNA" not "deoxyribonucleic acid"), you should never use an abbreviation. While there are some edge cases (ACC probably over "anterior cingulate cortex"), dLP, iLP, dLPs, iLPs, ival, are definitely way over the "don't do that" line.

      We completely agree here and apologize for the excessive use of abbreviations. We have removed nearly all of them.

      The figures were incomplete, poorly labeled, and hard to read. A lot of figures were missing, for example:

      Basic task structure

      Basic behavior on the task

      Scatter plot of the measures that you are clustering (lever press choice X number of pellets on the immediate lever; you can use color or multiple panels to indicate the delay to the delayed lever)

      Figure 3 is just a couple of examples. That isn't convincing at all.

      Figure 4 is missing labels. In Figure 4, I don't understand what you are trying to say.

      I don't see how the results on page 16 arise from Figure 6. I strongly recommend starting from the actual data and working your way to what it means rather than forcing this into this unreasonable "session group" analysis.

      We have completely reworked the Figures for clarity and content. 

      The statement that "no prior study has explored the cellular correlates of cognitive effort" is ludicrous and insulting. There are dozens of experiments looking at ACC in cognitive effort tasks, in humans, other primates, and rodents. There are many dozens of experiments looking at cellular correlates in intertemporal choice tasks, some with neural manipulations, some with ensemble recordings. There are many dozens of experiments looking at cellular relationships to waiting out a delay.

      We agree that our statement was extremely imprecise. We have updated this to say: “Further, a role for theta oscillations in allocating physical effort has been identified. However, the cellular mechanisms within the ACC that control and deploy types of cognitive effort have not been identified.”

      Reviewer #2 (Recommendations For The Authors):

      In Figure 2, the panels below E and F are referred to as 'right' - but they are below? I would give them letters.

      I would make sure that animal #s, neuron #s, and LFP#s are clearly presented in the results and in each figure legend. This is important to follow the results throughout the manuscript.

      Some additional proofreading ('Fronotmedial') might help with clarity.

      Based on our updates, this is no longer relevant.  

      Reviewer #3 (Recommendations For The Authors):

      In addition to the suggestions above to address specific issues, it would be useful to report some additional information about aspects of the experiments and analyses:

      Specify how spike sorting was performed and what metrics were used to select well isolated single units.

      Done.

      Provide histology showing the recording locations for each subject.

      Histological assessments of electrode placements are provided in White et al. 2024, but we provide an example placement. This has been added to the text.

      Indicate the sequence of recording sessions that occurred for each subject, including for each session what delay duration was used and which dataset the session contributed to, and indicate when the neural probes were advanced between sessions.

      We feel that this adds unnecessary complexity, as we make no claims about holding units across sessions or about differences in coding in the dorsoventral gradient of ACC.

      Indicate the experimental unit when reporting uncertainty measures in figure legends (e.g. mean +/- SEM across sessions).

      Done.

    1. Author response:

      Before providing a brief provisional response to the two reviews, it is important to reiterate a few key points about our work. First, our paper is largely a computational biophysics paper, augmented by experimental results. Generally speaking, computational biophysics work intends to achieve one of two things (or both). One is to provide more molecular-level insight into various behaviors of biomolecular systems that has not been (or cannot be) provided by qualitative experimental results alone. The second general goal of computational biophysics is to formulate new hypotheses to be tested subsequently by experiment. In our paper, we have achieved both of these goals and then confirmed the key computational results by experiment.

      The first reviewer has some valuable points, which can be addressed as follows (and will be emphasized in the revised version of the paper): (1) Yes, the simulations of capsid rupture in the NPC and capsid-only are directly comparable as both have approximately the same number of bound LEN, as determined by following the LEN-capsid interaction protocol described in the main text (around Fig 6) and in the SI section S3; (2) While we have stressed this point in several places in the manuscript, here again we stress that coarse-grained (CG) MD time is not the same as real time. The point of CG simulations is to accelerate the timescale of the MD and the associated sampling, so the CG “time” from the MD integrator needs to be rescaled to associate a real time to it. As such, our CG simulation is not representing a microsecond of real time but rather something much longer. We will emphasize this again in the revised text. (3) Actually, we think that the parameterization of the LEN model and the LEN-capsid interactions is well described in the text associated with Fig 6 and in SI section S3. It is true that this one part of the CG model was parameterized “top-down” given the good experimental structures of bound LEN to capsid and other data, but the rest of the CG model is “bottom-up” (meaning developed from well-defined coarse-graining statistical mechanics as applied to molecular-level structures and interactions, see also below).

      As for the second reviewer, this review is quite problematic in our view as the reviewer seems to think that quoting a number of qualitative experimental results is sufficient to undermine the impact of our paper (it is not) and, furthermore, the reviewer appears to have a very minimal understanding of “bottom-up” CG modeling, which we have utilized. This modeling does not in fact rely on the “assumptions” this reviewer alleges we have relied on. (As an aside, it could be helpful for this reviewer to study the review by Jin et al., https://doi.org/10.1021/acs.jctc.2c00643, in order to become more familiar with the field and our approach before criticizing it.) We also note that our main HIV capsid-NPC docking model is already published in PNAS (https://doi.org/10.1073/pnas.2313737121), where it underwent rigorous peer review. In our forthcoming full response to the reviews and in the revised paper we will attempt to address a number of this reviewer’s comments, but the number, extent, and tone of this collection of criticisms, for us, calls into question the objectivity of this reviewer, not to mention the reviewer’s rather weak understanding of what we have done and how we have done it.

      Finally, while we certainly appreciate the overall positive eLife assessment, we are disappointed by the statement “some mechanistic interpretations rely on assumptions embedded in the simulations, leaving parts of the evidence incomplete”. Of course, all simulations (and experiments) rely on certain assumptions, but we have gone to great lengths to provide a “bottom-up” approach to our modeling, based on underlying molecular-level structures and interactions, and we have provided experimental validation of the main simulation predictions. It seems that the comments of the second reviewer may have influenced this point of view, but we do not feel it is justified.

    1. Author response:

      Reviewer #1 (Public review):

      This manuscript provides several important findings that advance our current knowledge about the function of the gustatory cortex (GC). The authors used high-density electrophysiology to record neural activity during a sucrose/NaCl mixture discrimination task. They observed population-based activity capable of representing different mixtures in a linear fashion during the initial stimulus sampling period, as well as representing the behavioral decision (i.e., lick left or right) at a later time point. Analyzing this data at the single neuron level, they observed functional subpopulations capable of encoding the specific mixture (e.g., 45/55), tastant (e.g., sucrose), and behavioral choice (e.g., lick left). To test the functional consequences of these subpopulations, they built a recurrent neural network model in order to "silence" specific functional subpopulations of GC neurons. The virtual ablation of these functional subpopulations altered virtual behavioral performance in a manner predicted by the subpopulation's presumed contribution.

      Strengths:

      Building a recurrent neural network model of the gustatory cortex allows the impact of the temporal sequence of functionally identifiable populations of neurons to be tested in a manner not otherwise possible. Specifically, the authors' model links neural activity at the single neuron and population level with perceptual ability. The electrophysiology methods and analyses used to shape the network model are appropriate. Overall, the conclusions of the manuscript are well supported.

      Weaknesses:

      One potential concern is the apparent mismatch between the neural and behavioral data. Neural analyses indicate a clear separation of the activity associated with each mixture that is independent of the animal's ultimate choice. This would seemingly indicate that the animals are making errors despite correctly encoding the stimulus. Based solely on the neural data, one would expect the psychometric curve to be more "step-like" with a significantly steeper slope. One potential explanation for this observation is the concentration of the stimuli utilized in the mixture discrimination task. The authors utilize equivalent concentrations, rather than intensity-matched concentrations. In this case, a single stimulus can (theoretically) dominate the perception of a mixture, resulting in a biased behavioral response despite accurate concentration coding at the single neuron level. Given the difficulty of isointensity matching concentrations, this concern is not paramount. However, the apparent mismatch between the neural and behavioral data should be acknowledged/addressed in the text.

      We thank the Reviewer for the insightful comments and thoughtful suggestions. Our electrophysiological recordings show that GC dynamically encodes stimulus concentration of mixture elements, dominant perceptual quality, and decisions of directional lick. With regard to the encoding of mixtures, the clear separation of activity associated with each mixture (Figure 3) is present at a trial-averaged pseudo-population level, and average activities associated with more similar, intermediate mixtures are closer to each other in this space. In fact, at a single trial level activity evoked by similar, intermediate mixtures can be hard to separate. This increased similarity can lead to behavioral errors resulting from either incorrect encoding of the stimulus or from the inability to interpret the stimuli to guide the correct decision.

      The psychometric function, which shows that more distinct stimuli (100/0 vs 0/100) lead to fewer mistakes than more ambiguous, intermediate mixtures (55/45 vs 45/55), is consistent with the increased ambiguity of responses to intermediate mixtures and with the possibility that, compared to pure stimuli, intermediate mixtures lead to more trials in which the binary choice component of neural activity is inverted, resulting in more directional errors.

      The Reviewer is correct that there could be a slight mismatch in the perceived intensity of the mixture components. This mismatch could be the reason for the slight asymmetry in our psychometric function (Figure 1B). However, it is not uncommon for mice in these 2AC tasks to also have a motor laterality bias in their responses that manifests itself for the more ambiguous stimuli. We chose not to model this bias given its subtlety and its unknown origin. Rather, we chose to model an ideal scenario in which stimuli have matched intensity and no motor bias exists. In the revised version we will discuss this issue.

      Reviewer #2 (Public review):

      Lang et al. investigate the contribution of individual neuronal encoding of specific task features to population dynamics and behavior. Using a taste-based decision-making behavioral task with electrophysiology from the mouse gustatory cortex and computational modeling, the authors reveal that neurons encoding sensory, perceptual, and decision-related information with linear and categorical patterns are essential for driving neural population dynamics and behavioral performance. Their findings suggest that individual linear and categorical coding units have a significant role in cortical dynamics and perceptual decision-making behavior.

      Overall, the experimental and analytical work is of very high quality, and the findings are of great interest to the taste coding field, as well as to the broader systems neuroscience field.

      I have a couple of suggestions to further enhance the authors' important conclusions:

      My main comment is the distinction between constrained and unconstrained units. The authors train a small percentage of units to match the real neural data (constrained units), and then find some unconstrained units that are similar to the real neural data and some that are not. As far as I could tell, the relative fraction of constrained and unconstrained units in the trained RNN is not reported; I assume the constrained ones are a much smaller population, but this is unclear. The selection of different groups of neurons for the RNN ablation experiments appears to be based on their response profiles only. Therefore, if I understood correctly, both constrained and unconstrained units are ablated together for a given response category (e.g., linear or step-perception). It would be useful, therefore, to separately compare the effects of constrained vs. unconstrained RNN units.

      We thank the Reviewer for the constructive feedback and are pleased that the work is considered of broad interest. The Reviewer is correct that ablations were carried out with respect to response categories only and included both constrained and unconstrained units.

      The ratio of total units to constrained units is fixed at 5.88, thus constrained units are ~17% of the network and unconstrained units are ~83%. This value is specified in the Methods (RNN: Components and dynamics), but we will report it in the Results of the revised manuscript as well for clarity.

      Specifically:

      (1) For the analyses in the initial version of the manuscript, the authors should specify how many units in each ablation category are constrained and unconstrained.

      In the revised manuscript, we will specify the fractions of constrained and unconstrained units within each response category. For convenience, they are reported here: Linear = 194 constrained and 691 unconstrained units; Step-perception = 147 constrained and 840 unconstrained units; Step-choice = 129 constrained and 814 unconstrained units; Other = 353 constrained and 1739 unconstrained units.
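For reference, the constrained fraction within each response category follows directly from the counts listed above. The short sketch below only performs that arithmetic; the counts are copied from this response, and the dictionary layout is an illustrative choice.

```python
# (constrained, unconstrained) unit counts per response category, as reported above.
counts = {
    "Linear": (194, 691),
    "Step-perception": (147, 840),
    "Step-choice": (129, 814),
    "Other": (353, 1739),
}

# Fraction of constrained units within each category.
fractions = {name: c / (c + u) for name, (c, u) in counts.items()}

# Pooled totals across all four categories.
total_constrained = sum(c for c, _ in counts.values())
total_unconstrained = sum(u for _, u in counts.values())
overall_fraction = total_constrained / (total_constrained + total_unconstrained)

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} constrained")
print(f"All categories pooled: {overall_fraction:.1%} constrained")
```

The pooled value is close to the ~17% quoted above, while individual categories differ in how strongly constrained units are represented.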

      (2) The authors should repeat Figure 6, but only for unconstrained units to test how much of the effects in the initial version of Figure 6 are driven by constrained vs. unconstrained RNN units.

      In the revised version we will add a Supplemental Figure in which the contribution of constrained vs unconstrained units is addressed.

      (3) The authors should repeat Figure 7, but performing ablations separately on the constrained and unconstrained units to examine how the network behaves in each case and the resulting "behavioral" effect.

      The revised version will include a Supplemental Figure with these simulations.

      Reviewer #3 (Public review):

      Primary taste cortex neurons show a variety of dynamic response profiles during taste decision-making tasks, reflecting both sensory and decision variables. In the present study, Lang et al. set out to determine how neurons with distinct response profiles contribute to perceptual decisions about taste stimuli.

      The methods, with reference to the behavioral task and electrophysiological recordings/data analysis, are straightforward, solid, and appropriate. The computational model is presented in a clear and conceptually intuitive manner, although the details are outside of my area of expertise.

      The experimental design features a simple 2-alternative forced-choice design that yielded clear psychometric curves across a range of stimuli. In vivo recordings were performed using Neuropixels and yielded an appropriate sample of single neuron responses. The strength of the model lies in the fact that it consists of single neurons whose response profiles mimic those recorded in vivo, and allows neuron-selective manipulation. By virtually lesioning specific subsets of neurons in the network, the authors demonstrate that a relatively small population of neurons with specific tuning profiles was sufficient to produce the observed neural dynamics and behavioral responses. This effect was selective as lesioning other responsive neurons did not affect overall response dynamics or performance. These findings provide new insight into the relation between the response profiles of single neurons in sensory cortex, their population-level activity dynamics, and the perceptual decisions they inform.

      The approach is particularly innovative as it uses computational modeling to target functionally-defined "cell types", which cannot necessarily be targeted by more conventional genetic approaches.

      We thank the Reviewer for the positive assessment of our study.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      In this paper, the authors develop a biologically plausible recurrent neural network model to explain how the hippocampus generates and uses barcode-like activity to support episodic memory. They address key questions raised by recent experimental findings: how barcodes are generated, how they interact with memory content (such as place and seed-related activity), and how the hippocampus balances memory specificity with flexible recall. The authors demonstrate that chaotic dynamics in a recurrent neural network can produce barcodes that reduce memory interference, complement place tuning, and enable context-dependent memory retrieval, while aligning their model with observed hippocampal activity during caching and retrieval in chickadees.

      Strengths:

      (1) The manuscript is well-written and structured.

      (2) The paper provides a detailed and biologically plausible mechanism for generating and utilizing barcode activity through chaotic dynamics in a recurrent neural network. This mechanism effectively explains how barcodes reduce memory interference, complement place tuning, and enable flexible, context-dependent recall.

      (3) The authors successfully reproduce key experimental findings on hippocampal barcode activity from chickadee studies, including the distinct correlations observed during caching, retrieval, and visits.

      (4) Overall, the study addresses a somewhat puzzling question about how memory indices and content signals coexist and interact in the same hippocampal population. By proposing a unified model, it provides significant conceptual clarity.

      Weaknesses:

      The recurrent neural network model incorporates assumptions and mechanisms, such as the modulation of recurrent input strength, whose biological underpinnings remain unclear. The authors acknowledge some of these limitations thoughtfully, offering plausible mechanisms and discussing their implications in depth.

      One thread of questions that authors may want to further explore is related to the chaotic nature of activity that generates barcodes when recurrence is strong. Chaos inherently implies sensitivity to initial conditions and noise, which raises questions about its reliability as a mechanism for producing robust and repeatable barcode signals. How sensitive are the results to noise in both the dynamics and the input signals? Does this sensitivity affect the stability of the generated barcodes and place fields, potentially disrupting their functional roles? Moreover, does the implemented plasticity mitigate some of this chaos, or might it amplify it under certain conditions? Clarifying these aspects could strengthen the argument for the robustness of the proposed mechanism.

      In our model, chaos is used to produce a random barcode when forming memories, but memory retrieval depends on attractor dynamics. Specifically, the plasticity update at the end of the cache creates an attractor state; for subsequent memory retrieval to succeed, the network activity must settle into this attractor rather than remaining chaotic. This attractor state is a conjunction of memory content (place and seed activity) and memory index (barcode activity). Thus a barcode is ‘reactivated’ when network dynamics during retrieval settle into this cache attractor. In other words, chaotic dynamics do not need to generate the same barcode twice.

      The reviewer raises an important point, which is how sensitivity to initial conditions and noise would affect the reliability of our proposed mechanism. The key question here is how noise will affect the network’s dynamics during retrieval. Would adding noise to the dynamics make memory retrieval more difficult? We thank the reviewer for suggesting we investigate this further, and below describe our experiments and changes to the manuscript to better address this topic.

      We first experimented with adding independent Gaussian-distributed noise into each unit, drawn independently at each timestep. We analyzed recall accuracy using the same task and methods as Fig. 4F while varying the magnitude of noise. Memory recall was quite robust to this form of noise, even as the magnitude of noise approached half of the signal amplitude. This first experiment added noise into the temporal dynamics of the network. We subsequently examined adding static noise into the network inputs, which can also be thought of as introducing noise into initial conditions. Specifically, we added independent Gaussian-distributed noise into each unit, with the random value held constant for the extent of temporal dynamics. This perturbation decreased the likelihood of memory recall in a graded manner with noise magnitude, without dramatically changing the spatial profile. Examination of dynamics on individual trials revealed that the network failed to converge onto a cache attractor on some random fraction of trials, with other trials appearing nearly identical to noiseless results. We now include these results in the text and as a new supplementary figure, Figure S4AB.
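The two noise conditions described above can be made concrete with a toy example. The sketch below is not the model from the manuscript: it assumes a generic leaky, Hopfield-style rate network with Hebbian weights, a weak cue toward one stored pattern, and hypothetical parameter values. "Temporal" noise is redrawn at every step, whereas "static" noise is drawn once and held constant for the whole run.

```python
import numpy as np

def recall_overlap(sigma, noise_type, n_units=200, n_patterns=3, gain=2.0,
                   alpha=0.1, n_steps=300, seed=0):
    """Overlap between the sign of the final state and the cued stored pattern."""
    rng = np.random.default_rng(seed)
    patterns = rng.choice([-1.0, 1.0], size=(n_patterns, n_units))
    weights = patterns.T @ patterns / n_units        # Hebbian outer-product rule
    np.fill_diagonal(weights, 0.0)
    cue = 0.3 * patterns[0]                          # weak input pointing at pattern 0
    static_noise = sigma * rng.standard_normal(n_units)  # drawn once, then frozen
    x = np.zeros(n_units)
    for _ in range(n_steps):
        if noise_type == "temporal":
            noise = sigma * rng.standard_normal(n_units)  # fresh sample every step
        elif noise_type == "static":
            noise = static_noise
        else:
            noise = 0.0
        drive = gain * (weights @ np.tanh(x)) + cue + noise
        x = (1 - alpha) * x + alpha * drive          # leaky integration
    return float(np.mean(np.sign(x) * patterns[0]))

overlap_clean = recall_overlap(0.0, "none")
overlap_temporal = recall_overlap(2.0, "temporal")
overlap_static = recall_overlap(2.0, "static")
print(overlap_clean, overlap_temporal, overlap_static)
```

In this toy network the leaky integration partly averages out noise that is redrawn at every step, while a frozen perturbation of the same magnitude is not averaged away. Whether the same mechanism accounts for the results in Figure S4 is not something this sketch can establish.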

      To clarify the network dynamics and the purpose of chaos in our model, we make the following modifications in text:

      Section 2.3, paragraph 2 (starting at “To store memories…”):

      “…place inputs arrive into the RNN, recurrent dynamics generate an essentially random barcode, seed inputs are activated, and then Hebbian learning binds a particular pattern of barcode activity to place- and seed-related activity.”

      Section 2.3, paragraph 3 (starting at “Memory recall in our network…”): As an example, consider a scenario in which an animal has already formed a memory at some location $l$, resulting in the storage of an attractor $\vec{a}$ into the RNN. The attractor $\vec{a}$ can be thought of as a linear combination of place input-driven activity $p(l)$, seed input-driven activity $s$, and a recurrent-driven barcode component $b$. Later, the animal returns to the same location and attempts recall (i.e. sets $r = 1$, Figure 3B). Place inputs for location $l$ drive RNN activity towards $p(l)$, which is partially correlated with attractor $\vec{a}$, and the recurrent dynamics cause network activity to converge onto attractor $\vec{a}$. In this way, barcode activity $b$ is reactivated, along with the place and seed components stored in the attractor state, $p(l)$ and $s$. The seed input can also affect recall, as discussed in the following section.

      Section 2.4, final paragraph (starting “We further examined how model hyperparameters affected performance on these tasks”), added the following describing new results on adding noise: We found that adding noise to the network's temporal dynamics had little effect on memory recall performance (Figure S4A). However, large static noise vectors added to the network's input and initial state decreased the overall probability of memory recall, but not its spatial profile (Figure S4B).

      It may also be worth exploring the robustness of the results to certain modeling assumptions. For instance, the choice to run the network for a fixed amount of time and then use the activity at the end for plasticity could be relaxed.

      As described above, chaotic dynamics are necessary to generate a barcode during a cache, but not to reactivate that barcode during retrieval. During a successful memory retrieval, network activity settles into an attractor state and thus does not depend on the duration of simulated dynamics. The choice of duration to run dynamics during caching is important, but only insofar as activity significantly decorrelates from the initial state. We show in Figure S1B that decorrelation saturates ~t=25, and thus any random time point t > 25 would be similarly effective. We used a fixed duration runtime for caches only to avoid introducing unnecessary complication into our model.

      Reviewer #2 (Public review):

      Summary:

      Striking experimental results by Chettih et al 2024 have identified high-dimensional, sparse patterns of activity in the chickadee hippocampus when birds store or retrieve food at a given site. These barcode-like patterns were interpreted as "indexes" allowing the birds to retrieve from memory the locations of stored food.

      The present manuscript proposes a recurrent network model that generates such barcode activity and uses it to form attractor-like memories that bind information about location and food. The manuscript then examines the computational role of barcode activity in the model by simulating two behavioral tasks, and by comparing the model with an alternate model in which barcode activity is ablated.

      Strengths of the study:

      Proposes a potential neural implementation for the indexing theory of episodic memory

      Provides a mechanistic model of striking experimental findings: barcode-like, sparse patterns of activity when birds store a grain at a specific location

      A particularly interesting aspect of the model is that it proposes a mechanism for binding discrete events to a continuous spatial map, and demonstrates the computational advantages of this mechanism.

      Weaknesses:

      The relation between the model and experimentally recorded activity needs some clarification

      The relation with indexing theory could be made more clear

      The importance of different modeling ingredients and dynamical mechanisms could be made more clear

      The paper would be strengthened by focusing on the most essential aspects

      Comments:

      The model distinguishes between "barcode activity" and "attractors". Which of the two corresponds to experimentally-recorded barcodes? I would presume the attractors. A potential issue is that the attractors are, as explained in the text (l.137), conjunctions of place activity, barcode activity and "seed" inputs. The fact that the seed activity is shared across attractors seems to imply that they have a non-zero correlation independent of distance. Is that the case in the model? If I understand correctly, Fig 3D shows correlations between an attractor and barcodes at different locations, but correlations between attractors at different locations are not shown. Fig 1 F instead shows that correlations between recorded retrieval activities decay to zero with distance.

      More generally, the fact that the expression "barcode" is apparently used with different meanings in the model and in the experiments is potentially confusing (in the model they correspond to activity generated during caching, and this activity is distinct from the memories; my understanding is that in the experiments barcodes correspond to both caching and retrieval, but perhaps I am mistaken?).

      Our intent is to use the expression “barcode” as similarly as possible between model and experimental work. The reviewer points out that the connection between barcodes in experimental and modeling work is unclear, as well as the relation of “attractors” in our model to previous experimental results. The meaning of ‘barcode’ is absolutely critical—we clarify below our intended meaning, and then describe changes to the manuscript to highlight this.

      In experiments, we observed that activity during caching looked different than ordinary hippocampal activity (i.e. typical “place activity” observed during visits). Empirically there were two major differences. First, there was a pattern of neural activity which was present during every cache. This pattern was also present when birds visually inspected sites containing a cached seed, but not when visually inspecting an empty site. This is what we refer to as “seed activity”. Second, there was a pattern of neural activity which was unique to each cache. This pattern re-occurred during retrieval, and was orthogonal to place activity (see Fig. 1E-F). This is what we refer to as “barcode activity”. In summary, activity during a cache (or retrieval) contains a combination of three components: place activity, seed activity, and barcode activity.

      These experimental findings are recapitulated in our model, as activity during a cache contains a combination of three components: place activity driven by place inputs, seed activity driven by seed inputs, and barcode activity generated by recurrent dynamics. Cache activity in the model corresponds to cache activity in experiments, and barcodes in the model correspond to barcodes in experiments. Our model additionally has “attractors”, meaning that network connectivity changes so that the activity generated during a simulated cache becomes an attractor state of network dynamics. “Attractors” refers to a feature of network dynamics, not a distinct activity state, and we do not yet know if these attractors exist in experimental data.

      Figure 3D, as described in the figure legend, is a correlation of activity during cache and retrieval (in purple), for cache-retrieval pairs at the same or at different sites. We believe this is what the reviewer asks to see: the correlation between attractor states for different cache locations. The reviewer makes an important point: seed activity is shared across all attractors, so then why are correlations not high for all locations? This is because attractors also have a place component, which is anti-correlated for distant locations. This is evident in Fig. 3D by noticing that visit-visit correlations (black line, corresponding to place activity only) are negative for distant locations, and the correlation between attractors (purple line, cache-retrieval pairs) is subtly shifted up relative to the black line (place code only) for these distant locations. The size of this shift is due to the relative magnitude of place and seed inputs. For example, if we increase the strength of the seed input during caching (blue line), we can further increase the correlation between attractors even for quite distant sites:

      Author response image 1.
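      The argument above (distant place codes are anti-correlated, and a shared seed component shifts correlations upward) can be illustrated with a toy calculation. This is only a sketch under stated assumptions, not the model used in the paper: the Gaussian place bumps on a ring, the random seed vector, and the names `N_UNITS`, `SIGMA` and `SEED_GAIN` are all hypothetical, and the barcode component is omitted.

```python
import math
import random

N_UNITS = 500      # hypothetical number of units, arranged on a ring
SIGMA = 50.0       # hypothetical place-field width, in unit spacings
SEED_GAIN = 0.5    # hypothetical strength of the shared seed component


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def place_vector(center):
    """Gaussian bump of activity centered on `center`, using circular distance."""
    out = []
    for i in range(N_UNITS):
        d = abs(i - center)
        d = min(d, N_UNITS - d)
        out.append(math.exp(-d * d / (2 * SIGMA ** 2)))
    return out


rng = random.Random(0)
# Assumption: the seed component is one fixed random vector shared by every cache.
seed_vector = [rng.gauss(0.0, 1.0) for _ in range(N_UNITS)]


def cache_vector(center, gain):
    """Place component plus the shared seed component (barcode omitted)."""
    return [p + gain * s for p, s in zip(place_vector(center), seed_vector)]


near, far = 0, N_UNITS // 2   # two sites on opposite sides of the ring
corr_place_only = pearson(place_vector(near), place_vector(far))
corr_with_seed = pearson(cache_vector(near, SEED_GAIN), cache_vector(far, SEED_GAIN))
print(f"place only: {corr_place_only:.2f}, place + seed: {corr_with_seed:.2f}")
```

      In this toy setting the place-only correlation between opposite sites is negative, and adding the shared seed component raises it; how large the shift is depends on `SEED_GAIN`, mirroring the dependence on relative input magnitude described above.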

      To clarify the manuscript, we made the following modifications:

      Section 2.2, first paragraph: We model the hippocampus as a recurrent neural network (RNN) (Alvarez and Squire, 1994; Tsodyks, 1999; Hopfield, 1982) and propose that recurrent dynamics can generate barcodes from place inputs. As in experiments, the model’s population activity during a cache should exhibit both place and barcode activity components.

      Section 2.3, paragraph 3 (starting at “Memory recall in our network…”): As an example, consider a scenario in which an animal has already formed a memory at some location $l$, resulting in the storage of an attractor $\vec{a}$ into the RNN. The attractor $\vec{a}$ can be thought of as a linear combination of place input-driven activity $p(l)$, seed input-driven activity $s$, and a recurrent-driven barcode component $b$. Later, the animal returns to the same location and attempts recall (i.e. sets $r = 1$, Figure 3B). Place inputs for $l$ drive RNN activity towards $p(l)$, which is partially correlated with attractor $\vec{a}$, and the recurrent dynamics cause network activity to converge onto attractor $\vec{a}$. In this way, barcode activity $b$ is reactivated as part of attractor $\vec{a}$, along with the place and seed components stored in the attractor state, $p(l)$ and $s$. The seed input can also affect recall, as discussed in the following section.

      The insights obtained from the network model for the computational role of barcode activity could be explained more clearly. The introduction starts by laying out the indexing theory, which proposes that the hippocampus links an index with each memory so that the memory is reactivated when the index is presented. The experimental paper suggests that the barcode activations play the role of indexes. Yet, in the model, reactivations of memories are driven not by presenting barcode activity, but by presenting place activity (Cache Presence task) or seed activity (Cache Location task). So it seems that either place activity or seed activity plays the role of indexes. Section 2.5 nicely shows that ultimately the role of barcode activity is to decorrelate attractors, which seems different from playing the role of indexes. I feel it would be useful that the Discussion reassess more critically the relationship between barcodes, indexing theory, and key-value architectures.

      The reviewer highlights a failure on our part to clearly identify the connection between our findings on barcodes, indexing theory, and key-value architectures. This is another major component of the paper, and below we propose changes to the manuscript to clarify these concepts and their relationships. First, we will summarize the key points that were unclear in our original manuscript.

      The reviewer equates the concept of an ‘index’ with that of a ‘query’: the signal that drives memory reactivation. This may be intuitive, but it is not how a memory index was defined in indexing theory (e.g. Teyler & DiScenna 1986). In indexing theory, the index is a pattern of hippocampal activity that is (a) generated during memory formation, (b) separate from the activity encoding memory content, and (c) linked to memory content via associative plasticity. After memory formation, a memory might be queried by activating a partial set of the memory contents, which would then drive reactivation of the hippocampal index, leading to pattern completion of memory contents. See, for example, figure 1 of Teyler and DiScenna 1986. The ‘index’ is thus not the same as the ‘query’ that drives recall.
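      The distinction between index and query can be made concrete with a classical binary Hopfield network. This is a minimal sketch, not the rate-based model of the paper: each stored pattern is split into a "content" half and an "index" half, the query presents the content half only, and pattern completion reactivates the index. The sizes and the content/index split are hypothetical choices for illustration.

```python
import random

rng = random.Random(1)
N_CONTENT, N_INDEX, N_MEMORIES = 100, 100, 3   # hypothetical sizes
N = N_CONTENT + N_INDEX

# Each memory is a +/-1 pattern: the first half is "content", the second half "index".
memories = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(N_MEMORIES)]

# Hebbian outer-product weights without self-connections (associative plasticity).
W = [[0.0] * N for _ in range(N)]
for m in memories:
    for i in range(N):
        for j in range(N):
            if i != j:
                W[i][j] += m[i] * m[j] / N


def recall(state, steps=5):
    """Synchronous sign updates; a unit with exactly zero drive keeps its value."""
    for _ in range(steps):
        drive = [sum(W[i][j] * state[j] for j in range(N)) for i in range(N)]
        state = [1 if h > 0 else -1 if h < 0 else s for h, s in zip(drive, state)]
    return state


target = memories[0]
query = target[:N_CONTENT] + [0] * N_INDEX   # the query is content only; index units are silent
completed = recall(query)
index_recovered = completed[N_CONTENT:] == target[N_CONTENT:]
print("index reactivated from a content-only query:", index_recovered)
```

      Here the query never contains the index, yet the index is part of what is recalled, which is the sense in which an index differs from a query in the description above.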

      We propose in this work that barcode activity is such an index. Indexing theory originally posited that memory content was encoded by neocortex, and memory index was encoded by hippocampus. However the experiments of Chettih et al. 2024 revealed that the hippocampus contained both memory content and memory index signals, and furthermore there was no division of cells into ‘content’ and ‘index’ subtypes. Thus our model drops the assumption of earlier work that index and content signals correspond to different neurons in different brain areas—a significant advance of our work. Otherwise, the experimentally observed barcodes and the barcodes generated by our computational model play the role of indices as originally defined.

      Our original manuscript was unclear on the relationship of indexing theory and key-value systems. Our work connects diverse areas of memory models, including attractor dynamics, key-value memory systems, and memory indexing. A full account of these literatures and their relationships may be beyond the scope of this manuscript, and we note that a recent review article (Gershman, Fiete, and Irie, 2025) further clarifies the relationship between key-value memory, indexing theory, and the hippocampus. We will cite this work in our discussion as a source for the interested reader.

      Briefly, a key-value memory system distinguishes between the address where a memory is stored, the ‘key’, and the content of that memory, the ‘value’. An advantage of such systems is that keys can be optimized for purposes independent of the value of each memory. The use of barcodes in our model to decorrelate memories is related to this optimization of keys in key-value memory systems. By generating barcodes and adding this to the attractor state corresponding to a cache memory, the ‘address’ of the memory in population activity is differentiated from other memories. Our work is thus consistent with the idea that hippocampus generates keys and implements a key storage system. However it is not so straightforward to equate barcodes with keys, as they are defined in key-value memory. As the reviewer points out, memory recall can be driven by location and seed inputs, i.e. it is content-addressable. We think of the barcode as modifying the memory address to better separate similar memories, without changing memory content, and the resulting memory can be recalled by querying with either content or barcode. Given the complex and speculative nature of these relationships, we prefer to note the salient connection of our work with ongoing efforts applying the key-value framework to biological memory, and leave the precise details of this connection to future work.

      We make the following changes in the manuscript to clarify these ideas:

      Introduction, first paragraph: In this scheme, during memory formation the hippocampus generates an index of population activity, and the neurons representing this index are linked with the neurons representing memory content by associative plasticity. Later, re-experience of partial memory contents may reactivate the index, and reactivation of the index drives complete recall of the memory contents.

      Discussion, 4th paragraph on key-value: Interestingly, prior theoretical work has suggested neural implementations for both key-value memory and attention mechanisms, arguing for their usefulness in neural systems such as long term memory (Kanerva, 1988; Tyulmankov et al., 2021; Bricken and Pehlevan, 2021; Whittington et al., 2021; Kozachkov et al., 2023; Krotov and Hopfield, 2020; Gershman et al., 2025). In this framework, the address where a memory is stored (the key) may be optimized independently of the value or content of the memory. In our model, barcodes improve memory performance by providing a content-independent scaffold that binds to memory content, preventing memories with overlapping content from blurring together. Thus barcodes can be considered as a change in memory address, and our model suggests important connections between recurrent neural activity and key generation mechanisms. However, we note that barcodes should not be literally equated with keys in key-value systems, as our model’s memory is ‘content-addressable’: it can be queried by place and seed inputs.

      The model includes a number of non-standard ingredients. It would be useful to explain which of these ingredients and which of the described mechanisms are essential for the studied phenomenon. In particular:

      - the dynamics in Eq.2 include a shunting inhibition term. Is it essential and why?

      The shunting inhibition is important as it acts to normalize the network activity to prevent runaway excitation. We hope to clarify this further by amending the following sentence in section 2.2: “g (·) is a leak rate that depends on the average activity of the full network, representing a form of global shunting inhibition that normalizes network activity to prevent runaway excitation from recurrent dynamics.”

      - same question for the global inhibition included in the random connectivity;

      The distribution from which connectivity strengths are drawn has a negative mean (global inhibition). This causes activity during caching (i.e. r = 1) to be sparser than activity during visits (i.e. r = 0), and was chosen to match experimental findings. In figures 2B and S2B we show that our model can transition between a mode with place code only, barcode only, or a mode containing both, by changing the variance of the weight distribution while holding the mean constant. We suggest clarifying this by editing the following in section 2.2, paragraph 2: “We initialize the recurrent weights from a random Gaussian distribution, where $N_X$ is the number of RNN neurons and $\mu < 0$, reflecting global subtractive inhibition that encourages sparse network activity to match experimental findings (Chettih et al. 2024).”

      - the model is fully rate-based, but for certain figures, spikes are randomly generated. This seems superfluous.

      Spikes are simulated for one analysis and one visualization, where it is important to consider noise or variability in neural responses across trials. First, for Fig. 2H,J, we generated spikes to allow a visual comparison to figures that can be easily generated from experimental data. Second, and more significantly, for the analysis underlying Fig. 3D, it is essential to simulate variability in neural responses. Because our rate-based models are noiseless, the RNN’s rate vector at site distance = 0 will always be the same and result in a correlation of 1 for both visit-visit and cache-retrieval. However, we show that, if one interprets the rate as a noisy Poisson spiking process, the correlation at site distance = 0 between a cache-retrieval pair is higher than that of two visits. This is because under a Poisson spiking model, the signal-to-noise ratio is higher for cache-retrieval activity, where rates are higher in magnitude. The greater correlation for a cache-retrieval pair at the same site, relative to visits at the same site, is an experimental finding that was critical for our model to reproduce. We detail clarifications to the manuscript below in response to the reviewer’s following and related question.
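      The signal-to-noise argument above can be checked numerically. The sketch below is an illustration under stated assumptions rather than the analysis code behind Fig. 3D: it draws two independent Poisson spike-count vectors from the same rate vector and compares the repeat correlation for a low-rate pattern against the same pattern scaled up. The rate range, the factor of 5, and all function names are hypothetical.

```python
import math
import random


def poisson_sample(rate, rng):
    """Draw one Poisson count with Knuth's multiplication method (suitable for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def mean_repeat_correlation(rates, rng, n_pairs=200):
    """Average correlation between two independent spike-count draws from the same rates."""
    total = 0.0
    for _ in range(n_pairs):
        a = [poisson_sample(r, rng) for r in rates]
        b = [poisson_sample(r, rng) for r in rates]
        total += pearson(a, b)
    return total / n_pairs


rng = random.Random(0)
low_rates = [rng.uniform(0.2, 2.0) for _ in range(300)]   # hypothetical "visit-like" rates
high_rates = [5.0 * r for r in low_rates]                 # same pattern at higher magnitude

corr_low = mean_repeat_correlation(low_rates, rng)
corr_high = mean_repeat_correlation(high_rates, rng)
print(f"repeat correlation, low rates: {corr_low:.2f}, high rates: {corr_high:.2f}")
```

      Because Poisson variance equals the mean, scaling the rates by a factor k multiplies the across-unit signal variance by k squared but the noise variance only by k, so the higher-rate pattern yields the higher same-site correlation even though the underlying rate vectors are identical up to scale.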

      How are the correlations determined in the model (e.g., Fig 2 B)? The methods explain that they are computed from Poisson-generated spikes, but over which time period? Presumably during steady-state responses, but are these responses time-averaged?

      The reviewer points out a lack of clarity in our original manuscript. Correlations for events (caches, retrievals and visits) at different sites are calculated in two sections of the paper (2B, 3D), for different purposes and with slight differences in methods:

      - For Figure 2B, no spikes are simulated. Note that the methods mentioning Poisson spike generation specify only Fig. 2H,J and Fig. 3D. We simply take the network’s rate vector at timestep t=100 (when the decorrelating effect of chaotic dynamics has saturated, S1A-B) and correlate this vector when generated at different locations. We now clarify this in the legend for Figure 2B: “We show correlation of place inputs (gray) and correlation of the RNN's rate vector at t = 100 (black).”

      - For Figure 3D, we want to compare the model to empirical results from Chettih et al. 2024, reproduced in this paper in Fig. 1E-F. These empirical results are derived from correlating vectors of spiking activity on pairs of single trials, and are thus affected by noise or variability in neural responses as described in our response to the reviewer’s previous question. We thus took the RNN’s rate vector at t=100 and simulated spiking data by drawing samples from a Poisson distribution to get spike counts. Our original manuscript was unclear about this, and we suggest the following changes:

      - Legend for Figure 3D: D. Correlation of Poisson-generated spikes simulated from RNN rate vectors at two sites, plotted as a function of the distance between the two sites.

      - Section 2.3, last paragraph: Population activity during retrieval closely matches activity during caching, and is substantially decorrelated from activity during visits (Figure 3C). To compare our model with the empirical results reproduced in Figure 1E,F, we ran in silico experiments with caches and retrievals at varying sites in the circular arena. We simulated Poisson-generated spikes drawn from our network's underlying rates to match the intrinsic variability in empirical data (see Methods).

      - Methods, subsection Spatial correlation of RNN activity for cache-retrieval pairs at different sites: To calculate correlation values as in Figure 3D, we simulated experiments where 5 sites were randomly chosen for caching and retrieval. To compare model results to the empirical data in Fig. 1E,F, which includes intrinsic neural variability, we sampled Poisson-generated spike counts from the rates output by our model. Specifically, for RNN activity $\vec{r_i}$ at location $i$, using the rates at t=100 as elsewhere, we first generate a sample vector of spikes…

      I was confused by early and late responses in Fig 2 C. The text says that the activity is initialized at zero, so the response at t=0 should be flat (and zero). More generally, I am not sure I understand why the dynamics matter for the phenomenon at all, presumably the decorrelation shown in Fig 2B depends only on steady state activity (cf previous question).

      Thanks for catching this mistake. The legend has been updated to indicate that the ‘early’ response is actually at t=1, when network activity reflects place inputs without the effects of dynamics. The reviewer is correct that we are primarily interested in the ‘late’ response of the network. All other results in the paper use this late response at t=100. As shown in Fig. S2A,B, this timepoint is not truly a steady state, as activity in the network continues to change, but the decorrelation of network activity with place-driven activity has saturated.

      We include the early response in Fig. 2C for visual comparison of the purely place-driven early activity with the eventual network response. It is also relevant since, as the reviewer points out above, there is a shunting inhibition term in the dynamics that is present during both low and high recurrent strength simulations.

      Related to the previous point, the discussion of decorrelation (l.79 - 97) is somewhat confusing. That paragraph focuses on chaotic activity, but chaos decorrelates responses across different time points. Here the main phenomenon is the decorrelation of responses across different spatial inputs (Fig 2B). This decorrelation is presumably due to the fact that different inputs lead to different non-trivial steady-state responses, but this requires some clarification. If that is correct, the temporal chaos adds fluctuations around these non-trivial steady-state responses, but that alone would not lead to the decorrelation shown in Fig 2B.

      We agree with the reviewer that chaotic activity produces a decorrelation across time points. Because of chaotic dynamics, network activity does not settle into a trivial steady-state, and instead evolves from the initial state in an unpredictable way. The network does not settle into a steady-state pattern, but both the decorrelation of network state with initial state and the rate of change in the network state saturate after ~t=25 timesteps, as shown in Fig. S2A-B.

      The initial activity for nearby states is similar, due to them receiving similar place inputs. Because network activity is chaotically decorrelated from this initial state by temporal dynamics, ‘late stage’ network activity between nearby spatial states is less correlated than ‘early stage’ activity. Thus the temporal decorrelation produces a spatial decorrelation. We believe that the changes we have introduced to the manuscript in revision will make this point clearer in our resubmission.
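      The link between temporal and spatial decorrelation can be sketched with a generic random network. The following is a toy under stated assumptions and not the model from the manuscript: it uses a discrete-time tanh network with no external input and no shunting inhibition, and the values of `N`, `GAIN` and `STEPS` are hypothetical, with the gain chosen large enough that this kind of network is expected to be chaotic. Two initial states that are highly correlated, standing in for nearby sites, are run through the same dynamics.

```python
import math
import random

N = 200       # hypothetical network size
GAIN = 3.0    # hypothetical recurrent gain, large enough for chaotic dynamics
STEPS = 40    # number of update steps to simulate

rng = random.Random(0)
# Random Gaussian recurrent weights with variance 1/N.
W = [[rng.gauss(0.0, 1.0 / math.sqrt(N)) for _ in range(N)] for _ in range(N)]


def step(x):
    """One synchronous update of the rate vector."""
    return [math.tanh(GAIN * sum(w * xj for w, xj in zip(row, x))) for row in W]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


# Two "nearby sites": initial states that share most of their structure.
x = [rng.gauss(0.0, 1.0) for _ in range(N)]
y = [xi + 0.3 * rng.gauss(0.0, 1.0) for xi in x]

correlations = [pearson(x, y)]
for _ in range(STEPS):
    x, y = step(x), step(y)
    correlations.append(pearson(x, y))

print(f"early correlation: {correlations[0]:.2f}, late correlation: {correlations[-1]:.2f}")
```

      The two trajectories start out almost identical and end up far less correlated, which is the sense in which sensitivity to initial conditions over time translates into decorrelation across similar inputs.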

      A key ingredient of the model is that the recurrent interactions are switched on and off between "caching" and "visits". The discussion argues that a possible mechanism for this is recurrent inhibition (l.320), which would need to be added. However two forms of inhibition are already included in the model. The text also says that it is unclear how units in the model should be mapped onto E and I neurons. However the model makes explicit assumptions about this, in particular by generating spikes from individual neurons. Altogether, I did not find that part of the Discussion convincing.

      We agree with the reviewer that this section addresses a limitation of our current work, and in fact it is an area of ongoing research. However, we think the advances in this current work warrant publication despite this topic requiring further research. We attempted to discuss this limitation explicitly, and note that the other reviewer pointed this section out as particularly helpful. We do not think it is problematic for a realistic model of the brain to ultimately include three, or even more, forms of inhibition. We do not think that Poisson-generated spikes commit us to interpreting network units as single neurons. Spikes are not a core part of our model’s mechanism, and were used only as a means of introducing variability on top of deterministic rates for specific analyses. Furthermore, one could still view network units as pools of both E and I spiking neurons. We would welcome further recommendations the reviewer believes are important to note in this section on our model’s limitations.

      On lines 117-120 the text briefly mentions an alternate feed-forward model and promptly discards it. The discussion instead says that a "separate possibility is that barcodes are generated in a circuit upstream of where memories are stored, and supplied as inputs to the hippocampal population", and that this possibility would lead to identical conclusions. The two statements seem a bit contradictory. It seems that the alternative possibility would replace the need for switching on and off recurrent interactions, with a mechanism where barcode inputs are switched on and off. This alternate scenario is perhaps more plausible, so it would be useful to discuss it more explicitly.

      We apologize for the confusion here, which seems to be due to our phrasing in the discussion section. We do reject the idea that a simple feed-forward model could generate the spatial correlation profile observed in data, as mentioned in the text and included as Fig. S2. Our statement in the discussion may have seemed contradictory because here we intended to discuss the possibility that an upstream area generates barcodes, for example by the chaotic recurrent dynamics proposed in our work, while a downstream network receives these barcodes as inputs and undergoes plasticity to store memories as attractors. We did not intend to suggest any connection to the feedforward model of barcode generation, and apologize for the confusion. Our claim that this ‘2 network’ solution would lead to similar conclusions is because the upstream network would need an efficient means of barcode generation, and the downstream network would need an efficient means of storing memory attractors, and separating these functions into different networks is not likely to affect, for example, the advantage of partially decorrelating memory attractors. Moreover, the downstream network would still require some form of recurrent gating, so that during visits it exhibits place activity without activating stored memory attractors.

      We thus chose a 1 network instead of a 2 network solution because it was simpler and, we believe, more interesting. It is challenging in the absence of more data to say which is more plausible, thus we wanted to mention the possibility of a 2 network solution. We suggest the following changes to the manuscript:

      - Discussion, 3rd paragraph: “Alternatively, other mechanisms may be involved in generating barcodes. We demonstrated that conventional feed-forward sparsification (Babadi and Sompolinsky, 2014; Xie et al., 2023) was highly inefficient, but more specialized computations may improve this (Földiak, 1990; Olshausen and Field, 1996; Sacouto and Wichert, 2023; Muscinelli et al., 2023). Another possibility is that barcodes are generated in a separate recurrent network upstream of the recurrent network where memories are stored. In this 2-network scenario, the downstream network receives both spatial tuning and barcodes as inputs. This would not obviate the need for modulating recurrent strength in the downstream network to switch between input-driven modes and attractor dynamics. We suspect separating barcode generation and memory storage in separate networks would not fundamentally affect our conclusions.”

      As a minor note, the beginning of the discussion states that the presented model is similar to previous recurrent network models of the hippocampus. It would be worth noting that several of the cited works assign a very different role to recurrent interactions: they generate place cell activity, while the present model assumes it is inherited from upstream inputs.

      We are not sure how best to modify the paper to address this suggestion. As far as we know, all of the cited models which deal with spatial encoding do assume that the hippocampus receives a spatially-modulated or spatially-tuned input. For example, the Tsodyks 1999 paper cited in this paragraph uses exponentially-decaying place inputs to each neuron, highly similar to those in our model. Furthermore, we explore how our model would perform if we change the format of spatial inputs in Fig. S4, and find key results are unchanged. It is unclear how hippocampal place fields could emerge without inputs that differentiate between spatial locations. We think it is appropriate to highlight the similarity of our model to well-known Hopfield-type recurrent models, where memories are stored as attractor states of the network dynamics.

      On the other hand, we agree that a common line of hippocampal modeling proposes that recurrent interactions reshape spatial inputs to produce place fields. This often arises in the context of hippocampus generating a predictive map, where inputs may be one-hot for a single spatial state, in a grid cell-like format, or a random projection of sensory features. We attempted to address this in section 2.6, using a model which superimposes the random connectivity needed for barcode generation with the structured connectivity needed for predictive map formation. We found that such a model was able to perform both predictive and barcode functions, suggesting a path forward to connecting different lines of hippocampal modeling in future work.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Conceptually, I feel that the authors addressed many concerns. However, I am still not convinced that their data support the strength of their claims. Additionally, I spent considerable time investigating the now freely available code and data and found several inconsistencies that would be critical to rectify. My comments are split into two parts, reflecting concerns related to the responses/methods and concerns resulting from investigation of the provided code/data. The former is described in the public review above. Because I show several figures to illustrate some key points for the latter part, an attached file will provide the second part: https://elife-rp.msubmit.net/elife-rp_files/2025/02/24/00136468/01/136468_1_attach_15_2451_convrt.pdf

      (1) This point is discussed in more detail in the attached file, but there are some important details regarding the identification of the learned trial that require more clarification. For instance, isn’t the original criterion by Gibbon et al. (1977) the first “sequence of three out of four trials in a row with at least one response”? The authors’ provided code for the Wilcoxon signed rank test and nDkl thresholds looks for a permanent exceeding of the threshold. So, I am not yet convinced that the approaches used here and in prior papers are directly comparable.

      We agree that there remain unresolved issues with our two attempts to create criteria that match that used by Gibbon and Balsam for trials to criterion. Therefore, we have decided to remove those analyses and return to our original approach showing trials to acquisition using several different criteria so as to demonstrate that the essential feature of the results—the scaling between learning rate and information—is robust. Figure 2A shows the results for a criterion that identifies the trial after which the cumulative response rate during the CS (=cumulative CS response count from Trial 1 divided by cumulative CS time from Trial 1) is consistently above the cumulative overall response rate across the trial (i.e., including both the CS and ITI). These data compare the CS response rate with the overall response rate, rather than with ITI rate as done in the previous version (in Figure 3A of that submission), to be consistent with the subsequent comparisons that are made using the nDkl. (The nDkl relies on the comparison between the CS rate and the overall rate, rather than between the CS and ITI rates.) Figures 2B and 2C show trials to acquisition when two statistical criteria, based on the nDkl, are applied to the difference between CS and overall response rates (the criteria are for odds >= 4:1 and p<.05). As we now explain in the text, a statistical threshold is useful inasmuch as it provides some confidence to the claim that the animals had learned by a given trial. However, this trial is very likely to be after the point when they had learned because accumulating statistical evidence of a difference necessarily adds trials.
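
      The first criterion described in this response can be sketched in a few lines. This is an illustration only, not the authors' MATLAB code: the function name, the per-trial input lists, and the toy numbers are assumptions. Only the definition of the two cumulative rates and the "consistently above" rule are taken from the text.

```python
from itertools import accumulate


def first_trial_cs_above_overall(cs_counts, cs_durations, iti_counts, iti_durations):
    """Return the 1-based trial from which the cumulative CS response rate
    stays above the cumulative overall (CS + ITI) rate through the final
    trial, or None if that never happens."""
    cum_cs_n = list(accumulate(cs_counts))
    cum_cs_t = list(accumulate(cs_durations))
    cum_all_n = list(accumulate(c + i for c, i in zip(cs_counts, iti_counts)))
    cum_all_t = list(accumulate(c + i for c, i in zip(cs_durations, iti_durations)))

    above = [
        n_cs / t_cs > n_all / t_all
        for n_cs, t_cs, n_all, t_all in zip(cum_cs_n, cum_cs_t, cum_all_n, cum_all_t)
    ]

    # Walk backwards from the last trial while the CS rate remains above.
    first = None
    for trial in range(len(above), 0, -1):
        if not above[trial - 1]:
            break
        first = trial
    return first


# Hypothetical toy session: six trials, 10-s CS, 90-s ITI.
cs_counts = [0, 1, 0, 0, 3, 4]
iti_counts = [2, 1, 8, 1, 1, 1]
trial = first_trial_cs_above_overall(cs_counts, [10] * 6, iti_counts, [90] * 6)
print(trial)
```

      In this toy session the cumulative CS rate is briefly higher on Trial 2, falls back on Trials 3 and 4, and stays higher from Trial 5 onwards, so the function returns 5 rather than 2.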

      Also, there’s still no regression line fitted to their data (Fig 3’s black line is from Fig 1, according to the legends). Accordingly, I think the claim in the second paragraph of the Discussion that the old data and their data are explained by a model with “essentially the same parameter value” is not yet convincing without actually reporting the parameters of the regression. Related to this, the regression for their data based on my analysis appears to have a slope closer to -0.6, which does not support strict timescale invariance. I think that this point should be discussed as a caveat in the manuscript.

      We now include regression lines fitted to our data in Figures 2A-C, and their slopes are reported in the figure note. We also note on page 14 of the revision that these regressions fitted to our data diverge from the black regression line (slope -1) as the informativeness increases. On pages 14-15, we offer an explanation for this divergence: in groups with high informativeness, the effective informativeness is likely to be lower than the assigned value because the rats had not been magazine trained, which means they would not have discovered the food pellet as soon as it was released on the first few trials. On pages 15-16, we go on to note that evidence for a change in response rate during the CS in those very first few trials may have been missed because the initial response rates were very low in rats trained with very long inter-reinforcement intervals (and thus high informativeness). We also propose a solution to this problem of comparing between very low response rates, one that uses the nDkl to parse response rates into segments (clusters of trials with equivalent response rates). This analysis with parsed response rates provides evidence that differential responding to the CS may have been acquired earlier than is revealed using trial-by-trial comparisons.

      (2) The authors report in the response that the basis for the apparent gradual/multiple step-like increases after initial learning remains unclear within their framework. This would be important to point out in the actual manuscript. Further, the acknowledgment in the responses that some phenomena are not captured by the current model should also be stated in the manuscript itself.

      We have included a paragraph (on page 26) that discusses the interpretation of the steady/multi-step increase in responding across continued training.

      (3) There are several mismatches between results shown in figures and those produced by the authors’ code, or other supplementary files. As one example, rat 3 results in Fig 11 and Supplementary Materials don’t match and neither version is reproduced by the authors’ code. There are more concerns like this, which are detailed in the attached review file.

      Addressed next….

      The following is the response to the points raised in Part 2 of Reviewer 1’s pdf.

      (1a) I plotted the calculated nDkl with the provided code for rat 3 (Fig 11), but it looks different, and the trials to acquisition also didn’t match with the table provided (average of ~20 trial difference). The authors should revise the provided code and plots. Further, even in their provided figures, if one compares rat 3 in Supplementary Materials to data from the same rat in Fig 11, the curves are different. It is critical to have reproducible results in the manuscript, including the ability to reproduce with the provided code.

      We apologise for those inconsistencies. We have checked the code and the data in the figures to ensure they are all now consistent and match the full data in the nHT.mat file in OSF. Figures 11 and 12 from the previous version are now replaced with Figure 6 in the revised manuscript (still showing data from Rats 3 and 176). The data plotted in Fig 6 match what is plotted in the supplementary figures for those 2 rats (but with slightly different cropping of the x-axes) and all plots draw directly from nHT.mat.

      (1b) I tried to replicate also Fig 3C with the results from the provided code, but I failed especially for nDkl > 2.2. Fig 3A and B look to be OK.

      There was an error in the previous Fig 3C, which plotted the data from the wrong column of the Trials2Acquisition Table. We suspect this arose because some changes to the file were not updated in Dropbox. However, that figure has changed (now Figure 2) as already mentioned, and no longer plots data obtained with that specific nDkl criterion. The figure now shows criteria that do not attempt to match the Gibbon and Balsam criterion.

      (1c) The trials to learn from the code do match with those in the Trials2Acquisition Table, but the authors’ code doesn’t reproduce the reported trials to learn values in the nDkl Acquisition Table. The trials to learn from the code are ~20 trials different on average from those in the table, for 1:20, 1:100, and 1:1000 nDkl.

      We agree that discrepancies between those different files were a source of potential confusion because they were using different criteria or different ways of measuring response rate (i.e., the “conventional” calculation of rate as number of responses/time, vs our adjusted calculation in which the 1<sup>st</sup> response in the CS was excluded as well as the time spent in the magazine, vs parsed response rates based on inter-response intervals). To avoid this, there is now a single table called Acquisition_Table.xlsx in OSF that includes trials to acquisition for each rat based on a range of criteria or estimates of response rate in labelled columns. The data shown in Figure 2 are all based on the conventional calculation of response rate (provided in Columns E to H of Acquisition_Table.xlsx). To make the source of these data explicit, we have provided in OSF the MATLAB code that draws the data from the nHT.mat file to obtain these values for trials-to-acquisition.

      (1d) The nDkl Acquisition Table has columns with the value of the nDkl statistics at various acquisition landmarks, but the value does not look to be true, especially for rat 19. The nDkl curve provided by the authors (Supplementary Materials) doesn’t match the values in the table. The curve is below 10 until at least 300 trials, while the table reports a value higher than 20 (24.86) at the earliest evidence of learning (~120 trials?).

      We are very grateful to the reviewer for finding this discrepancy in our previous files. The individual plots in the Supplementary Materials now contain a plot of the nDkl computed using the conventional calculation of response rate (plot 3 in each 6-panel figure) and a plot of the nDkl computed using the new adjusted calculation of response rate (plot 4). These correspond to the signed nDkl columns for each rat in the full data file nHT.mat. The nDkl values at different acquisition landmarks included in Acquisition_Table.xlsx (Cols AB to AF) correspond to the second of these nDkl formulations. We point out that, of the acquisition landmarks based on the conventional calculation of response rate (Cols E to J of Acquisition_Table.xlsx), only the first two landmarks (CSrate>Contextrate and min_nDkl) match the permanently positive and minimum values of the plotted nDkl values. This is because the subsequent acquisition landmarks are based on a recalculation of the nDkl starting from the trial when CSrate>ContextRate, whereas the plotted nDkl starts from Trial 1.

      (2) The cumulative number of responses during the trial (Total) in the raw data table is not measured directly, but indirectly estimated from the pre-CS period, as (cumNR_Pre*[cumITI/cumT_Pre])+ cumNR_CS (cumNR_Pre: cumulative nose-poke response number during pre-CS period; cumITI: cumulative sum of ITI duration; cumT_Pre: cumulative pre-CS duration; cumNR_CS: cumulative response number during CS), according to ‘Explanation of TbyTdataTable (MATLAB).docx’. Why not use the actual cumulative responses during the whole trial instead of using a noisier measure during a smaller time window and then scaling it for the total period?

      Unfortunately, the bespoke software used to control the experimental events and record the magazine activity did not record data continuously throughout the experiment. The ITI responses were only sampled during a specified time-window (the “pre-CS” period) immediately before each CS onset. Therefore, response counts across the whole ITI had to be extrapolated.
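
      The extrapolation quoted by the reviewer, (cumNR_Pre*[cumITI/cumT_Pre]) + cumNR_CS, can be written out as a short function. This is a sketch for illustration: the Python names mirror the variable names in the quoted formula, and the numbers in the example are hypothetical.

```python
def estimated_total_responses(cum_nr_pre, cum_iti, cum_t_pre, cum_nr_cs):
    """Estimate cumulative responses over the whole trial (ITI + CS).

    Responses sampled in the pre-CS window are scaled up by the ratio of
    cumulative ITI time to cumulative pre-CS time, then the cumulative CS
    responses (which were recorded directly) are added.
    """
    if cum_t_pre <= 0:
        raise ValueError("cumulative pre-CS duration must be positive")
    return cum_nr_pre * (cum_iti / cum_t_pre) + cum_nr_cs


# Hypothetical values: 6 pre-CS responses in 100 s of pre-CS time,
# 900 s of cumulative ITI, and 12 responses during the CS.
total = estimated_total_responses(cum_nr_pre=6, cum_iti=900, cum_t_pre=100, cum_nr_cs=12)
print(total)
```

      With these numbers the six sampled responses are scaled by a factor of 9, which shows why any sampling noise in the pre-CS window is amplified in the estimated total.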

      (3) Regarding the “Matlab code for Find Trials to Criterion.docx”:

      (a) What’s the rationale for not using all the trials to calculate nDkl but starting the cumulative summation from the earliest evidence trial (truncated)? Also, this procedure is not described in the manuscript, and this should be mentioned.

      The procedure was perhaps not described clearly enough in the previous manuscript. We have expanded that text to make it clearer (page 12) which includes the text…

      “We started from this trial, rather than from Trial 1, because response rate data from trials prior to the point of acquisition would dilute the evidence for a statistically significant difference in responding once it had emerged, and thereby increase the number of trials required to observe significant responding to the CS. The data from Rat 1 illustrates this point. The CS response rate of Rat 1 permanently exceeded its overall response rate on Trial 52 (when the nD<sub>KL</sub> also became permanently positive). The nD<sub>KL</sub>, calculated from that trial onwards, surpassed 0.82 (odds 4:1) after a further 11 trials (on Trial 63) and reached 1.92 (p < .05) on Trial 81. By contrast, the nD<sub>KL</sub> for this rat, calculated from Trial 1, did not permanently exceed 0.82 until Trial 83 and did not exceed 1.92 until Trial 93, adding 10 or 20 trials to the point of acquisition.”

      (3b) The authors' threshold is the trial when the nDkl value exceeds the threshold permanently.  What about using just the first pass after the minimum?

      Rat 19 provides one example where the nDkl was initially positive, and even exceeded threshold for odds 4:1 and p<.05, but was followed by an extended period when the nDkl was negative because the CS response rate was less than the overall response rate. It illustrates why the first trial on which the nDkl passes a threshold cannot be used as a reliable index of acquisition.
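
      The difference between a "first pass" rule and a "permanent" rule can be made concrete with a small sketch. The signed series below is invented for illustration (it is not data from any rat), and both function names are assumptions; the threshold of 0.82 is the odds 4:1 value quoted in this response.

```python
def first_crossing(values, threshold):
    """1-based index of the first trial whose value exceeds the threshold."""
    for trial, value in enumerate(values, start=1):
        if value > threshold:
            return trial
    return None


def permanent_crossing(values, threshold):
    """1-based index of the trial from which the value stays above the
    threshold through the end of the series, or None."""
    first = None
    for trial in range(len(values), 0, -1):
        if values[trial - 1] <= threshold:
            break
        first = trial
    return first


# Hypothetical signed nDkl series: early positive excursion, then a
# negative stretch, then a sustained rise.
signed_ndkl = [0.3, 1.1, 2.0, -0.5, -1.2, -0.4, 0.2, 0.9, 1.5, 2.3]
early = first_crossing(signed_ndkl, 0.82)
stable = permanent_crossing(signed_ndkl, 0.82)
print(early, stable)
```

      Here the first-pass rule fires on Trial 2, during the early excursion, whereas the permanent rule waits until Trial 8, after the series has stopped dipping below threshold.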

      (3c) Can the authors explain why a value of 0.5 is added to the cumulative response number before dividing it by the cumulative time?

      This was done to provide an “unbiased” estimate of the response count because responses are integers. For example, if a rat has made 10 responses over 100 s of cumulative CS time, the estimated rate should be at least 10/100 but could be anything up to, but not including, 11/100. A rate of 10.5/100 is the unbiased estimate. However, we have now removed this step when calculating the nDkl to identify trials to acquisition because we recognise that it would represent a larger correction to the rate calculated across short intervals than across long intervals and therefore bias comparison between CS and overall response rates that involve very different time durations. As such, the correction would artefactually inflate evidence that the CS response rate was higher than the contextual response rate. However, as noted earlier in this reply, we have now instituted a similar correction when calculating the pre-CS response rate over the final 5 sessions for rats that did not register a single response (hence we set their response count to 0.5).
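
      The size of the bias introduced by the +0.5 correction can be checked with simple arithmetic. The sketch below assumes the same underlying rate (0.1 responses/s) measured over a short and a long cumulative interval; the function name and the second pair of numbers are hypothetical, while the 10 responses in 100 s case is the example given above.

```python
def rate_inflation(count, duration, correction=0.5):
    """Return (raw rate, corrected rate, proportional inflation) when a
    constant is added to the response count before dividing by time."""
    if count <= 0 or duration <= 0:
        raise ValueError("count and duration must be positive")
    raw = count / duration
    corrected = (count + correction) / duration
    return raw, corrected, corrected / raw - 1


# Same underlying rate over a short and a long cumulative interval.
short = rate_inflation(10, 100)     # e.g. cumulative CS time
long_ = rate_inflation(100, 1000)   # e.g. cumulative overall time
print(short, long_)
```

      The correction raises the short-interval rate by about 5% but the long-interval rate by only about 0.5%, so a comparison between a CS rate (short cumulative time) and an overall rate (long cumulative time) would be tilted in favour of the CS.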

      (3d) Although the authors explain that nDkl was set to negative if pre-CS rate is higher than CS rate, this is not included in the code because the code calculates the nDkl using the truncated version, starting to accumulate the poke numbers and time from the earliest evidence, thus cumulative CS rate is always higher than cumulative contextual rate. I expect then that the cumulative CS rate will be always higher than the cumulative pre-CS rate.

      Yes, that is correct. The negative sign is added to the nDkl when it is computed starting from Trial 1. But when it is computed starting from the trial when the CS rate is permanently > the overall rate, there is no need to add a sign because the divergence is always in the positive direction.

      (3e) Regarding the Wilcoxon signed rank test, please clarify in the manuscript that the input ‘rate’ is not the cumulative rate as used for the earliest evidence. Please also clarify if the rates being compared for the signed nDkl are just the instantaneous rates or the cumulative ones. I believe that these are the ‘cumulative’ ones (not as for Wilcoxon signed rank test), because if not, the signed nDkl curve of rat 3 would fluctuate a lot across the x-axis.

      The reviewer is correct in both cases. However, as already mentioned, we have removed the analysis involving the Wilcoxon test. The description of the nDkl already specifies that this was done using the cumulative rates.

      (4) Supplemental table ‘nDkl Acquisition Table.xlsx’ 3rd column (“Earliest”) descriptions are unclear.

      (a) It is described in the supplemental ‘Explanation of Excel Tables.docx’ as the ‘earliest estimate of the onset of a poke rate during the CSs higher than the contextual poke rate’, while the last paragraph of the manuscript’s method section says ‘Columns 4, 5 and 6 of the table give the trial after which conditioned responding appeared as estimated in the above described three different ways— by the location of the minimum in the nDkl, the last upward 0 crossings, and the CS parse consistently greater than the ITI parse, respectively. Column 3 in that table gives the minimum of the three estimates.’ I plotted the data from column 3 (right) and comparing them with Fig 3A (left) makes it clear that there’s an issue in this column. If the description in the ‘Explanation of Excel Tables.docx’ is incorrect, please update it.

      We agree that the naming of these criteria can cause confusion, hence we have changed them. On page 9 we have replaced “earliest” with “first” in describing the criterion plotted in Figure 2A showing the trial starting from which the cumulative CS response rate permanently exceeded the cumulative overall rate. What is labelled as “Earliest” in “Acquisition_Table.xlsx” is, as the explanation says, the minimum value across the 3 estimates in that table.

      (b) Also, the term ‘contextual poke rate’ in the 3rd column’s description is confusing as in the nDkl calculation it represents the poke rate during all the training time, while in the first paragraph of the ‘Data analysis’ part, the earliest evidence is calculated by comparing the ITI (pre-CS baseline) poke rate.

      Yes, we have kept the term “contextual” response rate to refer to responding across the whole training interval (the ITI and the CS duration). This is used in calculation of the nDkl. For consistency with this comparison, we now take the first estimate of acquisition (in Fig 2A) based on a comparison between the CS rate and the overall (context) rate (not the pre-CS rate).

      Reviewer #2 (Recommendations for the authors):

      In response to the Rebuttal comments:

      Analytical (1) relating to Figure 3C/D

      This is a reasonable set of alternative analyses, but it is not clear that it answers the original comment regarding why the fit was worse when using a theoretically derived measure. Indeed, Figure 3C now looks distinctly different to the original Gibbon and Balsam data in terms of the shape of the relationship (specifically, the Group Median - filled orange circles) diverge from the black regression line.

      As mentioned in response to Reviewer 1, there was a mistake in Figure 3C of the revised manuscript. The figure was actually plotting data using a more stringent criterion of nDkl > 5.4, corresponding to p<0.001. The figure was referencing the data in column J of the public Trials2Acquisition Table. The data previously plotted in Figure 3C are no longer plotted because we no longer attempt to identify a criterion exactly matching that used by Gibbon and Balsam.
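
      The nDkl thresholds quoted in these responses (0.82 for odds 4:1, 1.92 for p<.05, and 5.4 for p<0.001) can be related to p-values with a short calculation. The sketch below rests on an assumption that is not stated in the responses themselves: that twice the nDkl is approximately chi-square distributed with one degree of freedom under the null hypothesis of equal rates. Reading "odds 4:1" as p = 0.2 is likewise our interpretation. Under those assumptions the tail probability reduces to a complementary error function.

```python
import math


def ndkl_to_p(ndkl):
    """Tail probability for an nDkl value, ASSUMING 2 * nDkl follows a
    chi-square distribution with 1 degree of freedom under the null.

    For X ~ chi-square(1), P(X > x) = erfc(sqrt(x / 2)), so with
    x = 2 * nDkl this simplifies to erfc(sqrt(nDkl)).
    """
    if ndkl < 0:
        raise ValueError("nDkl must be non-negative")
    return math.erfc(math.sqrt(ndkl))


for threshold in (0.82, 1.92, 5.4):
    print(threshold, round(ndkl_to_p(threshold), 4))
```

      Under this assumption the three thresholds map to p-values of roughly 0.2, 0.05, and 0.001, which is consistent with the labels used in the responses.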

      We agree that the data shown in the first 3 panels of Figure 2 do diverge somewhat from the black regression line at the highest levels of informativeness (C/T ratios > 70), and the regression lines fitted to the data have slopes greater than -1. We acknowledge this on page 14 of the revised manuscript. Since Gibbon and Balsam did not report data from groups with such high ratios, we can’t know whether their data too would have diverged from the regression line at this point. We now report in the text a regression fitted to the first 10 groups in our experiment, which have C/T ratios that coincide with those of Gibbon and Balsam, and those regression lines do have slopes much closer to -1 (and include -1 in the 95% confidence intervals). We believe the divergence in our data at the high C/T ratios may be due to the fact that our rats were not given magazine training before commencing training with the CS and food. Because of this, it is quite likely that many rats did not find the food immediately after delivery on the first few trials. Indeed, in subsequent experiments, when we have continued to record magazine entries after CS-offset, we have found that rats can take 90 s or more to enter the magazine after the first pellet delivery. This delay would substantially increase the effective CS-US interval, measured from CS onset to discovery of the food pellet by the rat, making the CS much less informative over those trials. We now make this point on pages 14-15 of the revised manuscript.
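
      The slope comparison discussed here amounts to an ordinary least-squares fit on log-log axes, where a slope of -1 means trials to acquisition are inversely proportional to the C/T ratio. The following sketch uses synthetic values generated from exact power laws purely to show the calculation; none of the numbers are data from this study or from Gibbon and Balsam, and the function name is an assumption.

```python
import math


def loglog_slope(x, y):
    """Ordinary least-squares slope and intercept of log10(y) on log10(x)."""
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic C/T ratios and trials-to-acquisition (hypothetical constant 300).
ratios = [2, 4, 8, 16, 32]
trials_invariant = [300 / r for r in ratios]        # exact slope of -1
trials_shallow = [300 / r ** 0.6 for r in ratios]   # exact slope of -0.6

slope_invariant, _ = loglog_slope(ratios, trials_invariant)
slope_shallow, _ = loglog_slope(ratios, trials_shallow)
print(round(slope_invariant, 3), round(slope_shallow, 3))
```

      A fit restricted to a subset of groups, as described above for the first 10 groups, would simply pass the corresponding subset of ratios and trial counts to the same function.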

      Analytical (2)

      We may have very different views on the statistical and scientific approaches here.

      This scalar relationship may only be uniquely applicable to the specific parameters of an experiment where CS and US responding are measured with the same behavioral response (magazine entry). As such, statements regarding the simplicity of the number of parameters in the model may simply reflect the niche experimental conditions required to generate data to fit the original hypotheses.

      The consistency of our data with the data reported decades ago by Gibbon and Balsam indicates that the scalar relationship they identified is not unique to certain niche conditions, since any such special conditions would have to hold for both the acquisition of sign-tracking responses in pigeons and magazine entry responses in rats. How broadly it applies will require further experimental work using different paradigms and different species to assess how the rate of acquisition is affected across a wide range of informativeness, just as we have done here.

    1. Author response:

      Thank you for overseeing the review of our manuscript and for providing the eLife Assessment and Public Reviews. We are highly appreciative of the detailed, constructive feedback from the editors and reviewers.

      We acknowledge the core issues raised and we are committed to undertaking the necessary experiments and textual revisions to address every critique.

      Here is a summary of the key revisions we plan to undertake to address the major points raised:

      (1) Absolute yield comparison and efficiency clarification (eLife Assessment, R#3)

      We will perform new quantitative experiments to provide the absolute protein yield of our optimized eCFPS system and benchmark it against a published, widely recognized high-yield CFPS protocol. This will directly address the central requirement for industry comparison and strengthen the claim of "high efficiency." Furthermore, we will revise the manuscript's terminology, especially in the title and abstract, to accurately reflect the system's success in "streamlining" and "robustness" in addition to performance.

      (2) Mechanistic rationale for simplification (eLife Assessment, R#1)

      We will substantially expand the Discussion to provide a mechanistic explanation for why activity is maintained after removing up to 28 components. This analysis will focus on the retention of endogenous metabolic enzymes and residual factors within the "Fast Lysate," citing relevant literature (e.g., Yokoyama et al., 2010, as suggested by R#1) to support the role of metabolic pathways in compensating for the lack of exogenous tRNA, CTP/UTP, and specific amino acids.

      (3) Transcription-translation coupling (R#3)

      To address the concern that expression changes might be due to transcription rather than translation efficiency, we will perform control experiments to monitor mRNA levels under key optimized conditions. This will help confirm that the observed efficiency changes are primarily attributable to translation.

      (4) Data presentation and completeness (R#2)

      We will revise the presentation of data in figures (e.g., Figure 2) to use appropriate graph types for discrete data and ensure all units, incubation times, and conditions are clearly and consistently specified. Furthermore, we will add a paragraph to the Discussion addressing the study's limitations, specifically the potential implications of DTT removal for certain protein types.

      We are confident that these planned revisions will address the reviewers' recommendations and result in a stronger manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):           

      Summary:

      The authors have created a new model of KCNC1-related DEE in which a pathogenic patient variant (A421V) is knocked into a mouse in order to better understand the mechanisms through which KCNC1 variants lead to DEE.  

      Strengths:

      (1)  The creation of a new DEE model of KCNC1 dysfunction. 

      (2)  In Vivo phenotyping demonstrates key features of the model such as early lethality and several types of electrographic seizures. 

      (3)  The ex vivo cellular electrophysiology is very strong and comprehensive including isolated patches to accurately measure K+ currents, paired recording to measure evoked synaptic transmission, and the measurement of membrane excitability at different time points and in two cell types.

      We thank Reviewer 1 for these positive comments related to strengths of the study.   

      Weaknesses:

      (1) The assertion that membrane trafficking is impaired by this variant could be bolstered by additional data.

      We agree with this comment. However, given the technical challenges of standard biochemical experiments for investigating voltage-gated potassium channels (e.g., antibody quality), the lack of a Kv3.1-A421V specific antibody, and the fact that Kv3.1 is expressed in only a small subset of cells, we did not undertake this approach. However, we did perform additional experiments and analysis to improve the rigor of the experiments supporting our conclusion that membrane trafficking is impaired in the Kcnc1-A421V/+ mouse. 

      Such experiments support a highly significant and robust difference in our (albeit imperfect) measurement of the membrane:cytosol ratio of Kv3.1 immunofluorescence between WT and Kcnc1-A421V/+ mice, which is consistent with lack of membrane trafficking (Figure 3). In the revised manuscript, we have added additional data points to this plot and updated the representative example images using improved imaging techniques to better showcase how Kcnc1-A421V/+ PV-INs differ from age-matched WT littermate controls. We think the result is quite clear. Future biochemical experiments, perhaps best performed in a culture system in vitro, could provide additional support for this conclusion.

      (2) In some experiments details such as the age of the mice or cortical layer are emphasized, but in others, these details are omitted.

      We apologize for this omission. We have now clarified the age of the mice and cortical layer for each experiment in the Methods and Results sections as well as figure legends.   

      (3) The impairments in PV neuron AP firing are quite large. This could be expected to lead to changes in PV neuron activity outside of the hypersynchronous discharges that could be detected in the 2-photon imaging experiments, however, a lack of an effect on PV neuron activity is only loosely alluded to in the text. A more formal analysis is lacking. An important question in trying to understand mechanisms underlying channelopathies like KCNC1 is how changes in membrane excitability recorded at the whole cell level manifest during ongoing activity in vivo. Thus, the significance of this work would be greatly improved if it could address this question.

      Yes, the impairments in the neocortical PV-IN excitability are notably severe relative to other PV interneuronopathies that we and others have directly investigated (e.g., Kv3.1 or Kv3.2-/- knockout mice; Scn1a+/- mice). In the revised version of the manuscript, we have now added a more thorough in vivo 2P calcium imaging investigation and analysis of PV-IN (and presumptive excitatory cell) neural activity (Figure 8 and Supplementary Figure 9; Methods, lines 230-271; Results, lines 630-657; and Discussion, lines 795-814).

      Because of the prominent recruitment of neuropil during presumptive myoclonic seizures, further investigation of individual neuronal excitability in vivo required a slightly different labeling strategy now using a soma-tagged GCaMP8m as well as a separate AAV containing tdTomato driven by the PV-IN-specific S5E2 enhancer. Our new results reveal an increase in the baseline calcium transient frequency in non-PV-INs, and reduced mean transient amplitudes in both non-PV cells and PV-INs. These interesting findings, which are consistent with attenuated PV-IN-mediated perisomatic inhibition leading to disinhibited excitatory cells in the Kcnc1-A421V/+ mice, link our in vivo results to the slice electrophysiology experiments. Of course, there are residual issues with the application of this technique to interneurons and the ability to resolve individual or small numbers of spikes, which likely explains the lack of genotype difference in calcium transient frequency in PV-INs.

      (4) Myoclonic jerks and other types of more subtle epileptiform activity have been observed in control mice, but there is no mention of littermate control analyzed by EEG. 

      We performed additional experiments as requested and did not observe myoclonic jerks or any other epileptic activity in WT control mice. We have included this data in the revised manuscript (Figure 9C).   

      Reviewer #2 (Public review):           

      Summary:

      Wengert et al. generated and thoroughly characterized the developmental epileptic encephalopathy phenotype of Kcnc1A421V/+ knock-in mice. The Kcnc1 gene encodes the Kv3.1 channel subunit. Analogous to the role of BK channels in excitatory neurons, Kv3 channels are important for the recurrent high-frequency discharge in interneurons by accelerating the downward hyperpolarization of the individual action potential. Various Kcnc1 mutations are associated with developmental epileptic encephalopathy, but the effect of a recurrent A421V mutation was somewhat controversial and its influence on neuronal excitability has not been fully established. In order to determine the neurological deficits and underlying disease mechanisms, the authors generated cre-dependent KI mice and characterized them using neonatal neurological examination, high-quality in vitro electrophysiology, and in vivo imaging/electrophysiology analyses. These analyses revealed excitability defects in the PV+ inhibitory neurons associated with the emergence of epilepsy and premature death. Overall, the experimental data convincingly support the conclusion.

      Strengths:

      The study is well-designed and conducted at high quality. The use of the Cre-dependent KI mouse is effective for maintaining the mutant mouse line with premature death phenotype, and may also minimize the drift of phenotypes which can occur due to the use of mutant mice with minor phenotype for breeding. The neonatal behavior analysis is thoroughly conducted, and the in vitro electrophysiology studies are of high quality.

      We appreciate these positive comments from Reviewer 2. 

      Weaknesses:

      While not critically influencing the conclusion of the study, there are several concerns.

      In some experiments, the age of the animal in each experiment is not clearly stated. For example, the experiments in Figure 2 demonstrate impaired K+ conductance and membrane localization, but it is not clear whether they correlated with the excitability and synaptic defects shown in subsequent figures. Similarly, it is unclear at what age the authors conducted EEG recordings, and whether non-epileptic mice are younger than those with seizures.

      We have now updated the manuscript to include a clear report of age for all experiments, including the impaired K+ conductance (now Figure 3) and EEG (now Figure 9). There was no intention to omit this information. The recordings of K+ conductance impairments in PV-INs from Kcnc1-A421V/+ mice were completed at P16-21. Thus, we interpret the loss of potassium current density to be causally linked with the impairments in intrinsic physiological function during that same period in neocortical layer II-IV PV-INs and, more subtly, in PV-positive cells in the RTN and neocortical layer V PV-INs.

      Mice used in the EEG experiments were P24-48, an age range which roughly corresponded with the midpoint on the survival curve for Kcnc1-A421V/+ mice. Although we saw significant mouse-to-mouse variability in seizure phenotype, no Kcnc1-A421V/+ mice completely lacked epilepsy or marked epileptiform abnormalities, neither of which were seen in WT mice. We did not detect a clear relationship between seizure frequency/type and mouse age. 

      The trafficking defect of mutant Kv3.1 proposed in this study is based only on the fluorescence density analysis which showed a minor change in membrane/cytosol ratio. It is not very clear how the membrane component was determined (any control staining?). In addition to fluorescence imaging, an addition of biochemical analysis will make the conclusion more convincing (while it might be challenging if the Kv3.1 is expressed only in PV+ cells).

      This relates to comment 3 of Reviewer 1. We agree that, in the initial submission of the manuscript, the evidence from IHC for Kv3.1 trafficking deficits was somewhat subtle. In the revised version of the paper, we have gathered additional replicates of this original experiment with improved imaging quality and clarified how the membrane component was specified, and we now show a robust and highly significant (***P<0.001) decrease in membrane:cytosol Kv3.1 ratio. We have also now provided new example images better showcasing the deficits observed in the Kcnc1-A421V/+ mice (Figure 3). The membrane compartment was defined as the outermost 1 micron of the parvalbumin-defined cell soma (drawn blind to the Kv3.1b signal), and, importantly, all analysis was conducted blinded to mouse genotype. These measures help to ensure that the result is robust and unbiased. Nonetheless, we have added a paragraph in the Discussion section highlighting the limitations of our IHC evidence for trafficking impairment (Lines 868-883).

      While the study focused on the superficial layer because Kv3.1 is the major channel subunit, the PV+ cells in the deeper cortical layer also express Kv3.1 (Chow et al., 1999) and they may also contribute to the hyperexcitable phenotype via negative effect on Kv3.2; the mutant Kv3.1 may also block membrane trafficking of Kv3.1/Kv3.2 heteromers in the deeper layer PV cells and reduce their excitability. Such an additional effect on Kv3.2, if present, may explain why the heterozygous A421V KI mouse shows a more severe phenotype than the Kv3.1 KO mouse (and why they are more similar to Kv3.2 KO). Analyzing the membrane excitability differences in the deep-layer PV cells may address this possibility.

      We appreciate this thoughtful suggestion. We have now provided data from neocortical layer V PV interneurons in the revised manuscript (Supplementary Figure 5). Abnormalities in intrinsic excitability from neocortical layer V PV-INs in Kcnc1-A421V/+ mice were present, but less pronounced than in PV-INs from more superficial cortical layers. These results are consistent with the view that greater relative expression of Kv3.2 “dilutes” the impact of the Kv3.1 A421V/+ variant. Whether the A421V/+ variant impairs membrane trafficking and/or gating of Kv3.2 remains unclear.

      We attempted to assess how the mutant Kv3.1 affects Kv3.2 localization, but were unsuccessful due to the lack of reliable antibodies. After immunostaining mouse brain sections with two different anti-Kv3.2 antibodies, only one produced a somewhat promising signal (see below). However, even in this case, Kv3.2 staining was successful only once (out of five independent staining experiments) and the signal varied across cortical regions, showing widespread cellular Kv3.2 signal in some areas (b, top panel), and barely detectable signal in others, regardless of Kv3.1 expression. In the remaining four attempts, we detected only a ‘fiber-like’ immunostaining signal, further diminishing our confidence in the anti-Kv3.2 antibody, although results could be improved with further testing and refinement, which we will attempt. Consequently, this important question remains unresolved in this study.

      Author response image 1.

      Immunostaining of Kv3.1 and Kv3.2 in sagittal mouse brain sections. a) An example of intracellular Kv3.2 immunostaining signal, variable across the cortex of a WT mouse independent of Kv3.1 expression. b) Kv3.2 is detectable intracellularly in most of the cells in the top panel but barely detectable in the lowest panel. c) Representative image of Kv3.2 immunostaining signal in other sagittal mouse brain sections.

      We have discussed these important implications and limitations of our results in the Discussion (Lines 868-883). We agree with the Reviewer’s interpretation that an impact on Kv3.1/Kv3.2 heteromultimers across the neocortex may explain why the Kcnc1-A421V/+ mouse exhibits a more severe phenotype than Kv3.1-/- or Kv3.2-/- mice (see below), a view which we have attempted to further clarify in the Conclusion.

      In Table 1, the A421V PV+ cells show a more depolarized resting membrane potential than WT by ~5 mV, which seems a robust change and would influence the circuit excitability. The authors measured firing frequency after adjusting the membrane voltage to -65 mV, but are the excitability differences less significant if the resting potential is not adjusted? It is also interesting that such a membrane potential difference is not detected in young adult mice (Table 2). This loss of potential compensation may be important for developmental changes in the circuit excitability. These issues can be more explicitly discussed.

      We do not entirely understand this finding and its apparent developmental component. It could be compensatory, as suggested by the Reviewer; however, it is transient and seems to be an isolated finding (i.e., it is not accompanied by compensation in other properties). It is also possible that this change in Kcnc1-A421V/+ PV-INs may reflect impaired/delayed development. We cannot test excitability at a meaningfully later time point as the mice are deceased.

      The revised version of the manuscript contains additional data (Supplementary Figure 4) showing that major deficits in intrinsic excitability are still observed even when the resting membrane potential is left unadjusted. These results are further discussed in the Results section (lines 522-523) and the Discussion section (lines 727-731).   

      Reviewer #3 (Public review):           

      Summary:

      Here Wengert et al., establish a rodent model of KCNC1 (Kv3.1) epilepsy by introducing the A421V mutation. The authors perform video-EEG, slice electrophysiology, and in vivo 2P imaging of calcium activity to establish disease mechanisms involving impairment in the excitability of fast-spiking parvalbumin (PV) interneurons in the cortex and thalamic PV cells.

      Outside-out nucleated patch recordings were used to evaluate the biophysical consequence of the A421V mutation on potassium currents and showed a clear reduction in potassium currents. Similarly, action potential generation in cortical PV interneurons was severely reduced. Given that both potassium currents and action potential generation were found to be unaffected in excitatory pyramidal cells in the cortex the authors propose that loss of inhibition leads to hyperexcitability and seizure susceptibility in a mechanism similar to that of Dravet Syndrome.  

      Strengths: 

      This manuscript establishes a new rodent model of KCNC1-developmental and epileptic encephalopathy. The manuscript provides strong evidence that parvalbumin-type interneurons are impaired by the A421V Kv3.1 mutation and that cortical excitatory neurons are not impaired. Together these findings support the conclusion that seizure phenotypes are caused by reduced cortical inhibition.

      We thank Reviewer 3 for their view of the strengths of the study.

      Weaknesses:

      The manuscript identifies a partial mechanism of disease that leaves several aspects unresolved including the possible role of the observed impairments in thalamic neurons in the seizure mechanism. Similarly, while the authors identify a reduction in potassium currents and a reduction in PV cell surface expression of Kv3.1 it is not clear why these impairments would lead to a more severe disease phenotype than other loss-of-function mutations which have been characterized previously. Lastly, additional analysis of videoEEG data would be helpful for interpreting the extent of the seizure burden and the nature of the seizure types caused by the mutation.

      We agree with these comments from Reviewer 3. We studied neurons in the reticular thalamus and layer V neocortical PV-INs since they are also linked to epilepsy pathogenesis and are known to express Kv3.1. However, for most of the study, we focused on neocortical layer II-IV PV-INs, because these cells exhibited the most robust impairments in intrinsic excitability. Crossing our novel Kcnc1-Flox(A421V)/+ mice to a cerebral cortex interneuron-specific driver that would avoid recombination in the thalamus, such as Ppp1r2-Cre (RRID:IMSR_JAX:012686), could assist in determining the relative contribution of thalamic reticular nucleus dysfunction to the overall phenotype, as used by Makinson et al. (2017) to address a similar question; however, we have been unable to obtain this mouse despite extensive effort. There are of course other Kv3.1-expressing neurons in the brain, including in the hippocampus, amygdala, and cerebellum, and we have provided additional discussion (Lines 731-736) of this issue.

      We further agree with the Reviewer that a major question in the field of KCNC1-related neurological disorders is the mechanistic underpinning of why the KCNC1-A421V variant leads to a more severe disease phenotype than other loss-of-function KCNC1 variants, and, further, why the mouse phenotype is more severe than the Kcnc1 knockout. Previous results and our own recordings in heterologous systems suggest that the A421V variant is more profoundly loss-of-function than the R320H variant (Oliver et al., 2017; Cameron et al., 2019; Park et al., 2019), which is consistent with A421V having a more severe disease phenotype. Relative to knockout of Kv3.1, our results are consistent with the view that the A421V variant exhibits dominant negative activity by reducing surface expression of Kv3.1 and/or Kv3.2 (an effect that would not occur in knockout mice), with a possible additional contribution from impaired gating of Kv3.1/Kv3.2 heteromultimers that incorporate A421V subunits. Our finding that the magnitude of total potassium current was reduced in PV-INs by ~50% is consistent with a combination of these various mechanisms but does not distinguish between them.

      In the revised version of the manuscript, we have provided a more complete discussion of these important remaining questions regarding our interpretation of how the severity of KCNC1 disorders relates to the biophysical features of the ion channel variant (lines 868-883).

      Recommendations for the authors

      Reviewer #1 (Recommendations for the authors):          

      Major

      (1) The authors suggest that the reduced K+ current density in Kcnc1-A421V/+ neurons is due in part to impaired trafficking and cell surface expression of Kv3.1 in these neurons. The data supporting this claim aren't completely convincing. First, it's difficult to visualize a difference in Kv3.1 localization in the images shown in panel H, and importantly, it seems problematic that the method to assess Kv3.1 levels in membrane vs. cytosol relied on using PV co-staining to define the membrane compartment as the outermost 1 um of the PV-defined cell soma. This doesn't seem to be the best method to define the membrane compartment, as the PV signal should be largely cytosolic.

      As noted above, we have completed additional data collection to confirm our results, and have performed additional imaging and updated our example images to be more representative of the observed deficits in membrane Kv3.1 expression in the Kcnc1-A421V/+ mice. We attempted to identify a marker to more clearly label the membrane to combine with PV immunocytochemistry but were unable to do so despite some effort. 

      Is it possible that in control neurons, the cytosolic PV signal localizes within the membrane-bound Kv3.1 signal, with less colocalization, whereas in Kcnc1-A421V/+ neurons, there would be more colocalization of the cytosolic PV and improperly trafficked Kv3.1? Could the data be presented in this way showing altered colocalization of Kv3.1 with PV?

      We do not entirely understand the nature of this concern. In our experiments, we utilized the PV signal to determine the cell membrane and cytosolic compartments in an unbiased manner using a 1-micron shell traced around/outside the edge of the PV signal to define the membrane compartment, with the remainder of the area (minus the nuclear signal defined by DAPI) defined as the cytosol (see Methods 176-186). Because we did not identify any alterations in PV signal or correlation between PV immunohistochemistry and tdTomato expression in Cre reporter strains between WT and Kcnc1-A421V/+ mice, we believe that our strategy for determining membrane:cytosol ratio of Kv3.1 in an unbiased manner is acceptable (albeit of course imperfect). 
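
      The compartment definition described above can be illustrated with a minimal sketch. It is not the analysis code used in the study; it assumes a single 2D optical section, boolean masks for the PV-defined soma and the DAPI-defined nucleus, and a hypothetical pixel size (`um_per_px`). The function names `dilate` and `membrane_cytosol_ratio` are invented for illustration, and the shell is taken as the ring just outside the PV mask.

```python
import numpy as np

def dilate(mask, iterations):
    """Grow a boolean mask by one 4-connected pixel per iteration."""
    out = mask.copy()
    for _ in range(iterations):
        p = np.pad(out, 1, constant_values=False)
        # A pixel is set if it or any of its 4 neighbours was set.
        out = (p[1:-1, 1:-1] | p[:-2, 1:-1] | p[2:, 1:-1]
               | p[1:-1, :-2] | p[1:-1, 2:])
    return out

def membrane_cytosol_ratio(kv31, pv_mask, nucleus_mask,
                           shell_um=1.0, um_per_px=0.25):
    """Mean Kv3.1 intensity in the shell divided by mean in the cytosol.

    shell_um and um_per_px are illustrative values, not taken from the study.
    """
    n_px = max(1, int(round(shell_um / um_per_px)))
    # Membrane compartment: ring of width ~shell_um around the PV-defined soma.
    membrane = dilate(pv_mask, n_px) & ~pv_mask
    # Cytosol: PV-defined soma minus the DAPI-defined nucleus.
    cytosol = pv_mask & ~nucleus_mask
    return kv31[membrane].mean() / kv31[cytosol].mean()
```

      Because both masks are drawn from the PV and DAPI channels only, the Kv3.1 channel never influences where the compartments are placed, which is the property the blinded analysis relies on.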

      Alternatively, membrane fractionation could be performed on WT vs Kcnc1-A421V/+ neurons, followed by Western blotting with a Kv3.1 antibody to show altered proportions in the cytosolic vs. membrane protein fractions. It's important that these results are convincing, as the findings are mentioned in the Abstract, the Results section, and multiple times in the Discussion, although it is still unclear how much the potential altered trafficking contributes to the decrease in K+ currents versus changes in channel gating.

      Multiple technical barriers made it difficult for us to gain direct biochemical evidence for altered trafficking of the A421V/+ Kv3.1 variant (see above). It is not clear how membrane fractionation techniques could be easily applied in this case (at least by us) when PV-INs constitute 3-5% of all neocortical neurons. We further agree (as noted above) that it is difficult to properly disentangle the relative roles of impaired membrane trafficking vs. gating deficits to the observed effect; however, we think that both phenomena are likely occurring. In the revised version of the manuscript, we have more explicitly discussed these limitations in the Discussion section (Lines 868-883).   

      (2) More information is needed regarding the age of mice used for experiments for the following results (added to the Results section as well as figure legends):

      PV density (Supplementary Figure 1) 

      K+ current data (Figure 2A-G)       

      Kv3.1 localization (Figure 2H and I)        

      RTN electrophysiology (Supplementary Figure 3)

      Excitatory neuron electrophysiology (Figure 4)             

      In vivo 2P calcium imaging (Figure 7) 

      Video-EEG (Figure 8)

      We apologize for omitting this critical information. In the revised manuscript, we have provided the age of mice for each of our experiments in the results section, in the figure legend, and in the methods section.   

      (3) It's unclear why developmental milestones/behavioral assessments were only done at P5-P10. In the previous publication of another Kcnc1 LOF variant (Feng et al. 2024), no differences were found at P5-P10, and it was suggested in the discussion that this finding was "consistent with the known developmental expression pattern of Kv3.1 in mouse, where Kv3.1 protein does not appear until P10 or later". In that paper, they did find behavioral deficits at 2-4 months. Even though this model is more severe than the previous model, it would be interesting to determine if there are any behavioral deficits at a later time point (especially as they find more neurophysiological impairments at P32-42).

      As in our previous study, the lack of clear behavioral deficits in developmental milestones from P5-15 is potentially expected considering the developmental expression of Kv3.1, and we performed these experiments primarily to showcase that the Kcnc1-A421V/+ mice exhibit otherwise normal overall early development (although this could be an artifact of the sensitivity of our testing methods).

      For the revised manuscript, we have conducted additional experiments to investigate behavioral deficits in adult Kcnc1-A421V/+ mice. We found cognitive/learning deficits in Kcnc1-A421V/+ mice relative to WT in both the Barnes maze (Figure 2A-C) and Y-maze (Figure 2D-F). Other aspects of animal behavior, including cerebellar-related motor function, are likely also impaired at post-weaning timepoints and will be included in a forthcoming research study focusing on motor function in these mice.

      (4) In the Results section, it should be more clearly stated which cortical layer/layers are being studied. In some cases, it mentions layers 2-4, and in some, only layer 4, and in others, it doesn't mention layers at all. Toward the beginning of the Results section, the rationale for focusing on layers 2-4 to assess the effects of this variant should be well described and then, for each experiment, it should be stated which cortical layers were assessed. Related to this point, it seems electrophysiology was only done in layer 4; the rationale for this should also be included.

      We have now clarified which neocortical layers were under investigation in the study. All PV-INs were targeted in somatosensory layers II-IV, while excitatory neurons were either cortical layer IV spiny stellate cells or pyramidal cells. Paired recordings were also completed in layer IV. We have also more explicitly articulated our rationale for looking at PV-INs in layers II-IV to examine the cellular/circuit-level impact of Kv3.1 in a model of developmental and epileptic encephalopathy (Lines 487-491).

      (5) Kcnc1-A421V/+ PV neurons showed more robust impairments in AP shape and firing at P32-42 than at P16-21 (Figure 3), and only showed synaptic neurotransmission alterations at P32-42 (Figure 6). Thus, it's unclear why Kcnc1-A421V/+ excitatory neurons were only assessed at P16-21 (Figure 4 and Supplementary Figure 4 related to Figure 5), particularly if only secondary or indirect effects on this population would be expected.

      We appreciate this excellent point raised by the Reviewer and we have taken the suggestion to examine excitatory neurons at P32-42 in addition to the earlier juvenile timepoint. Our new results from the later timepoint are similar to our results at P16-21: Excitatory neurons show no statistically significant impairments in intrinsic excitability at either of the two timepoints examined (Supplementary Figure 7). This adds support to our original conclusion that PV-INs represent the major driver of disease pathology across development.   

      (6) The 2P calcium imaging experiments are potentially interesting, however, a relationship between these results and the electrophysiology results for PV neurons is lacking. Was there an attempt to assess the frequency and/or amplitude of calcium events specifically in PV neurons, outside of the hypersynchronous discharges, to determine whether there are differences between WT and Kcnc1-A421V/+, as was seen in the electrophysiological analyses? It does seem there are some key differences between the two experiments (age: later timepoint for 2P vs. P16-21 and P32-42, layer: 2/3 vs. 4, and PV marking method: virus vs. mouse line), but the electrophysiological differences reported were quite strong. Thus, it would be surprising if there were no alterations in calcium activity among the Kcnc1-A421V/+ PV neurons.

      In our initial experiments, the prominent neuropil GCaMP signal in Kcnc1-A421V/+ mice rendered it difficult to distinguish and accurately describe baseline neuronal excitability in PV-INs and non-PV cells. In our revised manuscript, we utilized a soma-tagged GCaMP8m and separately labeled PV-INs through S5E2-tdTomato. This strategy made it possible to assess the amplitude and frequency of calcium transients in both PV-positive and PV-negative cells in vivo. We have updated the description of our methods (lines 230-271) and our results (lines 630-657) in the revised manuscript.

      As noted above, our more detailed analysis of somatic calcium transients in PV-IN and non-PV cells during quiet rest (Figure 8 and Supplementary Figure 9) shows that PV-INs from Kcnc1-A421V/+ mice are abnormally excitable, having reduced transient amplitude relative to WT controls. Interestingly, non-PV cells also exhibited an increased calcium transient frequency and reduced amplitude, which is potentially consistent with reduced perisomatic inhibition causing disinhibition in cortical microcircuits. We again highlight that the slow kinetics of GCaMP combined with the calcium buffering and brief spikes of PV-INs render quantification of action potential frequency and comparisons between groups difficult.

      (7) As mentioned above, it would be helpful to state the time points or age ranges of these experiments to better understand the results and relate them to each other. For example, the 2P imaging showed apparent myoclonic seizures in 7/7 Kcnc1-A421V/+ mice (recorded for a total of 30-50 minutes/mouse), but the video-EEG showed myoclonic seizures in only 3/11 Kcnc1-A421V/+ mice (recorded for 48-72 hours/mouse). Were these experiments done at very different age ranges, so this difference could be due to some sort of progression of seizure types and events as the mice age? Is it possible these are not the same seizure types (even though they are similarly described)? This discrepancy should be discussed.

      Mice in the EEG experiments were between the ages of P24 and 48, slightly younger than the age in which we carried out the in vivo calcium imaging experiments (>P50). Therefore, an age-related exacerbation in myoclonic jerks is possible. 

      As is highlighted by the Reviewer, it is interesting that the myoclonic seizures were only detected in a portion of the Kcnc1-A421V/+ mice during EEG monitoring (4/12). We believe that the difference is most likely driven by more sensitive detection of the myoclonic jerk activity and behavior in the 2P imaging of neuropil cellular activity compared to our video-EEG monitoring and 2P imaging of soma-tagged GCaMP. We have occasionally observed repetitive myoclonic jerking in mice that appears highly localized (i.e., one forepaw only), suggesting that the myoclonic seizures exist on a spectrum of severity from focal to diffuse. It is therefore possible that myoclonic events and electrographic activity may be slightly underestimated in our video-EEG experiments.

      We have now added a few lines discussing this discrepancy in the Discussion (lines 809-814).

      (8) Myoclonic jerks and other types of more subtle epileptiform activity have been observed in control mice. Was video-EEG performed on control mice? These data should be added to Figure 8.

      We have added recordings in control WT mice (N=4). We did not detect myoclonic jerks or other epileptiform activity in the control mice (Figure 9).  

      Minor

      (1) In the first Results section, Line 365, the P value (P<0.001) is different from that in the legend for Figure 1, line 743 (P<0.0001).

      We have fixed this discrepancy. 

      (2) For Supplementary Figure 1, it would be helpful to show images that span the cortical layers (1-6), as PV and Kv3.1 are both expressed across the cortical layers.

      We have updated Supplementary Figure 1 with better example images that span the cortical layers.    

      (3) Error bars should be added to the line graphs in Supplementary Figure 2, particularly panels B and C. Some of the differences appear small considering the highly significant p-values (i.e. body weight at P7 and brain weight at P21).

      The values shown in Supplementary Figure 2D-E are percentages of mice displaying a particular characteristic, so there is no variance for the data.

      Supplementary Figure 2B-C do contain error bars plotted as SEM; however, because of the large N and the small variance in the measurements, the error bars are not apparent in the graphs. This has been noted in the Supplementary Figure 2 legend for clarity.

      (4) In Figure 3, although the Kcnc1-A421V/+ neurons have elevated AP amplitudes relative to WT, the representative traces for P16-21 and P32-42 groups appear strikingly opposite (traces in B and G appear to have much higher amplitudes than those in C and H). As this is one of the three AP phenotypes described, it would be nice to have it reflected in the traces.

      We have updated our example traces to better represent our main findings including AP amplitude for both P16-21 and P32-42 timepoints.  

      (5) Were any effects on the AHP assessed in the electrophysiology experiments? As other studies have reported the effects of altered Kv3 channel activity on AHP, this parameter could be interesting to report as well.

      We have now provided data on the afterhyperpolarization for each condition displayed in the Supplementary data tables. Interestingly, we failed to detect significant differences in AHP between WT and Kcnc1-A421V/+ PV-INs, RTN neurons, or pyramidal cells, although we did identify differences in the dV/dt of the repolarization phase of the AP.   

      (6) The figure legend for Figure 7 has errors in the panel labeling (D instead of C, and two Fs).

      This error has been corrected in the revised manuscript.

      Reviewer #3 (Recommendations for the authors):

      Specific comments and questions for the authors:         

      (1) Do the authors provide a reason for why the juvenile animals are unaffected by the A421V mutation? Is it that PV cells have not fully integrated at this early time point or that Kv3.1 expression is low? Is the developmental expression profile of Kv3.1 in PV cells known and if so could the authors update the discussion with this information?

      We interpret the normal early developmental milestones (P5-P15) to reflect that Kcnc1-A421V/+ mice exhibit the onset of their neurological impairment at the same time that PV-INs upregulate Kv3.1, develop a fast-spiking physiological phenotype, and integrate into functional circuits in the third and fourth postnatal weeks. We have updated the discussion (Lines 780-782) with this information and more clearly describe our interpretation of these early-life behavioral experiments.

      (2) I would like to see a more complete analysis of the Video-EEG data that is included in Figure 8. What was the seizure duration and frequency? Were there spike-wave seizure types observed? Were EEG events that involve thalamocortical circuitry affected such as spindles? Was sleep architecture impaired in the model? Were littermate control animals recorded?

      Although classical convulsive seizures represent only part of the overall epilepsy phenotype that this mouse exhibits, we agree that reporting seizure duration and frequency is important. We have now included this in our revised manuscript (line 624-626). We have also now added WT control mice to our dataset, and, as expected, we failed to observe any epileptic features in our WT recordings.

      In our EEG experiments, we did not record EMG activity in the mouse to allow for unambiguous determination of sleep vs. quiet wakefulness. For that reason, and because we believe it beyond the scope of this particular study, we did not examine sleep-related EEG phenomena such as spindles or sleep architecture. We have, however, added a line in the discussion (line 771-774) suggesting that future studies focus on a more thorough investigation of the EEG activity in these animals. 

      (3) The in vivo calcium imaging data shows synchronous bursts in A421V animals which is in agreement with the synchronous bursts observed in the EEG. Overall the analysis of the in vivo calcium imaging data appears to be rudimentary and perhaps this is a missed opportunity. What additional insights were gained from this technically demanding experiment that were not obtained from the EEG recordings?

      As noted above, in the revised version of the manuscript, we have conducted additional experiments which allowed us to separately examine PV-IN and non-PV neuron excitability via 2P in vivo calcium imaging. This required an alternative strategy to label individual neuronal somata without contamination by the robust neuropil signal that we observed in the approach undertaken in the original submission. We’ve described the details of this new approach in methods (Lines 230-271) and results section (lines 630-657).

      Our new results (Figure 8 and Supplementary Figure 9) reveal that, during quiet rest, neocortical PV-INs from Kcnc1-A421V/+ mice exhibit a reduction in calcium transient amplitude and that non-PV cells exhibit altered transient frequency and amplitude. Overall, we believe that these results are consistent with the view that PV-IN-mediated perisomatic inhibition is compromised in Kcnc1-A421V/+ mice, which leads to downstream hyperexcitability in excitatory neurons within cortical microcircuits.

      (4) The increased severity of seizure phenotypes observed in the A421V model relative to knockout mice is interesting but also confusing given what is known about this mutation. As the authors point out, a possible explanation is that the mutation is acting in a dominant negative manner, where mutant Kv3.1 channels compete with other Kvs that would otherwise be able to partially compensate for the loss of Kv function. Alternatively, the A421V mutation might act by affecting the trafficking of heterotetrameric Kv3 channels to the membrane. Can the authors clarify why a trafficking deficit would produce a different effect than a loss of function mutation? Are the authors proposing that a hypomorphic mutation involving both a partial trafficking deficit and a dominant negative effect of those channels that are properly localized is more severe than a "clean" loss of function? The roughly 50% loss of potassium current absent a change in gating would be expected to behave like a loss-of-function mutation. This might be addressed by comparing the surface expression of the other Kv channels and/or through the use of Kv3.1-selective pharmacology.

      These are excellent points raised by the Reviewer. As noted above, we have endeavored to clarify our hypothesis as to the basis of this phenomenon, although the mechanistic basis for the more severe phenotype in the Kcnc1-A421V/+ mouse relative to the Kv3.1 knockout is not entirely clear. Our physiology results, together with the evidence presented supporting a trafficking impairment, are consistent with dominant negative action of the Kv3.1 A421V variant at the level of channel gating and/or trafficking. To restate, we think the Kcnc1-A421V/+ heterozygous variant is more severe than a Kv3.1 knockout for (at least) three reasons: variant Kv3.1 is incorporated into Kv3.1/Kv3.2 heterotetramers to (1) impair trafficking to the membrane as well as (2) alter the electrophysiological function of those channels that do successfully traffic to the membrane (while Kv3.1 knockout affects Kv3.1 only), and (3) the heterozygous variant may escape the compensatory upregulation of Kv3.2 that is known to occur in Kv3.1 knockout mice.

      For example, our data are consistent with the view that heterotetramers of WT Kv3.1 and Kv3.2 potentially come together with the A421V Kv3.1 subunit in the endoplasmic reticulum and then fail to traffic to the membrane due to the presence of one or more A421V subunit(s), as evidenced by increased Kv3.1 staining in the cytosol in the Kcnc1-A421V/+ mouse relative to WT. This is in contrast to what would occur in the Kv3.1 knockout mice, as there is no subunit produced from the null allele to prevent WT Kv3.2 subunits from forming fully functional Kv3.2 homotetramers that then reach the cell surface and function properly. This is one specific possible mechanism for dominant negative activity.

      A non-mutually-exclusive mechanism is that inclusion of one or more Kv3.1 A421V subunits into Kv3 heterotetramers impairs gating and prevents potassium flux such that, even if the tetramer does reach the membrane, that entire tetramer fails to contribute to the total potassium current. This is another possible mechanism for dominant negative function of the A421V subunit.

      Experimental elucidation of the precise mechanism of the dominant negative activity of the A421V Kcnc1 variant is beyond the scope of this study; yet, our lab is continuing to work on this. It will likely require dose-response experiments in which various ratios of WT and Kv3.1 A421V subunits are co-expressed in heterologous cells and then recorded for an overall effect on potassium current, similar to the approach of Clatot et al. (2017).
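
      As a purely illustrative sketch (not an analysis performed in this study), the expected readout of such a dose-response experiment can be approximated under a simple assumption: subunits assemble into tetramers at random, and a tetramer is functional only if it contains no more than a tolerated number of A421V subunits. The function name `functional_fraction`, the binomial assembly assumption, and the `max_tolerated` parameter are all hypothetical choices made for illustration; real Kv3.1/Kv3.2 assembly may be biased and need not follow this model.

```python
from math import comb


def functional_fraction(mutant_fraction: float,
                        subunits: int = 4,
                        max_tolerated: int = 0) -> float:
    """Fraction of channels containing at most `max_tolerated` mutant subunits.

    Assumes (hypothetically) that each of the `subunits` positions is filled
    independently, with probability `mutant_fraction` of being a mutant subunit.
    """
    f = mutant_fraction
    return sum(
        comb(subunits, k) * f**k * (1 - f) ** (subunits - k)
        for k in range(max_tolerated + 1)
    )


if __name__ == "__main__":
    # Sweep the mutant share of the subunit pool, as in a co-expression series.
    for f in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"mutant fraction {f:.2f}: "
              f"functional channels {functional_fraction(f):.4f}")
```

      Under these assumptions, an equal mix of WT and mutant subunits with a fully dominant mutant (`max_tolerated=0`) leaves only 1 in 16 tetramers free of mutant subunits, whereas a non-interfering loss-of-function allele would leave the WT subunit pool unaffected. Measured currents that fall between these extremes would suggest partial tolerance of mutant subunits or non-random assembly.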

      In the revised manuscript, we have updated our discussion of these mechanistic considerations for KCNC1-related epilepsy syndromes in lines 868-883 in the Discussion. 

      References

      Cameron JM et al. (2019) Encephalopathies with KCNC1 variants: genotype-phenotype-functional correlations. Annals of Clinical and Translational Neurology 6:1263–1272.

      Clatot J, Hoshi M, Wan X, Liu H, Jain A, Shinlapawittayatorn K, Marionneau C, Ficker E, Ha T, Deschênes I (2017) Voltage-gated sodium channels assemble and gate as dimers. Nature Communications 8.

      Makinson CD, Tanaka BS, Sorokin JM, Wong JC, Christian CA, Goldin AL, Escayg A, Huguenard JR (2017) Regulation of Thalamic and Cortical Network Synchrony by Scn8a. Neuron 93:1165-1179.e6.

      Oliver KL et al. (2017) Myoclonus epilepsy and ataxia due to KCNC1 mutation: Analysis of 20 cases and K+ channel properties. Annals of Neurology 81.

      Park J et al. (2019) KCNC1-related disorders: new de novo variants expand the phenotypic spectrum. Annals of Clinical and Translational Neurology 6:1319–1326.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) A detailed comparison between this work and the work of Sun et al. on experimental protocols and reagents in the main text will be beneficial for readers to assess critically.

      We have added a Key Reagents Table outlining the key reagents used in our study. In terms of experimental protocols, we replicated those described by Sun et al. in most instances and described any differences when present. With this resubmission, we included additional ZnMP accumulation experiments in liquid media (see point 3 below).

      (2) The GaPP used by Sun et al. (purchased from Frontier Scientific) is more effective in killing the worm than the one used in this study (purchased from Santa Cruz). Is the different outcome due to the differences in reagents? Moreover, Sun et al. examined the lethality after 3-4 days, while this work examined the lethality after 72 hours. Would the extra 24 hours make any difference in the result?

      We now cite product vendor differences as a possible reason for the observed difference in worm death, as the reviewer suggests, on page 8 (see text below) and include these differences in the Key Reagents Table. We also now stress the fact that our experiments included different doses of GaPP and the use of eat-2 mutants as an additional control, which we believe adds rigor and demonstrates the potency of GaPP in our experiments. We decided on assessment at 72 hours, as we deemed it a less nebulous time point as compared to 3-4 days. Most of the observed worm death occurred earlier in this interval, so we believe it is unlikely that large group differences would emerge after an additional 24 hours.

      “Exposing worms to GaPP, a toxic heme analog, we observed that nematodes deficient in HRG-9 and HRG-10 displayed increased survival compared to WT worms, consistent with prior work,[13] though the between-group difference was markedly smaller in our study. We required higher GaPP concentrations to induce lethality, potentially due to product vendor differences, but did observe a clear dose-dependent effect across strains. Although it was previously proposed that the survival benefit seen in worms lacking HRG-9 and HRG-10 resulted from reduced transfer from intestinal cells after GaPP ingestion, our data suggest the reduced lethality is more likely due to decreased environmental GaPP uptake. Supporting this notion, DKO worms exhibited lawn avoidance, reduced pharyngeal pumping, and modestly lower intestinal ZnMP accumulation when exposed to this fluorescent heme analog on agar plates. In liquid media, DKO worms demonstrated higher fluorescence, but only in ZnMP-free conditions, suggesting the presence of gut granule autofluorescence. Furthermore, survival following exposure to GaPP was highest in eat-2 mutants, despite heme trafficking being unaffected in this strain.”

      (3) This work reported the opposite result of Sun et al. for the fluorescent ZnMP accumulation assay. However, the experimental protocols used by the two studies are massively different. Sun et al. did the ZnMP staining by incubating the L4-stage worms in an axenic mCeHR2 medium containing 40 μM ZnMP (purchased from Frontier Scientific) and 4 μM heme at 20 ℃ for 16 h, while this work placed the L4-stage worms on the OP50 E. coli seeded NGM plates treated with 40 μM ZnMP (purchased from Santa Cruz) for 16 h. The liquid axenic mCeHR2 medium is bacteria-free, heme-free, and consistent for ZnMP uptake by worms. This work has mentioned that the hrg-9 hrg-10 double null mutant has bacterial lawn avoidance and reduced pharyngeal pumping phenotypes. Therefore, the ZnMP staining protocol used in this work faces challenges in the environmental control for the wild type vs. the mutant. The authors should adopt the ZnMP staining protocol used by Sun et al. for a proper evaluation of fluorescent ZnMP accumulation.

      We agree with this comment. As such, we performed the ZnMP assay in liquid media conditions, as now described on page 13:

      “For liquid media experiments, three generations of worms were cultured in regular heme (20 µM) axenic media, with the first two generations receiving antibiotic-supplemented media (10 mg/ml tetracycline) and the 3rd generation cultivated without antibiotic. L4 worms from the 3rd generation were placed in media containing 40 µM ZnMP for 16 hours before being prepared and mounted for imaging as above. Worms were imaged on Zeiss Axio Imager 2 at 40x magnification, with image settings kept uniform across all images. Fluorescent intensity was measured within the proximal region of the intestine using ImageJ.”

      In heme-free media, both WT and DKO worms invariably entered L1 arrest; thus, we were not able to replicate the results reported by Sun et al. Using media containing heme, we did see an increase in fluorescence, but this was only in the ZnMP-free condition, indicating that the increased signal was attributable to autofluorescence. This is a known phenomenon associated with gut granules in C. elegans in the setting of oxidative stress. The results of these experiments are now summarized on page 6:

      “DKO nematodes at the L4 larval stage were previously shown to accumulate the fluorescent heme analog zinc mesoporphyrin IX (ZnMP) in intestinal cells in low-heme (4 µM) liquid media. While attempting to replicate this experiment, we observed that both wildtype and DKO nematodes entered L1 arrest under these conditions. Therefore, to allow for developmental progression, we grew worms on standard OP50 E. coli plates and in media containing physiological levels of heme (20 µM). We then examined whether differences in ZnMP uptake persisted under these basal conditions. DKO worms grown on ZnMP-treated E. coli plates displayed significantly reduced intestinal ZnMP fluorescence compared to N2 (Figure 1B and C). Using basal heme media with ZnMP, there was no significant difference in ZnMP fluorescence between DKO and wildtype nematodes, although DKO worms grown in media without ZnMP exhibited significantly higher autofluorescence (Figure 1D and E). To test whether autofluorescence may have contributed to the higher fluorescent intensities previously reported in heme-deficient DKO worms, we repeated this experiment on agar plates under starved conditions but did not observe a difference between groups (Figure 1B).”

      (4) A striking difference between the two studies is that Sun et al. emphasize the biochemical function of TANGO2 homologs in heme transporting with evidence from some biochemical tests. In contrast, this work emphasizes the physiological function of TANGO2 homologs with evidence from multiple phenotypical observations. In the discussion part, the authors should address whether these observed phenotypes in this study can be due to the loss of heme transporting activities upon eliminating TANGO2 homologs. This action can improve the merit of academic debate and collaboration.

      Thank you for this suggestion. The following text has been added to the Discussion section (page 9):

      “In addition to altered pharyngeal pumping, DKO worms displayed multiple previously unreported phenotypic features, suggesting a broader metabolic impairment and reminiscent of some clinical manifestations observed in patients with TDD. Elucidating the mechanisms underlying this phenotype, and whether they reflect a core bioenergetic defect, is an active area of investigation in our lab. Several C. elegans heme-responsive genes have been characterized, revealing relatively specific defects in heme uptake or utilization rather than broad organismal dysfunction. For example, hrg-1 and hrg-4 mutants exhibit impaired growth only under heme-limited conditions,[23] and hrg-3 loss affects brood size and embryonic viability specifically when maternal heme is scarce.[24] By contrast, hrg-9 and hrg-10 mutants exhibit the most severe organismal phenotypes of the hrg family, to date, including reduced pharyngeal pumping, decreased motility, shortened lifespan, and smaller broods, even when fed a heme-replete diet.”

      Reviewer #2 (Public review):

      (1) The manuscript is written mainly as a criticism of a previously published paper. Although reproducibility in science is an issue that needs to be acknowledged, a manuscript should focus on the new data and the experiments that can better prove and strengthen the new claims.

      Thank you for this suggestion. While the primary intent of this study was to replicate key findings from the 2022 publication by Sun et al., the revised manuscript now emphasizes underlying mechanisms more broadly rather than focusing narrowly on that prior publication.

      (2) The current presentation of the logic of the study and its results does not help the authors deliver their message, although they possess great potential.

      We have attempted to rectify this through substantial revision of the Discussion section and other places throughout the manuscript.

      (3) The study is missing experiments to link hrg-9 and hrg-10 more directly to bioenergetic and oxidative stress pathways.

      The reviewer is correct in this assertion, but it was not our intent to definitively prove this link or, indeed, the primary mechanism of TANGO2 in the present manuscript. This said, we are actively engaged in this endeavor in our lab and anticipate these data will be published in a separate, forthcoming publication.

      We have added additional references pertaining to hrg-9 enrichment as part of the mitochondrial unfolded protein response (page 10) and a comparison of the phenotype observed in hrg-9 and hrg-10 deficient worms versus those lacking other proteins in the hrg family (page 9).

      Reviewer #3 (Public review):

      (1) The authors stress - with evidence provided in this paper or indicated in the literature - that the primary role of TANGO2 and its homologues is unlikely to be related to heme trafficking, arguing that observed effects on heme transport are instead downstream consequences of aberrant cellular metabolism. But in light of a mounting body of evidence (referenced by the authors) connecting more or less directly TANGO2 to heme trafficking and mobilization, it is recommended that the authors comment on how they think TANGO2 could relate to and be essential for heme trafficking, albeit in a secondary, moonlighting capacity. This would highlight a seemingly common theme in emerging key players in intracellular heme trafficking, as it appears to be the case for GAPDH - with accumulating evidence of this glycolytic enzyme being critical for heme delivery to several downstream proteins.

      TANGO2 is essential for mitochondrial health, albeit in a yet unknown capacity. In the absence of TANGO2, defects in heme trafficking may be secondary sequelae of mitochondrial dysfunction. We would point out that prior studies that attempted to show that TANGO2 and its homologs are involved in heme trafficking proposed very different mechanisms (direct binding vs. membrane protein interaction) and relied on artificially low or high heme conditions to produce these effects. We have attempted to address these more clearly in the Discussion section and have added a fifth figure to summarize our current unifying theory for how heme levels and mitochondrial stress may be linked.

      (2) The observation - using eat-2 mutants and lawn avoidance behaviour - that survival patterns can be partially explained by reduced consumption, is fascinating. It would be interesting to quantify the two relative contributions.

      We have completed additional ZnMP experiments in liquid media at the reviewers’ request. This experimental condition eliminates lawn avoidance as a factor in consumption. Fluorescent intensity was significantly higher in the DKO worms in media lacking ZnMP, indicating increased autofluorescence in DKO worms, while signal was not significantly different in media with ZnMP.

      (3) In the legend to Figure 1A it's a bit unclear what the differently coloured dots represent for each condition. Repeated measurements, worms, independent experiments? The authors should clarify this.

      The following sentence has been added to the legend for Figure 1:

      “Each dot represents the number of offspring laid by one adult worm on one GaPP-treated plate after 24 hours.”

      (4) It would help if the entire fluorescence images (raw and processed) for the ZnMP treatments were provided. Fluorescence images would also benefit Figure 1B.

      Fluorescent intensity values pertaining to the ZnMP experiments are included in our Extended Data supplement, and we have added representative images to Figure 1, per the reviewer’s request. We thank the reviewer for this helpful suggestion. We would be happy to upload raw images to an open-access repository if deemed necessary by the editorial team.

      (5) Increasingly, the understanding of heme-dependent roles relies on transient or indirect binding to unsuspected partners, not necessarily relying on a tight affinity and outdating the notion of heme as a static cofactor. Despite impressive recent advancements in the detection of these interactions (for example https://doi.org/10.1021/jacs.2c06104; cited by the authors), a full characterisation of the hemome is still elusive. Sandkuhler et al. deemed it possible but seem to question that heme binding to TANGO2 occurs. However, Sun et al. convincingly showed and characterised TANGO2 binding to heme. It is recommended that the authors comment on this.

      We believe it is plausible that TANGO2 binds heme (as do hundreds of other proteins), especially as it has been shown to bind other hydrophobic molecules. However, we also note that a separate paper examining the role of TANGO2 in heme transport posited that GAPDH is the sole heme binding partner for cytoplasmic transport (https://doi.org/10.1038/s41467-025-62819-2), contradicting the originally posited theory of how TANGO2 functions. This is described in the Discussion section and, as noted above, we have added an additional figure to demonstrate our unifying hypothesis for why TANGO2 may be important in the low-heme state, irrespective of any direct effect on heme trafficking.

      Additional comments and revisions:

      (1) It was suggested that a triple mutant (eat-2; hrg-9; hrg-10) be tested to determine the primary driver of GaPP toxicity. We appreciate this suggestion, but we offer the following rationale for why these experiments were not pursued. The eat-2 mutant, which lacks a nicotinic acetylcholine receptor subunit in pharyngeal muscles, was included solely as a dietary restriction control to illustrate that reduced GaPP toxicity in the hrg-9/10 double mutant could arise from poor feeding rather than defective heme transport. Both eat-2 and hrg-9/10 mutants exhibit markedly reduced feeding but via different mechanisms. In our assays, GaPP survival was inversely correlated with ingestion rate: eat-2 animals, which feed the least, showed the highest survival, while hrg-9/10 mutants showed intermediate feeding and intermediate survival. Consistent with this, eat-2 worms also displayed the lowest ZnMP accumulation.

      (2) GaPP solution was added to NGM plates after seeding with OP50. This is now expressly stated in the Methods section (page 15). We would note that Sun et al. mixed GaPP in with NGM in the liquid phase. We would expect that if there were a difference in GaPP exposure due to these different protocols, worms in our experiment would have received higher GaPP concentrations.

      “Standard NGM plates were treated with 1, 2, 5, or 10 µM gallium protoporphyrin IX (GaPP; Santa Cruz) after seeding with OP50. Plates were swirled to ensure an even distribution of GaPP and allowed to dry completely.”

      (3) The manuscript has been reworked to read as more of an independent study rather than a rebuttal of prior work, though the primary objective of validating prior work remains unchanged.

      (4) Several technical details of experiments have been moved from the main text to the materials and methods section.

      (5) One reviewer noted that the figure numbering should be adjusted. Numbering does not progress sequentially (i.e., 1A…1B…2A…2B) early in the text, because we have opted to consolidate data pertaining to heme analog experiments in Figure 1 and behavioral data in Figure 2.

      (6) “Kingdoms” has been changed to “domains” (page 4).

      (7) Example images are now included for Figure 1B, as noted above.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment:

      This study introduces an important approach using selection linked integration (SLI) to generate Plasmodium falciparum lines expressing single, specific surface adhesins PfEMP1 variants, enabling precise study of PfEMP1 trafficking, receptor binding, and cytoadhesion. By moving the system to different parasite strains and introducing an advanced SLI2 system for additional genomic edits, this work provides compelling evidence for an innovative and rigorous platform to explore PfEMP1 biology and identify novel proteins essential for malaria pathogenesis including immune evasion.

      Reviewer #1 (Public review):

      One of the roadblocks in PfEMP1 research has been the challenges in manipulating var genes to incorporate markers to allow the transport of this protein to be tracked and to investigate the interactions taking place within the infected erythrocyte. In addition, the ability of Plasmodium falciparum to switch to different PfEMP1 variants during in vitro culture has complicated studies due to parasite populations drifting from the original (manipulated) var gene expression. Cronshagen et al have provided a useful system with which they demonstrate the ability to integrate a selectable drug marker into several different var genes that allows the PfEMP1 variant expression to be 'fixed'. This on its own represents a useful addition to the molecular toolbox and the range of var genes that have been modified suggests that the system will have broad application. As well as incorporating a selectable marker, the authors have also used selection-linked integration (SLI) to introduce markers to track the transport of PfEMP1, investigate the route of transport, and probe interactions with PfEMP1 proteins in the infected host cell.

      What I particularly like about this paper is that the authors have not only put together what appears to be a largely robust system for further functional studies, but they have used it to produce a range of interesting findings including:

      Co-activation of rif and var genes when in a head-to-head orientation.

      The reduced control of expression of var genes in the 3D7-MEED parasite line.

      More support for the PTEX transport route for PfEMP1.

      Identification of new proteins involved in PfEMP1 interactions in the infected erythrocyte, including some required for cytoadherence.

      In most cases the experimental evidence is straightforward, and the data support the conclusions strongly. The authors have been very careful in the depth of their investigation, and where unexpected results have been obtained, they have looked carefully at why these have occurred.

      We thank the reviewer for the kind assessment and the comments to improve the paper.

      (1) In terms of incorporating a drug marker to drive mono-variant expression, the authors show that they can manipulate a range of var genes in two parasite lines (3D7 and IT4), producing around 90% expression of the targeted PfEMP1. Removal of drug selection produces the expected 'drift' in variant types being expressed. The exceptions to this are the 3D7-MEED line, which looks to be an interesting starting point to understand why this variant appears to have impaired mutually exclusive var gene expression, and the EPCR-binding IT4var19 line. This latter finding was unexpected and the modified construct required several rounds of panning to produce parasites expressing the targeted PfEMP1 and binding to EPCR. The authors identified a PTP3 deficiency as the cause of the lack of PfEMP1 expression, which is an interesting finding in itself but potentially worrying for future studies. What was not clear was whether the selected IT4var19 line retained specific PfEMP1 expression once receptor panning was removed.

      We do not have systematic long-term data for the Var19 line but do have medium-term data. After panning the Var19 line, the binding assays were done within 3 months without additional panning. The first binding assay was 2 months after the panning and the last binding assays three weeks later, totaling about 3 months without panning. While there is inherent variation in these assays that precludes detection of smaller changes, the last assay showed the highest level of binding, giving no indication for rapid loss of the binding phenotype. Hence, we can say that the binding phenotype appears to be stable for many weeks without panning the cells again and there was no indication for a rapid loss of binding in these parasites.

      Systematic long-term experiments to assess how long the Var19 parasites retain binding would be interesting, but given that the binding-phenotype appears to remain stable over many weeks or even months, this would only make sense if done over a much longer time frame. Such data might arise if the line is used over extended times for a specific project in which case it might be advisable to monitor continued binding. We included a statement in the discussion that the binding phenotype was stable over many weeks but that if long-term work with this line is planned, monitoring the binding phenotype might be advisable: “In the course of this work the binding phenotype of the IT4var19 expressor line remained stable over many weeks without further panning. However, given that initial panning had been needed for this particular line, it might be advisable for future studies to monitor the binding phenotype if the line is used for experiments requiring extended periods of cultivation.”

      (2) The transport studies using the mDHFR constructs were quite complicated to understand but were explained very clearly in the text with good logical reasoning.

      We are aware of this being a complex issue and are glad this was nevertheless understandable.

      (3) By introducing a second SLI system, the authors have been able to alter other genes thought to be involved in PfEMP1 biology, particularly transport. An example of this is the inactivation of PTP1, which causes a loss of binding to CD36 and ICAM-1. It would have been helpful to have more insight into the interpretation of the IFAs as the anti-SBP1 staining in Figure 5D (PTP-TGD) looks similar to that shown in Figure 1C, which has PTP intact. The anti-EXP2 results are clearly different.

      We realize the description of the PTP1-TGD IFA data and that of the other TGDs (see also response to Recommendation to authors point 4 and reviewer 2, major points 6 and 7) was rather cursory. The previously reported PTP1 phenotype is a fragmentation of the Maurer’s clefts into what in IFA appear to be many smaller pieces (Rug et al 2014, referenced in the manuscript). The control in Fig. 5D has 13 Maurer’s cleft spots (previous work indicates an average of ~15 MC per parasite, see e.g. the originally co-submitted eLife preprint doi.org/10.7554/eLife.103633.1 and references therein). The control mentioned by the reviewer in Fig. 1C has about 22 Maurer’s clefts foci, at the upper end of the typical range, but not unusual. In contrast, the PTP1-TGD in Fig. 5D has more than 30 foci with an additional cytoplasmic pool and additional smaller, difficult to count foci. This is consistent with the published phenotype in Rug et al 2014. The EXP1 stained cell has more than 40 Maurer’s cleft foci, again beyond what typically is observed in controls. Therefore, these cells show a difference to the control in Fig. 5 but also to Fig. 1C. Please note that we are looking at two different strains: in Fig. 1 it is 3D7 and in Fig. 5 it is IT4. While we did not systematically assess this, the Maurer’s clefts number per cell seemed to be largely comparable between these strains (Fig. 10C and D in the other eLife preprint doi.org/10.7554/eLife.103633.1).

      Overall, as the PTP1 loss phenotype has already been reported, we did not go into more experimental detail. However, we now modified the text to more clearly describe how the phenotype in the PTP1-TGD parasites was different to control: “IFAs showed that in the PTP1-TGD parasites, SBP1 and PfEMP1 were found in many small foci in the host cell that exceeded the average number of ~ 15 Maurer’s clefts typically found per infected RBC [66] (Fig. 5D). This phenotype resembled the previously reported Maurer’s clefts phenotype of the PTP1 knock out in CS2 parasites [39].”

      (4) It is good to see the validation of PfEMP1 expression includes binding to several relevant receptors. The data presented use CHO-GFP as a negative control, which is relevant, but it would have been good to also see the use of receptor mAbs to indicate specific adhesion patterns. The CHO system is fine for expression validation studies, but due to the high levels of receptor expression on these cells, moving to the use of microvascular endothelial cells would be advisable. This may explain the unexpected ICAM-1 binding seen with the panned IT4var19 line.

      We agree with the reviewer that it is desirable to have better binding systems for studying individual binding interactions. As the main purpose of this paper was to introduce the system and provide proof of principle that the cells show binding, we did not move to more complicated binding systems. However, we would like to point out that the CSA binding was done on receptor alone in addition to the CSA-expressing HBEC-5i cells and was competed successfully with soluble CSA. In addition, apart from the additional ICAM1-binding of the Var19 line, all binding phenotypes conformed to expectations. We therefore hope the tools used for binding studies are acceptable at this stage of introducing the system while future work interested in specific PfEMP1 receptor interactions may use better systems, tailored to the specific question (e.g. endothelial organoid models and engineered human capillaries and inhibitory antibodies or relevant recombinant domains for competition).

      (5) The proxiome work is very interesting and has identified new leads for proteins interacting with PfEMP1, as well as suggesting that KAHRP is not one of these. The reduced expression seen with BirA* in position 3 is a little concerning but there appears to be sufficient expression to allow interactions to be identified with this construct. The quantitative impact of reduced expression for proxiome experiments will clearly require further work to define it.

      This is a valid point. Clearly there seems to be some impact on binding when BirA* is placed in the extracellular domain (either through reduced presentation or direct reduction of binding efficiency of the modified PfEMP1; please see also minor comment 10 reviewer 2). The exact quantitative impact on the proxiome is difficult to assess but we note that the relative enrichment of hits to each other is rather similar to the other two positions (Fig. 6H-J). We therefore believe the BioIDs with the 3 PfEMP1-BirA* constructs are sufficient to provide a general coverage of proteins proximal to PfEMP1 and hope this will aid in the identification of further proteins involved in PfEMP1 transport and surface display as illustrated with two of the hits targeted here.

      The impact of placing a domain on the extracellular region of PfEMP1 will have to be further evaluated if needed in other studies. But the finding that a large folded domain can be placed into this part at all, even if binding was reduced, in our opinion is a success (it was not foreseeable whether any such change would be tolerated at all).

      (6) The reduced receptor binding results from the TryThrA and EMPIC3 knockouts were very interesting, particularly as both still display PfEMP1 on the surface of the infected erythrocyte. While care needs to be taken in cross-referencing adhesion work in P. berghei and whether the machinery truly is functionally orthologous, it is a fair point to make in the discussion. The suggestion that interacting proteins may influence the "correct presentation of PfEMP1" is intriguing and I look forward to further work on this.

      We hope future work will be able to shed light on this.

      Overall, the authors have produced a useful and reasonably robust system to support functional studies on PfEMP1, which may provide a platform for future studies manipulating the domain content in the exon 1 portion of var genes. They have used this system to produce a range of interesting findings and to support its use by the research community. Finally, a small concern. Being able to select specific var gene switches using drug markers could provide some useful starting points to understand how switching happens in P. falciparum. However, our trypanosome colleagues might remind us that forcing switches may show us some mechanisms but perhaps not all.

      Point noted! From non-systematic data with the Var01 line that has been cultured for extended periods of time (several years), it seems other non-targeted vars remain silent in our SLI “activation” lines, but how much SLI-based var-expression “fixing” tampers with the integrity of natural switching mechanisms is indeed very difficult to gauge at this stage. We have now added a statement to the discussion that even if mutually exclusive expression is maintained, it is not certain the mechanisms controlling var expression all remain intact: “However, it should be noted that it is not known whether all mechanisms controlling mutually exclusive expression and switching remain intact in parasites with SLI-activated var genes.”

      Reviewer #2 (Public review):

      Summary

      Cronshagen et al. develop a range of tools based on selection-linked integration (SLI) to study PfEMP1 function in P. falciparum. PfEMP1 is encoded by a family of ~60 var genes subject to mutually exclusive expression. Switching expression between different family members can modify the binding properties of the infected erythrocyte while avoiding the adaptive immune response. Although critical to parasite survival and malaria disease pathology, PfEMP1 proteins are difficult to study owing to their large size and variable expression between parasites within the same population. The SLI approach previously developed by this group for genetic modification of P. falciparum is employed here to selectively and stably activate the expression of target var genes at the population level. Using this strategy, the binding properties of specific PfEMP1 variants were measured for several distinct var genes with a novel semi-automated pipeline to increase throughput and reduce bias. Activation of similar var genes in both the common lab strain 3D7 and the cytoadhesion competent FCR3/IT4 strain revealed higher binding for several PfEMP1 IT4 variants with distinct receptors, indicating this strain provides a superior background for studying PfEMP1 binding. SLI also enables modifications to target var gene products to study PfEMP1 trafficking and identify interacting partners by proximity-labeling proteomics, revealing two novel exported proteins required for cytoadherence. Overall, the data demonstrate a range of SLI-based approaches for studying PfEMP1 that will be broadly useful for understanding the basis for cytoadhesion and parasite virulence.

      We thank the reviewer for the kind assessment and the comments to improve the paper.

      Comments

      (1) While the capability of SLI to actively select var gene expression was initially reported by Omelianczyk et al., the present study greatly expands the utility of this approach. Several distinct var genes are activated in two different P. falciparum strains and shown to modify the binding properties of infected RBCs to distinct endothelial receptors; development of SLI2 enables multiple SLI modifications in the same parasite line; SLI is used to modify target var genes to study PfEMP1 trafficking and determine PfEMP1 interactomes with BioID. Curiously, Omelianczyk et al. activated a single var (Pf3D7_0421300) and observed elevated expression of an adjacent var arranged in a head-to-tail manner, possibly resulting from local chromatin modifications enabling expression of the neighboring gene. In contrast, the present study observed activation of neighboring genes with head-to-head but not head-to-tail arrangement, which may be the result of shared promoter regions. The reason for these differing results is unclear, although it should be noted that the two studies examined different var loci.

      The point that we are looking at different loci is very valid, and we realize this was not mentioned in the discussion. We have now added to the discussion that it is unclear whether our results and those cited can be generalized and that different var gene loci may respond differently:

      “However, it is unclear if this can be generalized and it is possible that different var loci respond differently.”

      (2) The IT4var19 panned line that became binding-competent showed increased expression of both paralogs of ptp3 (as well as a phista and gbp), suggesting that overexpression of PTP3 may improve PfEMP1 display and binding. Interestingly, IT4 appears to be the only known P. falciparum strain (only available in PlasmoDB) that encodes more than one ptp3 gene (PfIT_140083100 and PfIT_140084700). PfIT_140084700 is almost identical to the 3D7 PTP3 (except for a ~120 residue insertion in 3D7 beginning at residue 400). In contrast, while the C-terminal region of PfIT_140083100 shows near-perfect conservation with 3D7 PTP3 beginning at residue 450, the N-terminal regions between the PEXEL and residue 450 are quite different. This may indicate the generally stronger receptor binding observed in IT4 relative to 3D7 results from increased PTP3 activity due to multiple isoforms or that specialized trafficking machinery exists for some PfEMP1 proteins.

      We thank the reviewer for pointing this out. The exact differences between the two PTP3s of IT4 and that of other strains should definitely be closely examined if the function of these proteins in PfEMP1 binding is analysed in more detail.

      It is an interesting idea that the PTP3 duplication could be a reason for the superior binding of IT4. We always assumed that IT4 binds better because it is less culture-adapted, but this does not preclude that PTP3(s) is (are) a reason for this. However, at least in our 3D7, PTP3 can’t be the reason for the poor binding, as our 3D7 still has PfEMP1 on the surface, whereas in the unpanned IT4-Var19 line and in the ptp3 KO of Maier et al., Cell 2008 (PMID: 18614010), PfEMP1 is no longer on the surface.

      Testing the impact of having two PTP3s would be interesting, but given the “mosaic” similarity of the two PTP3 isoforms, a simple add-on experiment might not be informative. Nevertheless, it will be interesting in future work to explore this in more detail.

      Reviewer #3 (Public review):

      Summary:

      The submission from Cronshagen and colleagues describes the application of a previously described method (selection linked integration) to the systematic study of PfEMP1 trafficking in the human malaria parasite Plasmodium falciparum. PfEMP1 is the primary virulence factor and surface antigen of infected red blood cells and is therefore a major focus of research into malaria pathogenesis. Since the discovery of the var gene family that encodes PfEMP1 in the late 1990s, there have been multiple hypotheses for how the protein is trafficked to the infected cell surface, crossing multiple membranes along the way. One difficulty in studying this process is the large size of the var gene family and the propensity of the parasites to switch which var gene is expressed, thus preventing straightforward gene modification-based strategies for tagging the expressed PfEMP1. Here the authors solve this problem by forcing the expression of a targeted var gene by fusing the PfEMP1 coding region with a drug-selectable marker separated by a skip peptide. This enabled them to generate relatively homogenous populations of parasites all expressing tagged (or otherwise modified) forms of PfEMP1 suitable for study. They then applied this method to study various aspects of PfEMP1 trafficking.

      Strengths:

      The study is very thorough, and the data are well presented. The authors used SLI to target multiple var genes, thus demonstrating the robustness of their strategy. They then perform experiments to investigate possible trafficking through PTEX, they knock out proteins thought to be involved in PfEMP1 trafficking and observe defects in cytoadherence, and they perform proximity labeling to further identify proteins potentially involved in PfEMP1 export. These are independent and complementary approaches that together tell a very compelling story.

      We thank the reviewer for the kind assessment and the comments to improve the paper.

      Weaknesses:

      (1) When the authors targeted IT4var19, they were successful in transcriptionally activating the gene; however, they did not initially obtain cytoadherent parasites. To observe binding to ICAM-1 and EPCR, they had to perform selection using panning. This is an interesting observation and potentially provides insights into PfEMP1 surface display, folding, etc. However, it also raises questions about other instances in which cytoadherence was not observed. Would panning of these other lines have successfully selected for cytoadherent infected cells? Did the authors attempt panning of their 3D7 lines? Given that these parasites do export PfEMP1 to the infected cell surface (Figure 1D), it is possible that panning would similarly rescue binding. Likewise, the authors knocked out PTP1, TryThrA, and EMPIC3 and detected a loss of cytoadhesion, but they did not attempt panning to see if this could rescue binding. To ensure that the lack of cytoadhesion in these cases is not serendipitous (as it was when they activated IT4var19), they should demonstrate that panning cannot rescue binding.

      These are very important considerations. Indeed, we had repeatedly attempted to pan 3D7 when we failed to get the SLI-generated 3D7 PfEMP1 expressor lines to bind, but this had not been successful. The lack of binding had been a major obstacle that had held up the project and was only solved when we moved to IT4 which readily bound (apart from Var19 which was created later in the project). After that we made no further efforts to understand why 3D7 does not bind but the fact that PfEMP1 is on the surface indicates this is not a PTP3 issue because loss of PTP3 also leads to loss of PfEMP1 surface display. Also, as the parent 3D7 could not be panned, we assumed this issue is not easily fixed in the SLI var lines we made in 3D7.

      Panning the TGD lines: we see the reasoning for conducting panning experiments with the TGD lines. However, on second thought, we are unsure this should be attempted. The outcome might not be easily interpretable as at least two forces will contribute to the selection in panning experiments with TGD lines that do not bind anymore:

      Firstly, panning would work against the SLI of the TGD, resulting in a tug of war between the TGD-SLI and binding. This is because a small number of parasites will loop out the TGD plasmid (revert) and would normally be eliminated during standard culturing due to the SLI drug used for the TGD. These revertant cells would bind and the panning would enrich them. Hence, panning and SLI are opposed forces in the case of a TGD abolishing binding. It is unclear how strong this effect would be, but this would for sure lead to mixed populations that complicate interpretations. 

      The second selecting force is possible compensatory changes that restore binding. These can have different causes: (i) reversal of potential independent changes that may have occurred in the TGD parasites and that are in reality causing the binding loss (such as ptp3 loss or similar, the concern of the reviewer) or (ii) new changes that compensate for the loss of the TGD target (in this case the TGD is the cause of the binding loss, but a different change ameliorates it, for instance by increasing PfEMP1 expression or surface display). As both TGDs show some residual binding and have VAR01 on the surface to at least some extent, it is possible that new compensatory changes might indeed occur that indirectly increase binding again.

      In summary, even if more binding occurs after panning of the lines, it is not clear whether this is due to a compensatory change ameliorating the TGD, reversal of an unrelated change, or counter-selection against the SLI. To determine the cause, the panned TGD lines would need to be subjected to a complex and time-consuming analysis (WGS, RNASeq, possibly Maurer’s clefts phenotype) to find out whether they were SLI-revertants, had an unrelated change that was reverted, or had a new compensatory change that helps binding. This might be further muddled if a mix of cells carrying different changes among the options indicated above comes out of the selection. In that case, it might even require scRNASeq to make sense of the panning experiment. Due to the envisaged difficulty in interpreting the outcome, we did not attempt this panning.

      To exclude loss of ptp3 expression as the reason for binding loss (something we would not have seen in the WGS if it is only due to a transcriptional change), we now carried out RNASeq with the TGD lines that have a binding phenotype. While we did not generate replicates to obtain quantitative data, the results show that both ptp3 copies were expressed in these TGDs at levels comparable to other parasite lines that do bind with the same SLI-activated var gene, indicating that the effect is not due to ptp3 (see response to point 4 on PTP3 expression in the Recommendations for the authors). While we can’t fully exclude other changes in the TGDs that might affect binding, the WGS did not show any obvious alterations that could be responsible for this.

      (2) The authors perform a series of trafficking experiments to help discern whether PfEMP1 is trafficked through PTEX. While the results were not entirely definitive, they make a strong case for PTEX in PfEMP1 export. The authors then used BioID to obtain a proxiome for PfEMP1 and identified proteins they suggest are involved in PfEMP1 trafficking. However, it seemed that components of PTEX were missing from the list of interacting proteins. Is this surprising and does this observation shed any additional light on the possibility of PfEMP1 trafficking through PTEX? This warrants a comment or discussion.

      This is an interesting point and we agree that this warrants discussion. A likely reason why PTEX components are not picked up as interactors is that BirA* is expected to be unfolded when it passes through the channel and in that state can’t biotinylate. Labelling would likely only be possible if PfEMP1 lingered at the PTEX translocation step before BirA* became unfolded to go through the channel, which we would not expect under physiological conditions. We added the following sentences to the discussion: “While our data indicates PfEMP1 uses PTEX to reach the host cell, this could be expected to have resulted in the identification of PTEX components in the PfEMP1 proxiomes, which was not the case. However, as BirA* must be unfolded to pass through PTEX, it likely is unable to biotinylate translocon components unless PfEMP1 is stalled during translocation. For this reason, a lack of PTEX components in the PfEMP1 proxiomes does not necessarily exclude passage through PTEX.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Most of my comments are in the public section. I would just highlight a few things:

      (1) In the binding studies section you talk about "human brain endothelial cells (HBEC-5i)". These cells do indeed express CSA but this is a property of their immortalisation rather than being brain endothelium, which does not express CSA. I think this could be confusing to readers so I think you might want to reword this sentence to focus on the CSA-expressing cell line rather than other features.

      We thank the reviewer for pointing this out. We have now modified the sentence to focus on the fact that these are CSA-expressing cells and provided a reference for it.

      (2) As I said in the public section, CHO cells are great for proof of concept studies, but they are not endothelium. Not a problem for this paper.

      Noted! Please also see our response to the public review.

      (3) I wonder whether your comment about how well tolerated the BirA* insertion is may be a bit too strong. I might say "Nonetheless, overall the BirA* modified PfEMP1 were functional."

      Changed as requested.

      (4) I'm not sure how you explain the IFA staining patterns to the uninitiated, but perhaps you could explain some of the key features you are looking for.

      We apologise for not giving an explanation of the IFA staining patterns in the first place. Please see detailed response to public review of this reviewer (point 3 on PTP1-TGD phenotype) and to reviewer 2 (Recommendations to the authors, points 6 and 7 on better explaining and quantifying the Maurer’s clefts phenotypes). For this we now also generated parasites that episomally express mCherry tagged SBP1 in the TGD parasites with the reduced binding phenotype. This resulted in amendments to Fig. S7, addition of a Fig. S8 and updated results to better explain the phenotypes. 

      This is a great paper - I just wish I'd had this system before.

      Thank you!

      Reviewer #2 (Recommendations for the authors):

      Major Comments

      (1) Does the RNAseq analysis of 3D7var0425800 and 3D7MEEDvar0425800 (Figure 1G, H) reveal any differential gene expression that might suggest a basis for loss of mutually exclusive var expression in the MEED line?

      We now carried out a thorough analysis of these RNASeq experiments to look for an underlying cause for the phenotype. This was added as new Figure 1J and new Table S3. This analysis again illustrated the increased transcript levels of var genes. In addition, it showed that transcripts of a number of other exported proteins, including members of other gene families, were up in the MEED line. 

      One hit that might be causal of the phenotype was sip2, which was down by close to 8-fold (pAdj 0.025). While recent work in P. berghei found this ApiAP2 to be involved in the expression of merozoite genes (Nishi et al., Sci Advances 2025 (PMID: 40117352)), previous work in P. falciparum showed that it binds heterochromatic telomere regions and certain var upstream regions (Flück et al., PlosPath 2010 (PMID: 20195509), now cited in the manuscript). The other notable change was an upregulation of the non-coding RNA ruf6, which had been linked with impaired mono-allelic var expression (Guizetti et al., NAR 2016 (PMID: 27466391), now also cited in the manuscript). While it would go beyond this manuscript to follow this up, it is conceivable that alterations in chromosome end biology due to sip2 downregulation or upregulation of ruf6 are causes of the observed phenotype.

      We now added a paragraph on the more comprehensive analysis of the RNA Seq data of the MEED vs non-MEED lines at the end of the second results section.

      (2) Could the inability of the PfEMP1-mDHFR fusion to block translocation (Fig 2A) reflect unique features of PfEMP1 trafficking, such as the existence of a soluble, chaperoned trafficking state that is not fully folded? Was a PfEMP1-BPTI fusion ever tested as an alternative to mDHFR?

      This is an interesting suggestion. The PfEMP1-BPTI was never tested. However, a chaperoned trafficking state would likely also affect BPTI. Given that both domains (mDHFR and BPTI) in principle do the same when folded and would block when the construct is in the PV, it is not so likely that using a different blocking domain would make a difference. Therefore, the scenario where BPTI would block when mDHFR does not is not that probable. The opposite would be possible (mDHFR blocking while BPTI does not, because only the latter depends on the redox state). However, this would only happen if the block occurred before the construct reaches the PV.

      At present, we believe the lack of a block to be due to the organization of the domains in the construct. In the PfEMP1-mDHFR construct in this manuscript the position of the blocking domain is further away from the TMD compared to all other previously tested mDHFR fusions. Increased distance to the TMD has previously been found to be a factor impairing the blocking function of mDHFR (Mesen-Ramirez et al., PlosPath 2016 (PMID: 27168322)). Hence, we suspect that this, rather than the type of blocking domain, is the reason for the lack of a block with the PfEMP1-mDHFR. However, the latter option can’t be fully excluded and we might test BPTI in future work.

      (3) The late promoter SBP1-mDHFR is 2A fused with the KAHRP reporter. Since 2A skipping efficiency varies between fusion contexts and significant amounts of unskipped protein can be present, it would be helpful to include a WB to determine the efficiency of skipping and provide confidence that the co-blocked KAHRP in the +WR condition (Fig 2D) is not actually fused to the C-terminus of SBP1-mDHFR-GFP.

      Fortunately, this T2A fusion (crt_SBP1-mDHFR-GFP-2A-KAHRP-mScarlet<sup>epi</sup>) was used before in work that included a Western blot showing its efficient skipping (S3 A Fig in Mesen-Ramirez et al., PlosPath 2016). In agreement with this Western blot result, fluorescence microscopy showed very limited overlap of SBP1-mDHFR-GFP and KAHRP-mCherry in the absence of WR (Fig. 3B in Mesen-Ramirez et al., PlosPath 2016 and Fig. 2 in this manuscript), which would not be the case if these two constructs were fused together. Please note that KAHRP is known to transiently localize to the Maurer’s clefts before reaching the knobs (Wickham et al., EMBOJ 2001, PMID: 11598007), and therefore occasional overlap with SBP1 at the Maurer’s clefts is expected. However, we would expect much more overlap if a substantial proportion of the construct population were not skipped; the co-blocked KAHRP-mCherry in the +WR sample is therefore unlikely to be due to inefficient skipping and attachment to SBP1-mDHFR-GFP.

      (4) Does comparison of RNAseq from the various 3D7 and IT4 lines in the study provide any insight into PTP3 expression levels between strains with different binding capacities? Was the expression level of ptp3a/b in the IT4var19 panned line similar to the expression in the parent or other activated IT4 lines? Could the expanded ptp3 gene number in IT4 indicate that specialized trafficking machinery exists for some PfEMP1 proteins (ie, IT4var19 requires the divergent PTP3 paralog for efficient trafficking)?

      PTP3 in the different IT4 lines that bind:

      In those parasite lines that did bind, the intrinsic variation in the binding assays, the different binding properties of different PfEMP1 variants and the variation in RNA Seq experiments to compare different parasite lines precludes a correlation of binding level vs ptp3 expression. For instance, if a PfEMP1 variant has lower binding capacity, ptp3 may still be higher but binding would be lower than if comparing to a parasite line with a better binding PfEMP1 variant. Studying the effect of PTP3 levels on binding could probably be done by overexpressing PTP3 in the same PfEMP1 SLI expressor line and assessing how this affects binding, but this would go beyond this manuscript.

      PTP3 in panned vs unpanned Var19:

      We did some comparisons between the IT4 parent and the panned and unpanned IT4-Var19 lines (see Author response table 1). This did not reveal any clear associations. While the parent had somewhat lower ptp3 transcript levels, they were still clearly higher than in the unpanned Var19 line, and other lines also had ptp3 levels comparable to the panned IT4-Var19 (see Author response table 2).

      PTP3 in the TGDs and possible reason for binding phenotype:

      A key point is whether PTP3 could have influenced the lack of binding in the TGD lines (see also weakness section and point 1 of public review of reviewer 3: ptp3 may be an indirect cause resulting in lacking binding in TGD parasites). We now did RNA Seq to check for ptp3 expression in the relevant TGD lines although we did not do a systematic quantitative comparison (which would require 3 replicates of RNASeq), but we reasoned that loss of expression would also be evident in one replicate. There was no indication that the TGD lines had lost PTP3 expression (see Author response table 2) and this is unlikely to explain the binding loss in a similar fashion to the Var19 parasites. Generally, the IT4 lines showed expression of both ptp3 genes and only in the Var19 parasites before panning were the transcript levels considerably lower:

      Author response table 1.

      Parent vs IT4-Var19 panned and unpanned

      Author response table 2.

      TGD lines with binding phenotype vs parent

      The absence of an influence of PTP3 on the binding phenotype in the cell lines in this manuscript (besides Var19) is further supported by its role in PfEMP1 surface display. Previous work has shown that KO of ptp3 leads to a loss of VAR2CSA surface display (Maier et al., Cell 2008). The unpanned Var19 parasite also lacked PfEMP1 surface display, and panning and the resulting appearance of the binding phenotype were accompanied by surface display of PfEMP1. As both the EMPIC3-TGD and TryThrA-TGD lines still had at least some PfEMP1 on the surface, this also (in addition to the RNA Seq above) speaks against PTP3 being the cause of the binding phenotype. The same applies to 3D7, which despite the poor binding displays PfEMP1 on the host cell surface (Figure 1D). This indicates that the binding phenotype in 3D7 is also not due to loss of PTP3 expression, as this would have abolished PfEMP1 surface display.

      The idea about PTP3 paralogs for specific PfEMP1s is intriguing. In the future it might be interesting to test the frequency of parasites with two PTP3 paralogs in endemic settings and correlate it with the PfEMP1 repertoire, variant expression and potentially disease severity. 

      (5) The IT4var01 line shows substantially lower binding in Figure 5F compared with the data shown in Figure 4E and 6F. Does this reflect changes in the binding capacity of the line over time or is this variability inherent to the assay?

      There is some inherent variability in these assays. While we did not systematically assess this, we had no indication that this was due to the parasite line changing. The Var01 line was cultured for months and was frozen down and thawed more than once without a clear gradual trend for more or less binding. While we can’t exclude some variation from the parasite side, we suspect it is more a factor of the expression of the receptor on the CHO cells the iRBCs bind to. 

      Specifically, the assays in Fig. 6F and 4E mentioned by the reviewer both had an average binding to CD36 of around 1000 iE/mm2, only the experiments in Fig. 5F are different (~ 500 iE/mm2) but these were done with a different batch of CHO cells at a different time to the experiments in Fig. 6F and 4E. 

      (6) In Figure S7A, TryThrA and EMPIC3 show distinct localization as circles around the PfEMP1 signal while PeMP2 appears to co-localize with PfEMP1 or as immediately adjacent spots (strong colocalization is less apparent than SBP1, and the various PfEMP1 IFAs throughout the study). Does this indicate that TryThrA and EMPIC3 are peripheral MC proteins? Does this have any implications for their function in PfEMP1 binding? Some discussion would help as these differences are not mentioned in the text. For the EMPIC3 TGD IFAs, localization of SBP1 and PfEMP1 is noted to be normal but REX1 is not mentioned (although this also appears normal).

      We apologise for the lacking description of the candidate localisations and cursory description of the Maurer’s clefts phenotypes (next point). Our original intent was to not distract too much from the main flow of the manuscript as almost every part of the manuscript could be followed up with more details. However, we fully agree that this is unsatisfactory and now provided more description (this point) and more data (next point).

      Localisation of TryThrA and EMPIC3 compared to PfEMP1 at the Maurer’s clefts: the circular pattern is reminiscent of the results with Maurer’s clefts proteins reported by McMillan et al using 3D-SIM in 3D7 parasites (McMillan et al., Cell Microbiology 2014 (PMID: 23421990)). In that work SBP1 and MAHRP1 (both integral TMD proteins) were found in foci but REX1 (no TMD) in circular structures around these foci similar to what we observed here for TryThrA and EMPIC3 which both also lack a TMD. The SIM data in McMillan et al indicated that also PfEMP1 is “more peripheral”, although it did only partially overlap with REX1. The conclusion from that work was that there are sub-compartments at the Maurer’s clefts. In our IFAs (Fig. S7A) PfEMP1 is also only partially overlapping with the TryThrA and EMPIC3 circles, potentially indicating similar subcompartments to those observed by 3D-SIM. We agree with the reviewer that this might be indicative of peripheral MC proteins, fitting with a lack of TMD in these candidates, but we did not further speculate on this in the manuscript.

      We now added enlargements of the ring-like structures to better illustrate this observation in Fig. S7A. In addition, we now specifically mention the localization data and the ring like signal with TryThrA and EMPIC3 in the results and state that this may be similar to the observations by McMillan et al., Cell Microbiology 2014.

      We also thank the reviewer for pointing out that we had forgotten to mention REX1 in the EMPIC3-TGD. This was amended.

      (7) The atypical localization in TryThrA TGD line claimed for PfEMP1 and SBP1 in Fig S7B is not obvious. While most REX1 is clustered into a few spots in the IFA staining for SBP1 and REX1, SBP1 is only partially located in these spots and appears normal in the above IFA staining for SBP1 and HA. The atypical localization of PfEMP1-HA is also not obvious to me. The authors should clarify what is meant by "atypical" localization and provide support with quantification given the difference between the two SBP1 images shown.

      We apologise for the inadequate description of these IFA phenotypes. The abnormal signal for SBP1, REX1 and PfEMP1 in the TryThrA-TGD included two phenotypes found with all 3 proteins: 

      (1) a dispersed signal for these proteins in the host cell in addition to foci (the control and the other TGD parasites have only dots in the host cell with no or very little detectable dispersed signal). 

      (2) foci of disproportionally high intensity and size, that we assumed might be aggregation or enlargement of the Maurer’s clefts or of the detected proteins.

      The reason for the difference between the REX1 (aggregation) phenotype and the PfEMP1 and SBP1 (dispersed signal, more smaller foci) phenotypes in the images in Fig. S7B is that both phenotypes were seen with all 3 proteins but we chose a REX1 stained cell to illustrate the aggregation phenotype (the SBP1 signal in the same cell is similar to the REX1 signal, illustrating that this phenotype is not REX1 specific; please note that this cell also has a dispersed pool of REX1 and SBP1). 

      Based on the IFAs, 66% (n = 106 cells) of the cells in the TryThrA-TGD parasites had one or both of the observed phenotypes. We did not include this in the previous version of the manuscript because a description would have required detouring from the main focus of this results section. In addition, IFAs have some limitations for accurate quantifications, particularly for soluble pools (depending on fixing efficiency and agent, more or less of a soluble pool in the host cell can leak out).
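
      To give a sense of the sampling uncertainty attached to a proportion scored from about a hundred cells, the following is a minimal sketch of a 95% Wilson score interval. It is an illustration only and not part of the analysis in the manuscript. The count of 70 affected cells is an assumption obtained by rounding 66% of 106; the function name `wilson_interval` is likewise hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a binomial proportion.

    successes: number of cells showing the phenotype (assumed count).
    n: total number of cells scored.
    z: normal quantile (1.96 corresponds to roughly 95% coverage).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Assumed count: 66% of 106 cells is approximately 70 cells.
    low, high = wilson_interval(70, 106)
    print(f"phenotype frequency: {70 / 106:.2f} (95% CI {low:.2f} to {high:.2f})")
```

      Such an interval only captures counting noise; it does not address the fixation-dependent loss of soluble pools mentioned above.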

      To answer the request to better explain and quantify the phenotype and given the limitations of IFA, we now transfected the TryThrA-TGD parasites with a plasmid mediating episomal expression of SBP1-mCherry, permitting live cell imaging and a better classification of the Maurer’s clefts phenotype. Due to the two SLI modifications in these parasites (using up 4 resistance markers) we had to use a new selection marker (mutated lactate transporter PfFNT, providing resistance to BH267.meta (Walloch et al., J. Med. Chem. 2020 (PMID: 32816478))) to transfect these parasites with an additional plasmid. 

      These results are now provided as Fig. S8 and detailed in the last results section. The new data show that the majority of the TryThrA-TGD parasites contain a dispersed pool of SBP1 in the host cell. About a third of the parasites also showed disproportionally strong SBP1 foci that may be aggregates of the Maurer’s clefts. We also transfected the EMPIC3-TGD parasites with the FNT plasmid mediating episomal SBP1-mCherry expression and observed only a few cells with a cytoplasmic pool or aggregates (Fig. S8). Overall, these findings agree with the previous IFA results. As the IFA suggests similar results also for REX1 and PfEMP1, this defect is likely not SBP1 specific but more general (Maurer’s clefts morphology; association or transport of multiple proteins to the Maurer’s clefts). This gives a likely explanation for the cytoadherence phenotype in the TryThrA-TGD parasites. The reason for the EMPIC3-TGD phenotype remains to be determined, as we did not detect obvious changes in the Maurer’s clefts morphology or in the transport of proteins to these structures in these experiments. 

      Minor comments

      (1) Italicized numbers in parentheses are present in several places in the manuscript but it is not clear what these refer to (perhaps differently formatted citations from a previous version of the manuscript). Figure 1 legend: (121); Figure S3 legend: (110), (111); Figure S6 legend: (66); etc.

      We thank the reviewer for pointing out this issue with the references; this was amended.

      (2) Figure 5A and legend: "BSD-R: BSD-resistance gene". Blasticidin-S (BS) is the drug while Blasticidin-S deaminase (BSD) is the resistance gene.

      We thank the reviewer for pointing this out; the legend and figure were changed.

      (3) Figure 5E legend: µ-SBP1-N should be α-SBP1-N.

      This was amended.

      (4) Figure S5 legend: "(Full data in Table S1)" should be Table S3.

      This was amended.

      (5) Figure S1G: The pie chart shows PF3D7_0425700 accounts for 43% of rif expression in 3D7var0425800 but the text indicates 62%.

      We apologize for this mistake; the text was corrected. We also improved the citations to Fig. S1G and H in this section.

      (6) "most PfEMP1-trafficking proteins show a similar early expression..." The authors might consider including a table of proteins known to be required for EMP1 trafficking and a graph showing their expression timing. Are any with later expressions known?

      Most exported proteins are expressed early, which is nicely shown in Marti et al 2004 (cited for the statement) in a graph of the expression timing of all PEXEL proteins (Fig. 4B in that paper). PNEPs also have a similar profile (Grüring et al 2011, also cited for that statement), further illustrated by using early expression as a criterion to find more PNEPs (Heiber et al., 2013 (PMID: 23950716)). Together this includes most if not all of the known PfEMP1 trafficking proteins. The originally co-submitted paper (Blancke-Soares & Stäcker et al., eLife preprint doi.org/10.7554/eLife.103633.1) analysed several later expressed exported proteins (Pf332, MSRP6) but their disruption, while influencing Maurer’s clefts morphology and anchoring, did not influence PfEMP1 transport. However, there are some conflicting results for Pf332 (referenced in Blancke-Soares & Stäcker et al). This illustrates that it may not be so easy to decide which proteins are bona fide PfEMP1 trafficking proteins. We therefore did not add a table and hope it is acceptable for the reader to rely on the provided 3 references to back this statement.

      (7)  Figure S1J: The predominate var in the IT4 WT parent is var66 (which appears to be syntenic with Pf3D7_0809100, the predominate var in the 3D7 WT parent). Is there something about this locus or parasite culture conditions that selects for these vars in culture? Is this observed in other labs as well?

      This is a very interesting point (although we are not certain these vars are indeed syntenic, they are on different chromosomes). As far as we know at least Pf3D7_0809100 is commonly a dominant var transcribed in other labs and was found expressed also in sporozoites (Zanghì et al. Cell Rep. 2018). However, it is unclear how uniform this really is. For IT4 we do not know in full but have also here commonly observed centromeric var genes to be dominating transcripts in unselected parasite cultures. It is possible that transcription drifts to centromeric var genes in cultured parasites. However, given the anecdotal evidence, it is unknown to which extent this is related to an inherent switching and regulation regime or a consequence of faulty regulation following prolonged culturing.

      (8) Figure 4B, C: Presumably the asterisks on the DNA gels indicate non-specific bands but this is not described in the legend. Why are non-specific bands not consistent between parent and integrated lanes?

      We apologize for not mentioning this in the legend; this was amended.

      It is not clear why the non-specific bands differ between the lines but in part this might be due to different concentrations and quality of DNA preps. A PCR can also behave differently depending on whether the correct primer target is present or not. If present, the PCR will run efficiently and other spurious products will be outcompeted, but in absence of the correct target, they might become detectable.  

      Overall, we do not think the non-specific bands indicate anything untoward with the lines. For instance, in Fig. 4B the high band in the 5’ integration in the IT4 line (which does not occur anywhere else) cannot be due to a genomic change, as this is the parental line and does not contain the plasmid for integration. In the same gel, the ori locus band of incorrect size (likely due to cross-reaction of the primers with another var gene, which is not always fully avoidable because of the high similarity of the ATS region) is present in both the parent IT4 and the integrant line and is therefore also not of concern. In C there are a couple of bands of incorrect size in the integration line. One of these is very faint and both are too large; they are therefore likely other vars that are inefficiently picked up by these primers. They are not seen in the parent line because the correct primer binding site is present there, which efficiently yields a product that outcompetes products from non-optimally matching primer sites; such products therefore appear only in the Int line, where the correct match is no longer present. For these reasons we believe these bands are not of any concern.  

      (9) Figure 4C: Is there a reason KAHRP was used as a co-marker for the IFA detecting IT4var19 expression instead of SBP1 which was used throughout the rest of the study?

      This is a coincidence, as this line was tested when other lines were tested for KAHRP. As there were foci in the host cell, we were satisfied that the HA-tagged PfEMP1 is produced and deemed the localization plausible. 

      (10) Figure 6: Streptavidin labeling for the IT4var01-BirA position 3 line is substantially less than the other two lines in both IFA and WB. Does the position 3 fusion reduce PfEMP1 protein levels or is this a result of the context or surface display of the fusion? Interestingly, the position 3 trypsin cleavage product appears consistently more robust compared with the other two configurations. Does this indicate that positioning BirA upstream of the TM increases RBC membrane insertion and/or makes the surface localized protein more accessible to trypsin?

      It is possible that RBC membrane insertion or trypsin accessibility is increased for the position 3 construct. But there could also be other explanations:

      The reason for the more robustly detected protected fragment for the position 3 construct in the WB might also be its smaller size (in contrast to the other two versions, it does not contain BirA*) which might permit more efficient transfer to the WB membrane. In that case the more robust band might not (only) be due to better membrane insertion or better trypsin accessibility.

      The lower biotinylation signal with the position 3 construct might also be explained by the greater distance of BirA* from the ATS (compared to positions 1 and 2), the region where interactors are expected to bind. The position 1 and 2 constructs may therefore generally be more efficient (being closer) at biotinylating ATS-proximal proteins. Further, in the final destination (PfEMP1 inserted into the RBC membrane) BirA* would be on the other side of the membrane in the position 3 construct, while in the position 1 and 2 constructs BirA* would be on the side of the membrane where the ATS anchors PfEMP1 in the knob structure. In that case, labelling with position 3 would come from interactions/proximities during transport or at the Maurer’s clefts (if PfEMP1 is indeed not membrane embedded there) and might therefore be less.

      Hence, while alterations in trypsin accessibility and RBC membrane insertion are possible explanations, other explanations exist. At present, we do not know which of these explanations apply and therefore did not mention any of them in the manuscript. 

      Reviewer #3 (Recommendations for the authors):

      (1) In the abstract and on page 8, the authors mention that they generate cell lines binding to "all major endothelial receptors" and "all known major receptors". This is a pretty all-encompassing statement that might not be fully accepted by others who have reported binding to other receptors not considered in this paper (e.g. VCAM, TSP, hyaluronic acid, etc). It would be better to change this statement to something like "the most common endothelial receptors" or "the dominant endothelial receptors", or something similar.

      We agree with the reviewer that these statements are too all-encompassing and changed them to “the most common endothelial receptors” (introduction) and “the most common receptors” (results).

      (2) The authors targeted two rif genes for activation and in each case the gene became the most highly expressed member of the family. However, unlike var genes, there were other rif genes also expressed in these lines and the activated copy did not always make up the majority of rif mRNAs. The authors might wish to highlight that this is inconsistent with mutually exclusive expression of this gene family, something that has been discussed in the past but not definitively shown.

      We thank the reviewer for highlighting this, we now added the following statement to this section: “While SLI-activation of rif genes also led to the dominant expression of the targeted rif gene, other rif genes still took up a substantial proportion of all detected rif transcripts, speaking against a mutually exclusive expression in the manner seen with var genes.”

      (3) In Figure 6, H-J, the authors display volcano plots showing proteins that are thought to interact with PfEMP1. These are labeled with names from the literature, however, several are named simply "1, 2, 3, 4, 5, or 6". What do these numbers stand for?

      We apologize for not clarifying this and thank the reviewer for pointing this out. There is a legend for the numbered proteins in what is now Table S4 (previously Table S3). We have now amended the legend of Figure 6 to explain the numbers and point the reader to Table S4 for the accessions.

    1. Author response:

      The following is the authors’ response to the original reviews.

      The major changes to the text are as follows:

      We re-analyzed the dataset and improved the local resolution of the extracellular region (Author response image 1).

      We re-modeled the receptor based on the improved density and removed the bicarbonate model in response to comments from all reviewers.

      We performed calcium assays using cell lines stably expressing the mutants, whose surface expression levels were analyzed by fluorescence-activated cell sorting (FACS) (Figure 3F, G and Figure 3–figure supplement 1-3).

      Thus, we significantly revised our discussion of the extracellular binding pocket and the result of the mutational study. In the revised manuscript, we speculate that H307 is a candidate for the bicarbonate binding site.

      Author response image 1.

      Comparison of local resolution between re-analyzed and previous maps. (A) Side and top view of the re-analyzed receptor-focused map of GPR30 colored by local resolution. (B) Side and top view of the previous receptor-focused map of GPR30 colored by local resolution.

      Reviewer #1 (Public Review):

      Summary:

      This study resolves a cryo-EM structure of the GPCR, GPR30, which was recently identified as a bicarbonate receptor by the authors' lab. Understanding the ligand and the mechanism of activation is of fundamental importance to the field of receptor signaling. However, the main claim of the paper, the identification of the bicarbonate binding site, is only partly supported by the structural and functional data, leaving the study incomplete.

      Strengths:

      The overall structure, and proposed mechanism of G-protein coupling seem solid. The authors perform fairly extensive unbiased mutagenesis to identify a host of positions that are important to G-protein signaling. To my knowledge, bicarbonate is the only physiological ligand that has been identified for GPR30, making this study a particularly important contribution to the field.

      Weaknesses:

      Without higher resolution structures and/or additional experimental assessment of the binding pocket, the assignment of the bicarbonate remains highly speculative. The local resolution is especially poor in the ECL loop region where the ligand is proposed to bind (4.3-4.8 Å range). Of course, sometimes it is difficult to achieve high structural resolution, but in these cases, the assignment of ligands should be backed up by even more rigorous experimental validation. The functional assay monitors activation of GPR30, and thus reports on not only bicarbonate binding, but also the integrity of the allosteric network that transduces the binding signal across the membrane. Thus, disruption of bicarbonate signaling by mutagenesis of the putative coordinating residues does not necessarily mean that bicarbonate binding has been disrupted. Moreover, the mutagenesis was apparently done prior to structure determination, meaning that residues proposed to directly surround bicarbonate binding, such as E218, were not experimentally validated. Targeted mutagenesis based on the structure would strengthen the story.

      Moreover, the proposed bicarbonate binding site is surprising in a chemical sense, as it is located within an acidic pocket. The authors cite several other structural studies to support the surprising observation of anionic bicarbonate surrounded by glutamate residues in an acidic pocket (references 31-34). However, it should be noted that in general, these other structures also possess a metal ion (sodium or calcium) and/or a basic sidechain (arginine or lysine) in the coordination sphere, forming a tight ion pair. Thus, the assigned bicarbonate binding site in GPR30 remains an anomaly in terms of the chemical properties of the proposed binding site.

      Thank you for your insightful comments. Based on the weaknesses you pointed out, we reconstructed the receptor based on the improved density and removed the bicarbonate model. We performed calcium assays using cell lines stably expressing the variant based on the structure.

      Reviewer #2 (Public Review):

      Summary:

      In this manuscript, "Cryo-EM structure of the bicarbonate receptor GPR30," the authors aimed to enrich our understanding of the role of GPR30 in pH homeostasis by combining structural analysis with a receptor function assay. This work is a natural development and extension of their previous work (PMID: 38413581). In the current body of work, they solved the first cryo-EM structure of the human GPR30-G-protein (mini-Gsqi) complex in the presence of bicarbonate ions at 3.21 Å resolution. From the atomic model built based on this map, they observed the overall canonical architecture of class A GPCR and also identified 4 extracellular pockets created by extracellular loops (ECLs) (Pockets A-D). Based on the polarity, location, and charge of each pocket, the authors hypothesized that pocket D is a good candidate for the bicarbonate binding site. To verify their structural observation, on top of the 10 mutations they generated in the previous work, the authors introduced another 11 mutations to map out the essential residues for the bicarbonate response on hGPR30. In addition, the human GPR30-G-protein complex model also allowed the authors to untangle the G-protein coupling mechanism of this special class A GPCR that plays an important role in pH homeostasis.

      Strengths:

      As a continuation of their recent Nature Communications publication (PMID: 38413581), this study was carefully designed, and the authors used mutagenesis and functional studies to confirm their structural observations. This work provided high-resolution structural observations for the receptor in complex with G-protein, allowing us to explore its mechanism of action, and will further facilitate drug development targeting GPR30. There were 4 extracellular pockets created by ECLs (Pockets A-D). The authors were able to filter out 3 of them and identified that pocket D was a good candidate for the bicarbonate binding site based on the polarity, location, and charge of each pocket. From there, the authors identified the key residues on GPR30 for its interaction with the substrate, bicarbonate. Together with their previous work, they carefully mapped out nine amino acids that are critical for receptor reactivity.

      Weaknesses:

      It is unclear how novel the aspects presented in the new paper are compared to the most recent Nature Communications publication (PMID: 38413581). Some areas of the manuscript appear to be mixed with the previous publication. The work is still impactful to the field. The new and novel aspects of this manuscript could be better highlighted.

      I also have some concerns about the TGFα shedding assay the authors used to verify their structural observation. I understand that this assay was also used in the authors' previous work published in Nature Communications. However, there are still several things in the current data that raised concerns:

      Thank you for your insightful comments. Based on the weaknesses you pointed out, we highlighted the new and novel aspects of this manuscript. We performed calcium assays using cell lines stably expressing the variant based on the structure.

      (1) The authors confirmed the "similar expression levels of HA-tagged hGPR30" mutants by WB in Supplemental Figure 1A and B. However, compared to the hGPR30-HA (~6.5 when normalized to the housekeeping gene, Na-K-ATPase), several mutants of the key amino acids had much lower surface expression: S134A, D210A, C207A had ~50% reduction, D125A had ~30% reduction, and Q215A and P71A had ~20% reduction. This weakens the receptor reactivity measured by the TGFα shedding assay.

      Since the calcium assay data are included in the main figure, the TGFα shedding assay and WB expression quantification data are now presented in Figure 3–figure supplement 1-4; we included an explanation of the expression levels in the figure caption.

      (2) In the previous work, the authors demonstrated that hGPR30 signals through the Gq signaling pathway and can trigger calcium mobilization. Given that calcium mobilization is a more direct measurement for the downstream signaling of hGPR30 than the TGFα shedding assay, pairing the mutagenesis study with the calcium assay will be a better functional validation to confirm the disruption of bicarbonate signaling.

      As suggested, we performed calcium assays using cell lines stably expressing the mutants (Figure 3F, G and Figure 3–figure supplement 1-3).

      (3) It was quite confusing for Figure 4B that all statistical analyses were done by comparing to the mock group. It would be clearer to compare the activity of the mutants to the wild-type cell line.

      Thank you for your comment. As you suggested, the comparisons are made between wild-type GPR30 and the mutants in the revised manuscript (Figure 3G, Figure 3–figure supplement 4B).

      Additional concerns about the structural data include:

      (1) E218 was in close contact with bicarbonate in Figure 4D. However, there is no functional validation for this observation. Including the mutagenesis study of this site in the cell-based functional assay will strengthen this structural observation.

      We removed the bicarbonate model and performed mutation analysis targeting all residues facing the binding pocket using cell lines that stably express variants, including E218A.

      (2) For the flow chart of the cryo-EM data processing in Supplemental data 2, the authors started with 10,148,422 particles after template picking, then had 441,348 Particles left after 2D classification/heterogenous refinement, and finally ended with 148,600 particles for the local refinement for the final map. There seems to be a lot of heterogeneity in this purified sample. GPCRs usually have flexible and dynamic loop regions, which explains the poor resolution of the ECLs in this case. Thus, a solid cell-based functional validation is a must to assign the bicarbonate binding pocket to support their hypothesis.

      We re-analyzed the dataset, improved the local resolution of the extracellular region (Author response image 1), and removed the bicarbonate model. As suggested by the reviewer, solid cell-based functional validation is an effective way to analyze the receptor's functional response to bicarbonate. Thus, we performed mutation analysis targeting all residues facing the binding pocket using cell lines stably expressing the mutants, whose surface expression levels were analyzed by FACS (Figure 3F, G and Figure 3–figure supplement 1-3).

      Reviewer #3 (Public Review):

      Summary:

      GPR30 responds to bicarbonate and regulates cellular responses to pH and ion homeostasis. However, it remains unclear how GPR30 recognizes bicarbonate ions. This paper presents the cryo-EM structure of GPR30 bound to a chimeric mini-Gq in the presence of bicarbonate. The structure together with functional studies aims to provide mechanistic insights into bicarbonate recognition and G protein coupling.

      Strengths:

      The authors performed comprehensive mutagenesis studies to map the possible binding site of bicarbonate.

      Weaknesses:

      Owing to the poor resolution of the structure, some structural findings may be overclaimed.

      Based on EM maps shown in Figure 1a and Figure Supplement 2, densities for side chains in the receptor, particularly in ECLs (around 4 Å), are poorly defined. At this resolution, it is unlikely to observe a disulfide bond (C130<sup>ECL1</sup>-C207<sup>ECL2</sup>) and bicarbonate ions. Moreover, the disulfide between ECL1 and ECL2 has not been observed in other GPCRs and the published structure of GPR30 (PMID: 38744981). The density of this disulfide bond could be noise.

      The authors observed a weak density in pocket D, which is accounted for by the bicarbonate ions. This ion is mainly coordinated by Q215 and Q138. However, the Q215A mutation only reduced but did not completely abolish the bicarbonate response, and the authors did not present the data for the Q138A mutation. Therefore, Q215 and Q138 could not be bicarbonate binding sites. While H307A completely abolished the bicarbonate response, the authors proposed that this residue plays a structural role. Nevertheless, based on the structure, H307 is exposed and may be involved in binding bicarbonate. The assignment of bicarbonate in the structure is not supported by the data.

      Thank you for your insightful comments. Based on the weaknesses you pointed out, we reconstructed the receptor based on the improved density and removed the bicarbonate model. We performed calcium assays using cell lines stably expressing the variant based on the structure.

      Reviewer #1 (Recommendations For The Authors):

      (1) The experimental validation of the bicarbonate binding could be strengthened by developing an assay that directly monitors bicarbonate binding (rather than GPCR signaling)

      We agree that a direct binding assay for bicarbonate would be highly attractive (e.g., a filter binding assay using ¹⁴C-HCO₃⁻). However, the weak affinity of bicarbonate ions (in the mM range) would make reliable radioisotope-based detection impossible owing to minimal specific receptor occupancy and high non-specific background. Such an assay is therefore highly challenging, and there are limitations to what can be done in this structural paper.

      and determining a structure at comparable resolution in the absence of bicarbonate. In addition, all residues that are proposed to be located adjacent to the bicarbonate should be mutated and functionally validated.

      We re-modeled the receptor based on the improved density and removed the bicarbonate model. We performed calcium assays using cell lines stably expressing the mutants (Figure 3F, G and Figure 3–figure supplement 1-3).

      (2) What are the maps contoured in Figure 4D? The legend should describe this. Is 218 within the map region shown, or is there no density for its sidechain?

      We removed the corresponding figure and the bicarbonate model.

      (3) The contour level of the maps in Figure 1 - Figure Supplement 2 should also be indicated. Are these all contoured at the same level?

      Thank you for your comment. We re-analyzed the same data set and obtained new density maps and models. We reworked Figure 1 and Figure 1–figure supplement 2; the map in Figure 1 and the composite map in Figure 1–figure supplement 2 are contoured at the same level, 7.65. 

      (4) Regarding the cited structures of bicarbonate-binding proteins, for three of the four cited structures, the bicarbonate is actually coordinated by positive ligands, with the Asp/Glu playing a more peripheral role:

      Capper et al: Overall basic cavity with tight bidentate coordination by Arg. The Glu is 5-6 Å away.

      Koropatkin et al: Two structures. The first, solved at pH 5, is proposed to have carbonic acid bound. The second, solved at pH 8, shows carbonate in a complex with calcium, with the calcium coordinated by carboxylates.

      Wang et al: The bicarbonate is coordinated by a lysine and a sodium ion. The sodium is coordinated by carboxylates.

      The authors should more thoughtfully discuss the unusual properties of this binding site with regard to the previous literature. Is it possible that bicarbonate binds in complex with a metal ion? Could this possibility be experimentally tested?

      We removed the bicarbonate model.

      (5) As a structure of GPR30 has been recently published by another group (PMID: 38744981), it would be valuable to discuss structural similarities and differences and discuss how bicarbonate activation and activation by the chloroquine ligand identified by the other group might both be accommodated by this structure.

      Thank you for your valuable comment. We compared the structure presented by another group and added our discussion, as “During the revision of this manuscript, the structures of apo-GPR30-G<sub>q</sub> (PDB 8XOG) and the exogenous ligand Lys05-bound GPR30-G<sub>q</sub> (PDB 8XOF) were reported [42]. We compared our structure of GPR30 in the presence of bicarbonate with these structures. In the extracellular region, the position of TM5 in GPR30 in the presence of bicarbonate is similar to that in apo-GPR30. In contrast, the position of TM6 is shifted outward relative to that of apo-GPR30, resembling the conformation observed in Lys05-bound GPR30 (Figure 6A, B). Additionally, the position of ECL1 is also shifted outward compared to that of apo-GPR30 (Figure 6B). In the GPR30 structure in the presence of bicarbonate, ECL2 was modeled, suggesting differences in structural flexibility. These findings indicate that the structure of GPR30 in the presence of bicarbonate is different from both the apo structure and the Lys05-bound structure, demonstrating that the structure and the flexibility of the extracellular domain of GPR30 change depending on the type of ligand. Furthermore, focusing on the interaction with G<sub>q</sub>, the αN helix of G<sub>q</sub> is not rotated in the structure bound to Lys05, in contrast to the characteristic bending of the αN helix in our structure (Figure 6C, D). Although it is necessary to consider variations in experimental conditions, such as salt concentration, the differences in the G<sub>q</sub> binding modes suggest that the downstream signals may change in a ligand-dependent manner.” (lines 249-266).

      Reviewer #2 (Recommendations For The Authors):

      (1) It is highly recommended that the authors carefully go through the "insights into bicarbonate binding" section. The results of the new findings in this paper were blended in with the results from the previous work: the importance of E115, Q138, and H307 in the receptor-bicarbonate interaction was shown in the Nature Communications paper but the authors didn't make it clear, which added a little confusion.

      We emphasized this fact in the main text (lines 130-132).

      (2) It would be nice for the authors to add some content about the physiological concentration of HCO3 or refer more to their previous work about the rationale for selecting the bicarbonate dose in their functional assay.

      Thank you for your comment. The physiological concentration of bicarbonate is 22-26 mM in the extracellular fluid, including interstitial fluid and blood, and 10-12 mM in the intracellular fluid. The bicarbonate concentration changes under various physiological and pathological conditions: metabolic acidosis in chronic kidney disease causes a drop to 2-3 mM, and metabolic alkalosis induced by severe vomiting increases HCO<sub>3</sub><sup>-</sup> concentrations to more than 30 mM. Thus, our present and previous works clearly show that GPR30 is activated by physiological concentrations of bicarbonate, whether it is localized intracellularly or on the membrane, and that GPR30 can be deactivated or reactivated in various pathophysiological conditions. We added this in the discussion section (lines 267-278).

      (3) In Figure 3A, in the legend, the authors mentioned: "black dashed lines indicate hydrogen bonds". No hydrogen bond was noted in the figure.

      We completely revised Figure 3.

      (4) Figure 3B, it would be helpful for the authors to denote the meaning of the blue-white-red color coding in the legend.

      We removed the figure.

      (5) Supplemental Figure 3: since AF3 was released on May 3rd, it would be awesome in the revision version if the authors would update this to the AF3 model.

      The AF2 model has been replaced with the AF3 model (Figure 2–figure supplement 2A-C). The AF2 and AF3 models are almost identical, and both form incorrect disulfide bonds. This confirms the usefulness of the experimental structural determination in this study.

      (6) Supplemental Figure 4: it wasn't clear to me if the expression experiments were repeated multiple times or if any statistical analysis of the expression level was done in this study.

      We performed the expression experiment by western blotting once and did not perform statistical analyses. During revision, we performed repeated FACS analyses of HEK cells stably expressing N-terminally HA-tagged wild-type or mutant GPR30s to analyze their membrane and whole-cell expression (Figure 3–figure supplement 1-3). Using these stable cell lines, we performed calcium assays (Figure 3F, G and Figure 3–figure supplement 1-3).

      (7) Supplemental Figure 4: Also, is there a reason for the authors to compare the expression level of hGPR30 to the housekeeping gene NA-K-ATPase rather than the total loaded protein? Traditionally housekeeping genes have been used as loading controls to semiquantitatively compare the expression of target proteins in western blots. However, numerous recent studies show that housekeeping proteins can be altered due to experimental conditions, biological variability across tissues, or pathologies. A consensus has developed for using total protein as the internal control for loading. An editorial from the Journal of Biological Chemistry reporting on "Principles and Guidelines for Reporting Preclinical Research" from the workshop held in June 2014 by the NIH Director's Office, Nature Publishing Group, and Science stated, "It is typically better to normalize Western blots using total protein loading as the denominator".

      Thank you for your instructive comment. We performed western blotting with equal amounts of total protein loaded: 20 µg for whole-cell lysate and 1.5 µg for cell surface protein (Figure 3–figure supplement 3C-F).

      Reviewer #3 (Recommendations For The Authors):

      The claim about this disulfide should be removed unless the authors can provide mass spec evidence.

      Thank you for your crucial comments. Firstly, C130 is a residue of TM3, not ECL1, so our misprint has been corrected to C130<sup>3.25</sup>. C207<sup>ECL2</sup>, located at position 45.50, is the most conserved residue in ECL2, and it forms a disulfide bond with the cysteine at position 3.25 (PMID: 35113559). We now additionally cite this paper regarding the conservation of the C130<sup>3.25</sup>-C207<sup>ECL2</sup> bond (line 103). Indeed, disruption of this disulfide bond by the C207<sup>ECL2</sup>A mutation resulted in a marked reduction in receptor activity. In addition, the data set was re-analyzed to improve the local resolution of the extracellular region, which showed that the density of ECL2 is not noise (Figure 2–figure supplement 2). Based on the structural analysis data and the conservation, we are confident that the disulfide bond is present.

      The highly flexible extracellular region is greatly affected by experimental conditions and ligands, so we speculate that this is why ECL2 and the disulfide bond were not observed in other reported structures of GPR30. We have therefore added the following to the discussion: “In the GPR30 in the presence of bicarbonate, ECL2 was modelled, suggesting differences in structural flexibility.” (lines 256-257).

      The authors should remove the assignment of bicarbonate in the structure, and tone down the binding site of bicarbonate.

      We removed the bicarbonate model.

      Minor:

      (1) The potency of bicarbonate for GPR30 is in the mM range. Although the concentration of bicarbonate in the serum can reach mM range, how about its concentration in the tissues? Given its low potency, it may be not appropriate to claim GPR30 is a bicarbonate receptor at this point, but the authors can claim that GPR30 can be activated by or responds to bicarbonate.

      The physiological concentration of bicarbonate is 22-26 mM in the extracellular fluid, including interstitial fluid and blood, and 10-12 mM in the intracellular fluid. Therefore, GPR30 is activated by physiological concentrations of bicarbonate in the tissues. Also, the bicarbonate concentration changes in various physiological and pathological conditions – metabolic acidosis in chronic kidney disease causes a drop to 2-3 mM, and metabolic alkalosis induced by severe vomiting increases HCO<sub>3</sub><sup>-</sup> concentrations to more than 30 mM. Thus, our work clearly shows that GPR30 is activated by physiological concentrations of bicarbonate, whether it is localized intracellularly or on the membrane, and that GPR30 can be deactivated or reactivated in various pathophysiological conditions. For the reasons above, we claim that GPR30 is a bicarbonate receptor (lines 267-278).

      (2) The description that there is no consensus on a drug that targets GPR30 is not accurate, since lys05 has been reported as an agonist of GPR30 and their structure is published (PMID: 38744981). The published structures of GPR30 should be introduced in the paper.

      We added a discussion of the structural comparison with the Lys05-bound structure (Figure 6, lines 249-266).

      (3) BW numbers in Figure 4A should be shown.

      We added BW numbers in the figures of the mutational studies.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study presents a new Bayesian approach to estimate importation probabilities of malaria, combining epidemiological data, travel history, and genetic data through pairwise IBD estimates. Importation is an important factor challenging malaria elimination, especially in low-transmission settings. This paper focuses on Magude and Matutuine, two districts in southern Mozambique with very low malaria transmission. The results show isolation-by-distance in Mozambique, with genetic relatedness decreasing at distances larger than 100 km, no spatial correlation for distances between 10 and 100 km, and strong spatial correlation again at distances smaller than 10 km. They report high genetic relatedness between Matutuine and Inhambane, higher than between Matutuine and Magude. Inhambane is the main source of importation in Matutuine, accounting for 63.5% of imported cases. Magude, on the other hand, shows lower importation and travel rates than Matutuine, as it is a rural area with less mobility. Additionally, they report higher levels of importation and travel in the dry season, when transmission is lower. No association with importation was found for occupation, sex, and other factors. These data have practical implications for public health strategies aiming for malaria elimination, for example, testing and treating travelers from Matutuine in the dry season.

      Strengths:

      The strength of this study lies in the combination of different sources of data - epidemiological, travel, and genetic data - to estimate importation probabilities, and the statistical analyses.

      Weaknesses:

      The authors recognize the limitations related to sample size and the biases of travel reports.

      We appreciate the reviewer's assessment of and comments on the manuscript.

      Reviewer #2 (Public review):

      Summary:

      Based on a detailed dataset, the authors present a novel Bayesian approach to classify malaria cases as either imported or locally acquired.

      Strengths:

      The proposed Bayesian approach for case classification is simple, well justified, and allows the integration of parasite genomics, travel history, and epidemiological data. The work is well-written, very organized, and brings important contributions both to malaria control efforts in Mozambique and to the scientific community. Understanding the origin of cases is essential for designing more effective control measures and elimination strategies.

      Weakness:

      While the authors aim to classify cases as imported or locally acquired, the work lacks a quantification of the contribution of each case type to overall transmission.

      The method presented here classifies individual cases according to whether the infection occurred locally or was imported during a trip. By definition, it does not consider secondary infections after an importation event. Our next step is to conduct outbreak investigations to quantify the impact of importation events on overall transmission, but this activity goes beyond the scope of this manuscript. We clarify this in the discussion section.

      The Bayesian rationale is sound and well justified; however, the formulation appears to present an inconsistency that is replicated in both the main text and the Supplementary Material.

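      Thank you for pointing out the inconsistency in the final formula. The final formula indeed corresponds to P(I_A | G) rather than P(I_A), so it should read

      P(I_A | G) = (T_A ∙ PR_A ∙ R'_A) / (T_A ∙ PR_A ∙ R'_A + T_B ∙ PR_B ∙ R'_B)

      instead of

      P(I_A) = (T_A ∙ PR_A ∙ R'_A) / (T_A ∙ PR_A ∙ R'_A + T_B ∙ PR_B ∙ R'_B)
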
      We have now corrected this error in the new version of the manuscript.

      Reviewer #3 (Public review):

      The authors present an important approach to identify imported P. falciparum malaria cases, combining genetic and epidemiological/travel data. This tool has the potential to be expanded to other contexts. The data was analyzed using convincing methods, including a novel statistical model; although some recognized limitations can be improved. This study will be of interest to researchers in public health and infectious diseases.

      Strengths:

      The study has several strengths, mainly the development of a novel Bayesian model that integrates genomic, epidemiological, and travel data to estimate importation probabilities. The results showed insights into malaria transmission dynamics, particularly identifying importation sources and differences in importation rates in Mozambique. Finally, the relevance of the findings is to suggest interventions focusing on the traveler population to help efforts for malaria elimination.

      Weaknesses:

      The study also has some limitations. The sample collection was not representative of some provinces, and not all samples had sufficient metadata for risk factor analysis, which can also be affected by travel recall bias. Additionally, the authors used a proxy for transmission intensity and assumed some conditions for the genetic variable when calculating the importation probability for specific scenarios. The weaknesses were assessed by the authors.

      We acknowledge the limitations noted by the reviewer and plan to address them as follows. We will repeat the study with our data collected in 2023, which contain a good representation of all the provinces of Mozambique and for which completeness of the metadata was ensured by implementing a new protocol in January 2023. Regarding the proxy for transmission intensity, we will refine the model by integrating monthly estimates of malaria incidence (previously calibrated to address testing and reporting rates) from the DHIS2 data, also taking into account the date of the reported cases in the analysis.

      Reviewing Editor Comments:

      The reviewers have made specific suggestions that could improve the clarity and accuracy of this report.

      Reviewer #1 (Recommendations for the authors):

      (1) Abstract, lines 36, 37 and 38: "Spatial genetic structure and connectivity were assessed using microhaplotype-based genetic relatedness (identity-by-descent) from 1605 P. falciparum samples collected (...)", but only 540 samples were successfully sequenced, therefore used in spatial genetic structure and connectivity analysis.

      The 540 samples refer to those from Maputo province and are described in Fig. 1. The spatial and connectivity analyses also included the samples from the rest of the provinces from the multi-cluster sampling scheme. Sample sizes from these provinces are described in Suppl. Table 2, and together with the 540 samples from Maputo they make up the 1605 samples mentioned in the abstract. We specify this number in the caption of Sup. Fig. 4 and have now added it to Fig. 3.

      (2) In the Introduction, some epidemiological context about Magude and Matutuine could be added. It is only mentioned in the Discussion section (lines 265-269).

      We have added some context about both districts in the introduction now.

      (3) In the Discussion, lines 241-244, could the lack of structure mean no barriers for gene flow due to high mobility in short distances? Maybe it could only be resolved with a large number of samples.

      This could be an explanation (we mention it in the new version), although it is not something we can prove, at least not in this study.

      Reviewer #2 (Recommendations for the authors):

      The work is well written, very organized, and brings important contributions both to malaria control efforts in Mozambique and to the scientific community. Based on detailed datasets from Mozambique, the authors present a novel Bayesian approach to classify malaria cases as either imported or locally acquired. Understanding the origin of cases is essential for designing more effective control measures and elimination strategies. My review focuses on the Bayesian approach as well as on a few aspects of the presentation of results.

      The authors combine travel history, parasite genetic relatedness, and transmission intensity from different areas to compute the probability of infection occurring in the study area, given the P. falciparum genome. The Bayesian rationale is sound and well justified; however, the formulation appears to present an inconsistency that is replicated in both the main text and the Supplementary Material. According to Bayes' Rule:

      P(I_A |G) = (P(I_A) ∙ P(G|I_A)) / (P(G)),

      with

      P(I_A) = K ∙ T_A ∙ PR_A,

      P(G│I_A) = R'_A,

      and assuming

      P(I_A│G) + P(I_B│G) = 1,

      the expression,

      (T_A ∙ PR_A ∙ R'_A) / (T_A ∙ PR_A ∙ R'_A + T_B ∙ PR_B ∙ R'_B)

      appears to refer to P(I_A│G), not to P(I_A) (as indicated in the main text and Supplementary Material).

      P(I_A│G) + P(I_B│G) = (P(I_A) ∙ P(G|I_A) + P(I_B) ∙ P(G|I_B)) / P(G) = 1

      ⇒P(G) = P(I_A) ∙ P(G|I_A) + P(I_B) ∙ P(G|I_B)

      ⇒P(G) = K ∙ T_A ∙ PR_A ∙ R'_A + K ∙ T_B ∙ PR_B ∙ R'_B

      ⇒P(I_A│G) = (T_A ∙ PR_A ∙ R'_A) / (T_A ∙ PR_A ∙ R'_A + T_B ∙ PR_B ∙ R'_B)

      Please clarify this.

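      As mentioned in a previous comment, we acknowledge this point from the reviewer. The final formula indeed corresponds to P(I_A | G) rather than P(I_A), so it should read

      P(I_A | G) = (T_A ∙ PR_A ∙ R'_A) / (T_A ∙ PR_A ∙ R'_A + T_B ∙ PR_B ∙ R'_B)

      instead of

      P(I_A) = (T_A ∙ PR_A ∙ R'_A) / (T_A ∙ PR_A ∙ R'_A + T_B ∙ PR_B ∙ R'_B)
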
      We have now corrected this error in the new version of the manuscript and in the supplementary information.
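
      The reviewer's derivation above can be checked numerically with a minimal sketch. It assumes only the two-location form P(I_A | G) = (T_A ∙ PR_A ∙ R'_A) / (T_A ∙ PR_A ∙ R'_A + T_B ∙ PR_B ∙ R'_B) given in the review; the function name and all input values are hypothetical and chosen purely for illustration, not taken from the manuscript.

```python
def posterior_location_a(t_a, pr_a, r_a, t_b, pr_b, r_b, k=1.0):
    """Return P(I_A | G) for two candidate locations A and B.

    Prior:      P(I_X)     = K * T_X * PR_X
    Likelihood: P(G | I_X) = R'_X
    The normalising constant K cancels in the ratio, so its value is irrelevant.
    """
    weight_a = k * t_a * pr_a * r_a  # P(I_A) * P(G | I_A)
    weight_b = k * t_b * pr_b * r_b  # P(I_B) * P(G | I_B)
    evidence = weight_a + weight_b   # P(G), using P(I_A|G) + P(I_B|G) = 1
    if evidence == 0:
        raise ValueError("Both locations have zero weight; posterior is undefined.")
    return weight_a / evidence


# Hypothetical inputs (illustrative only).
p_a = posterior_location_a(t_a=0.9, pr_a=0.02, r_a=0.05,
                           t_b=0.1, pr_b=0.30, r_b=0.10)
# Swapping the roles of A and B gives the posterior for location B.
p_b = posterior_location_a(t_a=0.1, pr_a=0.30, r_a=0.10,
                           t_b=0.9, pr_b=0.02, r_b=0.05)
print(round(p_a, 4), round(p_b, 4))
```

      With these made-up inputs, the two posteriors sum to one, and changing K leaves them unchanged, which is why the final expression contains no K.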

      Additional comments:

      (1) Figure 3A has a scale that includes negative values, which is not reasonable for R.

      We agree that R estimates are not compatible with negative values. The intention of this scale was to show the overall mean R in the centre, in white, so that blue colours represented values below the average and red colours values above the average. Nevertheless, we have updated the figures according to your recommendations.

      (2) I suggest using a common scale from 0 to 0.12 (maximum values among panels) across panels A, C, and D, as well as in Sup Fig 3, to facilitate comparison.

      We updated the figures according to the recommendations.

      (3) The x-axis labels in Figure 3A and Supplementary Figure 2A are not aligned with the x-axis ticks.

      We updated the figures so that the alignment in the x-axis is clear.

      (4) Supplementary Figure 5 would be better presented if the data were divided into four separate panels.

      We have divided the figure into four separate panels.

      (6) Figure 5D is not referenced in the main text.

      We missed the mention, which is now fixed in the new version.

      (7) The authors state: "No significant differences in R were found comparing parasite samples from Magude and the rest of the districts." However, Supplementary Figure 3 shows statistically significant relatedness between parasites from Magude and Matutuine. Please clarify this.

      We have clarified this sentence, which was indeed confusing.

      Reviewer #3 (Recommendations for the authors):

      (1) Introduction: More background info about malaria in Mozambique would be appreciated.

      We included some contextualisation about malaria in Mozambique and our study districts.

      (2) Why were most of the samples collected from children? Is malaria most prevalent in that group? Information could be added in the introduction.

      Children are usually considered an appropriate sentinel group for malaria surveillance for several reasons. First, most malaria cases reported from symptomatic outpatient visits are in children, especially in areas with moderate to high burden. Second (and probably the cause of the first), their lower immunity, due to shorter exposure time and an immature immune system, provides a cleaner picture of the effects of malaria, since the body's response is less shaped by past exposures. Finally, as a vulnerable population, they deserve a stronger focus in surveillance systems. We added a comment in the introduction referring to them as a common sentinel group for surveillance.

      (3) Minor: Check spaces in the text (for example, line 333 and the start of the Discussion).

      Thank you for noticing; we fixed it in the new version.

      (4) Minor: In my case, the micro (u) symbol can be observed in Word, but not in PDF.

      One of the symbols produced an error; we hope that it displays correctly in the new version.

      (5) Were COI calculations with MOIRE performed across provinces and regions, or taking all samples as one population?

      We took all samples as one population. However, we validated that the same results (reaching equivalent numbers and the same conclusions) were obtained when the analysis was run across different populations (regions or provinces). We mention this in the manuscript now.

      (6) Have you tested lower values than 0.04 for PR in Maputo?

      This would not have had any impact on the classification. Only two individuals reported a trip to Maputo city (where we assumed PR=0.04), and neither was classified as imported. If lower values of PR had been assumed, their probabilities of importation would have been lower still, so we would still obtain no imported cases.

      (7) Map (Supplementary Figure 1): Please, improve the resolution (like in the zoom in) and add a scale and a compass rose.

      We improved the resolution of the map. We did not add a scale and a compass rose, but labelled the coordinates as longitude and latitude to clarify the scale and orientation of the map. We added this in the rest of the maps of the manuscript as well.

      (8) In this work, Pimp values were bimodal to 0 or 1, making the classification easy. I wonder in other scenarios, where Pimp values are more intermediate (0.4-0.6), is the threshold at 0.5 still useful? Is there another way, like having a confidence interval of Pimp, to ensure the final classification? A discussion on this topic may be appreciated.

      In this case, we would recommend doing probabilistic analyses, keeping the probability of being imported as the final outcome, and quantifying the importation rates from the weighted sum of probabilities across individuals. We added this clarification in the Methods section: “In case of obtaining a higher fraction of intermediate values (0.4-0.6), weighted sums of individual probabilities would be more appropriate to better quantify importation rates.”
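
      The difference between the two summaries can be illustrated with a short sketch. It assumes that the "weighted sum" means the sum of individual importation probabilities divided by the number of cases, and that a case counts as imported when its probability is at least 0.5; both conventions, the function names, and the probability values are illustrative assumptions rather than details from the manuscript.

```python
def importation_rate_threshold(probs, threshold=0.5):
    """Fraction of cases whose importation probability reaches the threshold."""
    return sum(1 for p in probs if p >= threshold) / len(probs)


def importation_rate_weighted(probs):
    """Expected fraction of imported cases: sum of probabilities over cases."""
    return sum(probs) / len(probs)


# Hypothetical probabilities (illustrative only).
bimodal = [0.01, 0.02, 0.98, 0.99]       # values close to 0 or 1
intermediate = [0.45, 0.45, 0.55, 0.45]  # values clustered around 0.5

for name, probs in (("bimodal", bimodal), ("intermediate", intermediate)):
    print(name,
          importation_rate_threshold(probs),
          round(importation_rate_weighted(probs), 3))
```

      For the bimodal values the two estimates agree. For the intermediate values the hard threshold counts one case in four as imported, while the probabilities themselves imply an expected rate close to one half, which is the kind of discrepancy the probabilistic summary avoids.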

      (9) Results: More details per panel, not as the whole figure (Figure 2B, Figure 3A, etc) in the manuscript would be appreciated.

      We appreciate the comment and have added more details.

      (10) Figure 3: Please, add a color legend in panel B (not only in the caption, but in the panel, such as in A, C, D).

      We added a color legend in panel B.

      (11) Do the authors recommend routine surveillance to detect importation in Mozambique, or are these results solid enough to propose strategies? How possible is it that importation rates vary in the future in the south? If so, how feasible is it to implement all this process (including the amplicon sequencing) routinely?

      We added the following text in the discussion: “While these results propose programmatic strategies for the two study districts, routine surveillance to detect importation in Mozambique would allow for identifying new strategies in other districts aiming for elimination, as well as monitoring changes in importation rates in Magude and Matutuine in the future. If scaling molecular surveillance is not feasible, travel reports could be integrated in the routine surveillance to extrapolate the case classification based on the results of this study.”

      (12) Which other proxies of transmission intensity could have been used?

      Better proxies of transmission intensity could be malaria incidence at the monthly level from national surveillance systems, or estimates of force of infection, for example from the use of molecular longitudinal data if available. We added this text in the discussion.

      (13) Can this strategy be applied to P. vivax-endemic areas outside Africa?

      This new method can also be applied to P. vivax-endemic areas outside Africa. Symptomatic P. vivax cases are not necessarily reflecting recent infections, so that travel reports might need to cover longer time periods, which does not require any essential adaptation to the method. We added this text to the discussion.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Colorectal cancer (CRC) is the third most common cancer globally and the second leading cause of cancer-related deaths. Colonoscopy and fecal immunochemical testing are among the early diagnostic tools that have significantly enhanced patient survival rates in CRC. Methylation dysregulation has been identified in the earliest stages of CRC, offering a promising avenue for screening, prediction, and diagnosis. The manuscript entitled "Early Diagnosis and Prognostic Prediction of Colorectal Cancer through Plasma Methylation Regions" by Zhu et al. presents a panel of genes with a methylation pattern derived from cfDNA (27 DMRs) that serves as a noninvasive detection method for early CRC diagnosis and prognosis.

      Strengths:

      The authors provided evidence that the 27 DMRs pattern worked well in predicting CRC distant metastasis, and the methylation score remarkably increased in stage III-IV.

      Weaknesses:

      The major concerns are the design of DMR screening, the relatively low sensitivity of this DMR pattern in detecting early-stage CRC, the limited size of the cohorts, and the lack of comparison with the traditional diagnosis test.

      We sincerely thank the reviewer for their thorough evaluation and constructive feedback on our manuscript. We are encouraged that the reviewer found our 27-DMR panel promising for predicting distant metastasis and for its performance in late-stage CRC. We have carefully considered the weaknesses pointed out and have made revisions to address these concerns, which we believe have significantly strengthened our paper.

      We agree with the reviewer that achieving high sensitivity for early-stage disease is the ultimate goal for any noninvasive screening test. Detecting the minute quantities of cfDNA shed from early-stage tumors is a well-recognized challenge in the field. Although the sensitivity of our current panel for early-stage CRC is modest, its core strengths lie in its capability to also detect advanced adenomas and its excellent performance in assessing CRC metastasis and prognosis. Furthermore, we have now added a direct comparative analysis of our 27-DMR panel against the most widely used clinical serum biomarker for CRC, carcinoembryonic antigen (CEA), using samples from the same patient cohorts. Our results demonstrate that the 27-DMR methylation score significantly outperforms CEA in diagnostic accuracy for early-stage CRC (64% vs. 18%) (Table s7). In the Discussion section, we have also acknowledged our limitations and suggested that future studies are warranted to combine the cfDNA methylation model with commonly used clinical markers, such as CEA and CA19-9, with the aim of improving the sensitivity for early diagnosis.

      We acknowledge the reviewer's concern regarding the cohort size and agree that validation in larger, prospective, multi-center cohorts is essential before this panel can be considered for clinical application. We have explicitly stated this as a limitation of our study in the Discussion section and have highlighted the need for future large-scale validation studies (Page 18, Lines 367-373). We once again thank the reviewer for their insightful comments, which have allowed us to substantially improve our manuscript. We hope that the revised version is now suitable for publication.

      Reviewer #2 (Public review):

      This work presents a 27-region DMR model for early diagnosis and prognostic prediction of colorectal cancer using plasma methylation markers. While this non-invasive diagnostic and prognostic tool could interest a broad readership, several critical issues require attention.

      Major Concerns:

      (1) Inconsistencies and clarity issues in data presentation

      (a) Sample size discrepancies

      The abstract mentions screening 119 CRC tissue samples, while Figure 1 shows 136 tissues. Please clarify if this represents 119 CRC and 17 normal samples.

      We sincerely thank the reviewer for this careful observation and for pointing out the inconsistency. We apologize for the error and the confusion it caused. Regarding Figure 1: The reviewer is correct. The number 136 in the original Figure 1 was an error. This was due to an inadvertent double-counting of the tumor samples that were used in the differential analysis against adjacent normal tissues. The actual number of tissue samples used in this analysis is 89. We have now corrected this value in the revised Figure 1.

      Regarding the Abstract: The 119 CRC tissue samples mentioned in the abstract represent the total number of unique tumor samples analyzed across all stages of our study. This number is composed of two cohorts: the initial 15 pairs of tissues used for preliminary screening, and the subsequent 89 tissue samples used for validation, totaling 119 samples. We have ensured all sample numbers are now consistent throughout the revised manuscript.

      The plasma sample numbers vary across sections: the abstract cites 161 samples, Figure 1 shows 116 samples, and the Supplementary Methods mentions 77 samples (13 Normal, 15 NAA, 12 AA, 37 CRC).

      We sincerely thank the reviewer for their meticulous review and for identifying these inconsistencies in the plasma sample numbers. We apologize for this oversight and the lack of clarity.

      Figure 1 & Supplementary Methods (77 samples): The number 116 in the original Figure 1 was a clerical error. The correct number is 77, which is the cohort used for our differential methylation analysis. This number is now consistent with the Supplementary Methods. This cohort is composed of 13 Normal, 15 NAA, 12 AA, and 37 CRC samples. The figure has been revised accordingly.

      Abstract (161 samples): The total of 161 plasma samples mentioned in the abstract is the sum of two distinct sample sets used for different stages of our analysis: the 77 samples (13 Normal, 15 NAA, 12 AA, 37 CRC) used for the differential analysis, and an additional 84 samples (33 Normal, 51 CRC), which served as the training set for the LASSO regression model. We have now clarified these distinctions in the text and ensured consistency across the abstract, figures, and methods sections.

      (b) Methodological inconsistencies

      The Supplementary Material reports 477 hypermethylated sites from TCGA data analysis (Δβ>0.20, FDR<0.05), but Figure 1 indicates 499 sites.

      The manuscript states that analyzing TCGA data across six cancer types identified 499 CRC-specific methylation sites, yet Figure 1 shows 477. Please also explain the rationale for selecting these specific cancer types from TCGA.

      We sincerely thank the reviewer for their sharp observation and for highlighting these inconsistencies. We apologize for this clerical error, which occurred when labeling the figure. The numbers 477 and 499 in Figure 1 were inadvertently swapped; the text in the Supplementary Material is correct. We have now corrected this error throughout the manuscript to ensure clarity and consistency. We deeply regret the confusion this has caused.

      Regarding the rationale for selecting the cancer types:

      The selection of colorectal, esophageal, gastric, lung, liver, and breast cancers was based on the following strategic criteria to ensure the stringent identification of CRC-specific markers. Firstly, esophageal, gastric, liver, and colorectal cancers all originate from the gastrointestinal tract and share developmental and functional similarities. Comparing CRC against these closely related cancers allowed us to filter out general GI-tract-related methylation patterns and isolate those that are truly unique to colorectal tissue. Secondly, we included lung and breast cancer as they are two of the most common non-GI malignancies worldwide with distinct tissue origins. This helps ensure our identified markers are not just pan-cancer methylation events but are specific to CRC, even when compared against highly prevalent cancers from different lineages. Finally, these six cancer types have some of the largest and most complete datasets available in the TCGA database, including high-quality methylation data. This provided a robust statistical foundation for a reliable cross-cancer comparison. We hope this explanation clarifies our methodology. Thank you again for your valuable feedback.
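
      The site-level criteria quoted by the reviewer above (Δβ>0.20, FDR<0.05) can be sketched as a simple filter. The sketch assumes that Δβ is the mean tumor β value minus the mean normal β value and that the FDR is obtained with the Benjamini-Hochberg procedure; the manuscript excerpt does not state which correction was used, so this is an assumption, and the site identifiers and numbers are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def hypermethylated_sites(sites, min_delta_beta=0.20, max_fdr=0.05):
    """Keep sites with delta-beta above the cutoff and adjusted p below max_fdr.

    Each site is a tuple: (site_id, mean_beta_tumor, mean_beta_normal, p_value).
    """
    qvalues = benjamini_hochberg([site[3] for site in sites])
    return [site[0] for site, q in zip(sites, qvalues)
            if (site[1] - site[2]) > min_delta_beta and q < max_fdr]


# Hypothetical sites (illustrative only).
example_sites = [
    ("cg_a", 0.80, 0.30, 0.001),   # large delta-beta, small p
    ("cg_b", 0.55, 0.45, 0.0005),  # small p but delta-beta below the cutoff
    ("cg_c", 0.70, 0.40, 0.20),    # large delta-beta but not significant
    ("cg_d", 0.65, 0.40, 0.01),    # passes both criteria
]
selected = hypermethylated_sites(example_sites)
print(selected)
```

      In this toy input, only the sites that meet both criteria are retained, which shows why the count of selected sites depends jointly on the effect-size cutoff and the multiple-testing correction.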

      "404 CRC-specific DMRs" mentioned in the main text while "404 MCBs" in Figure 1, the authors need to clarify if these terms are interchangeable or how MCBs are defined.

      We sincerely thank the reviewer for pointing out this important inconsistency in terminology. We apologize for the confusion this has caused and for the error in Figure 1. The two terms are closely related in our study. The final 404 markers are technically DMRs that were identified through an analysis of MCBs. To avoid confusion, we have decided to unify the terminology. The manuscript has now been revised to consistently use "DMRs", which is the most accurate final descriptor. The label in Figure 1 has been corrected accordingly.

      (2) Methodological documentation

      The Results section requires a more detailed description of marker identification procedures and justification of methodological choices.

      Figure 3 panels need reordering for sequential citation.

      We thank the reviewer for this valuable suggestion. We agree that the original Results section lacked sufficient detail regarding the marker identification procedures and the justification for our methodological choices. To address this, we have substantially rewritten the "Methylation markers selection" subsection. The revised section provides a clear, step-by-step narrative of our marker discovery and integrates the specific methodological details and statistical criteria. For instance, we now explicitly describe the three-pronged approach for the initial TCGA data mining and the specific criteria (Δβ, FDR, log2FC) for each, as well as the analysis methods, such as the Wilcoxon test and LASSO regression. We believe this detailed narrative now provides the necessary description and justification for our methodological choices directly within the results, significantly improving the clarity and logical flow of our manuscript. This revision can be found on Pages 9-11 (Lines 180-195 and 202-213). We hope these changes fully address the reviewer's concerns.

      We thank the reviewer for pointing out the citation order of the panels in Figure 3. This was a helpful suggestion for improving the clarity of our manuscript. We have now reordered the panels in Figure 3 to ensure they are cited sequentially within the text. These adjustments have been made in the "Development and validation of the CRC diagnosis model" subsection of the Results (Page 11, lines 224-230). We appreciate the reviewer's attention to detail.

      (3) Quality control and data transparency

      No quality control metrics are presented for the in-house sequencing data (e.g., sequencing quality, alignment rate, BS conversion rate, coverage, PCA plots for each cohort).

      The analysis code should be publicly available through GitHub or Zenodo.

      At a minimum, processed data should be made publicly accessible to ensure reproducibility.

      We sincerely thank the reviewer for their valuable and constructive feedback regarding quality control and data transparency. We fully agree that these elements are crucial for ensuring the robustness and reproducibility of our research. As the reviewer suggested, we have made all processed data and the key quality control metrics for each sample including sequencing quality scores, bisulfite (BS) conversion rates, and sequencing coverage publicly available to ensure the reproducibility of our findings. The analysis was performed using standard algorithms as detailed in the Methods section. While we are unable to host the code in a public repository at this time, all analysis scripts are available from the corresponding author upon reasonable request. The data has been deposited in the National Genomics Data Center (NGDC) and is accessible under the accession number OMIX009128. This information is now clearly stated in the "Data and Code Availability" section of the manuscript. We thank the reviewer again for pushing us to improve our manuscript in this critical aspect.

      Reviewer #3 (Public review):

      Summary:

      This article provides a model for early diagnosis and prognostic prediction of Colorectal Cancer and demonstrates its accuracy and usability. However, there are still some minor issues that need to be revised and paid attention to.

      Strengths:

      A large number of external datasets were used for verification, thus demonstrating robustness and accuracy. Meanwhile, various influencing factors of multiple samples were taken into account, providing usability.

      Weaknesses:

      There are notable language issues that hinder readability, as well as a lack of some key conclusions provided.

      We are very grateful to the reviewer for their positive assessment of our study and for the constructive feedback provided. We are particularly encouraged that the reviewer recognized the strengths of our work, especially the robustness demonstrated through extensive external validation and the practical usability of our model. Regarding the weaknesses, we have taken the comments very seriously and have thoroughly revised the manuscript. We sincerely apologize for the language issues that hindered readability in our initial submission. To address this, the entire manuscript has undergone a comprehensive round of professional language polishing and editing. We have carefully reviewed and revised the text to improve clarity, flow, and grammatical accuracy. Besides, we agree that the conclusions could be stated more explicitly. To rectify this, we have substantially revised the final paragraph of the Discussion and the Conclusion section (Page 14-18, lines 279-305, 319-334, 346-348, 358-360, 367-379). We now more clearly summarize the main findings of our study, emphasize the clinical significance and potential applications of our model, and provide clear take-home messages. We thank you again for your time and insightful comments, which have been invaluable in improving the quality of our paper. We hope the revised manuscript now meets the standards for publication.

      Reviewer #1 (Recommendations for the authors):

      Detailed comments are outlined below:

      (1) In this study, the authors have highlighted methylated cfDNA as a noninvasive approach for CRC early diagnosis. However, the small size of the cohorts for plasma screening, particularly the number of NAA and AA samples, may cause bias in the selection of DMRs. This bias may lead to inappropriate DMRs for early diagnosis. Furthermore, a similar issue applies to the training set, which had a high percentage of late-stage CRC and included no AA or NAA samples. This absence may be the key factor in screening changed methylated cfDNA that can predict the early stages of CRC.

      We are very grateful to the reviewer for this insightful methodological critique. We agree that cohort composition and sample size are critical factors in the development of robust biomarkers, and we appreciate the opportunity to clarify our study design and the interpretation of our results.

      We agree with the reviewer that the number of precancerous lesion samples (NAA and AA) in our initial plasma screening cohort was limited. This is a valid point. However, it is important to contextualize the role of this step within our overall multi-stage marker selection funnel. The markers evaluated in this plasma cohort were not discovered from this small sample set alone. They were the result of a rigorous pre-selection process based on large-scale public TCGA data and our own tissue-level sequencing. This robust, tissue-based validation ensured that only the most promising CRC-specific markers were advanced for plasma testing. Therefore, while the plasma cohort was modest in size, its purpose was to confirm the circulatory detectability of markers already known to have a strong tissue-of-origin signal, thereby mitigating the potential bias from a smaller discovery set.

      Our primary aim was to first build a model that could robustly and accurately identify a definitive cancer-specific methylation signal. By training the model on clear-cut invasive cancer cases versus healthy controls, we could isolate the most powerful and specific markers for established malignancy. Our working hypothesis was that these strong cancer-specific methylation patterns are initiated during the precursor stages and would therefore be detectable, albeit at lower levels, in precancerous lesions.  Unfortunately, the panel could only identify a limited proportion of precancerous lesions (48.4% in the NAA group and 52.2% in the AA group). We fully agree with the reviewer's sentiment that including a larger and more balanced set of precancerous lesions in future training cohorts could potentially optimize a model specifically for adenoma detection. We have now explicitly added this point to our Discussion section, highlighting it as an important direction for future research (Page 18, lines 367-373).

      (2) The sensitivities of the 27 DMRs in the external validation set (48.4%, 52.2%, and 66.7% for NAA, AA, and CRC 0-II, respectively) were much lower than those in previously published studies, such as the ColonES assay (DOI: 10.1016/j.eclinm.2022.101717) and the ColonSecure test (DOI: 10.1186/s12943-023-01866-z). The 27 DMRs from the layered screening process did not show superior performance in a small population of an external validation cohort. Therefore, it is unlikely that this DMR pattern will be applicable to the general population in the future.

      We sincerely thank the reviewer for their insightful comments and for providing a thorough comparison with the highly relevant ColonES and ColonSecure assays. This has given us an important opportunity to clarify the unique contributions and specific clinical applications of our 27-DMR panel.

      We acknowledge the reviewer's point that the sensitivities of our panel for precancerous lesions (NAA: 48.4%, AA: 52.2%), while substantial, are numerically lower than those reported by the excellent ColonES assay (AA: 79.0%). However, it is important to clarify that while the ColonES and ColonSecure tests are outstanding benchmarks designed primarily for early detection and screening, the primary objective and contribution of our study were slightly different. Our model demonstrated an exceptional ability to predict distant metastasis with an AUC of 0.955 and a strong capacity for predicting overall prognosis with an AUC of 0.867. Our goal was to develop a multi-functional, biologically-rooted biomarker panel that not only contributes to early detection but, more importantly, provides crucial information for post-diagnosis patient management, including staging, risk stratification, and prognostication, from a single preoperative sample. We believe this ability to preoperatively identify high-risk patients who may require more aggressive treatment or intensive surveillance is the key contribution of our work. It provides a distinct clinical utility that complements, rather than directly competes with, pure screening assays.

      We agree with the reviewer that our external validation was performed on a limited cohort, and we have acknowledged this as a limitation in our Discussion section. However, the purpose of this validation was to provide a proof-of-concept for the panel's performance across its multiple functions. The promising and exceptionally high-performing results in the prognostic domain strongly warrant further validation in larger, prospective, multi-center cohorts.

      (3) The 27 DMRs pattern worked well in predicting CRC distant metastasis, and the methylation score remarkably increased in stage III-IV. In contrast, the increase of AA and 0-II groups was very mild in the validation cohort. This observation raises concerns regarding the study design, particularly in the context of the layered screening process and sample assigning.

      We sincerely thank the reviewer for this insightful and critical comment. We agree with the reviewer's observation that the methylation score increased more remarkably in late-stage (III-IV) CRC compared to the milder increase in adenoma (AA) and early-stage (0-II) CRC in the validation cohort. However, the observed pattern is biologically plausible and consistent with the nature of colorectal cancer progression. Carcinogenesis is a multi-step process involving the gradual accumulation of genetic and epigenetic alterations. The methylation changes we identified are likely associated with tumor progression and metastasis. Therefore, it is expected that advanced, metastatic cancers (Stage III-IV), which have undergone significant biological changes, would exhibit a much stronger and more robust methylation signal compared to pre-cancerous lesions (adenomas) or early-stage, non-metastatic cancers (Stage 0-II). The "mild" increase in early stages reflects the initial, more subtle epigenetic alterations, while the "remarkable" increase in late stages reflects the extensive changes required for invasion and metastasis. We believe this graduated increase actually strengthens the validity of our methylation signature, as it mirrors the underlying biological progression of the disease. We hope this response and the corresponding revisions address the reviewer's comments.

      (4) The authors did not provide the 27 DMRs prediction efficacy comparison with other noninvasive CRC assays, like a CEA and a FIT test.

      Thank you for this valuable suggestion. We agree that comparing our model with established non-invasive assays is crucial for demonstrating its clinical potential. Following your advice, we have now included a direct comparison of the diagnostic performance between our model and the traditional tumor marker, carcinoembryonic antigen (CEA), using the external validation cohort. The results show that our model has a significantly higher sensitivity for detecting early-stage colorectal cancer and adenomas compared to CEA. This detailed comparison has been added as Table s7 in the supplementary materials, and the corresponding description has been incorporated into the Results section of our manuscript (Page 12, lines 234-236). Regarding the Fecal Immunochemical Test (FIT), we unfortunately could not perform a direct statistical comparison because very few individuals in our cohort had undergone FIT. A comparison based on such a small sample size would lack statistical power and might not yield meaningful conclusions. We have acknowledged this as a limitation of our study in the Discussion section. We believe these additions and clarifications have substantially strengthened our manuscript. Thank you again for your constructive feedback.

      (5) The authors did not explicitly describe how they assigned the plasma samples to the distinct sets, nor did they specify the criteria for the plasma screen set, training set, and validation set. The detailed information for the patient grouping should be listed.

      Response: Thank you for this essential feedback. We agree that a transparent and detailed description of the sample allocation process is crucial for the manuscript. We apologize for the previous lack of clarity and have now revised the Methods section to address this. Our patient cohorts were assigned to the screening, training, and validation sets based on a chronological splitting strategy. Specifically, samples were allocated consecutively based on the date of collection. This approach was chosen to minimize selection bias and to provide a more realistic, forward-looking assessment of the model's performance, simulating a prospective validation scenario. The screening set comprised 89 tissue samples and 77 plasma samples collected between June and December 2020. The primary purpose of this set was the initial discovery and screening of potential methylation markers. The training set and validation set included 165 plasma samples collected from December 2020 to July 2022. The external validation cohort comprised 166 plasma samples collected from July 2022 to December 2022. The subsection titled "Study design and samples" within the Methods section of the revised manuscript now contains all of this detailed information (Page 6, lines 116-133). We believe this detailed explanation now makes our study design clear and transparent. Thank you again for helping us improve our manuscript.
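
      The chronological allocation described above can be illustrated with a minimal sketch. The cutoff days, sample identifiers, and cohort labels below are hypothetical: the response states only the months that bound each collection window, so the exact boundaries used in the study are not known and are assumed here for illustration.

```python
from datetime import date

# Hypothetical cutoffs: the response gives months only, so the exact
# boundary days used in the study are an assumption of this sketch.
SCREENING_END = date(2020, 12, 1)
TRAINING_END = date(2022, 7, 1)


def assign_cohort(collected: date) -> str:
    """Assign a sample to a cohort using only its collection date."""
    if collected < SCREENING_END:
        return "screening"
    if collected < TRAINING_END:
        return "training_validation"
    return "external_validation"


# Hypothetical sample identifiers and collection dates.
samples = {
    "P001": date(2020, 6, 15),
    "P002": date(2021, 3, 2),
    "P003": date(2022, 9, 30),
}

# Allocation depends on the date alone, never on diagnosis or outcome.
cohorts = {sample_id: assign_cohort(d) for sample_id, d in samples.items()}
```

      Because the assignment uses the collection date and nothing else, later samples cannot influence marker discovery, which is the property that lets a date-based split approximate prospective validation.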

      Reviewer #2 (Recommendations for the authors):

      The manuscript requires significant language editing to improve clarity and readability. We recommend that the authors seek professional editing services for revision.

      Thank you for your constructive comments on the language of our manuscript. We apologize for any lack of clarity in the previous version. To address this, we have performed a thorough revision of the manuscript. The text has been carefully reviewed and edited by a native English-speaking colleague who is an expert in our research field. We have focused on correcting all grammatical errors, improving sentence structure, and refining the phrasing throughout the document to enhance readability. We are confident that these extensive revisions have significantly improved the clarity of the manuscript. We hope you will find the current version much easier to read and understand.

      Reviewer #3 (Recommendations for the authors):

      (1) However, I think the abstract part of the article is too detailed and should be more concise and shortened. It is not necessary to show detailed values but to summarize the results.

      Thank you for this valuable suggestion. We agree that the previous version of the abstract was overly detailed and that a more concise summary would be more effective for the reader. Following your advice, we have substantially revised the abstract. We have removed the specific numerical values (such as detailed statistics) and have instead focused on summarizing the key findings and their broader implications (Page 3, lines 54-60, 64-66, 70-72). The revised abstract is now shorter and provides a clearer, high-level overview of our study's background, methods, main results, and conclusions. We believe these changes have significantly improved its readability and impact. We hope you will find the current version more appropriate.

      (2) Figure 4, the color in the legend and plot are not the same, and should be revised.

      Thank you for your careful attention to detail and for pointing out the color inconsistency in Figure 4. We apologize for this oversight. We have now corrected the figure as you suggested, ensuring that the colors in the legend perfectly match those in the plot. The revised Figure 4 has been updated in the manuscript. We appreciate your help in improving the quality of our figures.

      (3) Please pay attention to the article format, such as the consistency of fonts and punctuation marks. (For example, Lines 75 and Line 230).

      Thank you for your meticulous review and for pointing out the inconsistencies in our manuscript's formatting. We sincerely apologize for these oversights and any inconvenience they may have caused. Following your feedback, we have carefully corrected the specific issues you highlighted. Furthermore, we have conducted a thorough proofread of the entire manuscript to ensure consistency in all fonts, punctuation marks, and overall adherence to the journal's formatting guidelines. We appreciate your help in improving the presentation and professionalism of our paper.

    1. Author response:

      (1) General Statements

      We thank the Reviewers for a fair review of our work and helpful suggestions. We have significantly revised the manuscript in response to these suggestions. We provide a point-by-point response to the Reviewers below but wanted to highlight in our response a recurring concern related to the strong cell cycle arrest observed upon the acute FAM53C knock-down being different than the limited phenotypes in other contexts, including the knockout mice and DepMap data.

      First, we now show that we can recapitulate the strong G1 arrest resulting from the FAM53C knock-down using two independent siRNAs in RPE-1 cells, supporting the specificity of the effects.

      Second, the G1 arrest that results from the FAM53C knock-down is also observed in cells with inactive p53, suggesting it is not due to a non-specific stress response due to “toxic” siRNAs. In addition, the arrest is dependent on RB, which fits with the genetic and biochemical data placing FAM53C upstream of RB, further supporting a specific phenotype.

      Third, we have performed experiments in other human cells, including cancer cell lines. As would be expected for cancer cells, the G1 arrest is less pronounced but is still significant, indicating that the G1 arrest is not unique to RPE-1 cells.

      Fourth, it is not unexpected that compensatory mechanisms would be activated upon loss of FAM53C during development or in cancer – which may explain the lack of phenotypes in vivo or upon long-term knockout. This has been true for many cell cycle regulators, either because of compensation by other family members that have overlapping functions, or by a larger scale rewiring of signaling pathways. 

      (2) Point-by-point description of the revisions

      Reviewer #1 (Evidence, reproducibility and clarity): 

      Summary: 

      Taylar Hammond and colleagues identified new regulators of the G1/S transition of the cell cycle.

      They did so by screening publicly available data from the Cancer Dependency Map, and identified FAM53C as a positive regulator of the G1/S transition. Using biochemical assays they then show that FAM53C interacts with the DYRK1A kinase to inhibit its function. DYRK1A in turn is known to induce degradation of cyclin D, leading the authors to propose a model in which DYRK1A-dependent cyclin D degradation is inhibited by FAM53C to permit S-phase entry. Finally the authors assess the effect of FAM53C deletion in a cortical organoid model, and in Fam53c knockout mice. Whereas proliferation of the organoids is indeed inhibited, mice show virtually no phenotype.

      Major comments: 

      The authors show convincing evidence that FAM53C loss can reduce S-phase entry in cell cultures, and that it can bind to DYRK1A. However, FAM53C has multiple other binding partners and I am not entirely convinced that negative regulation of DYRK1A is the predominant mechanism to explain its effects on S-phase entry. Some of the claims that are made based on the biochemical assays, and on the physiological effects of FAM53C are overstated. In addition, some choices made in methodology and data representation need further attention.

      (1) The authors do note that P21 levels increase upon FAM53C knockdown. They show convincing evidence that this is not a P53-dependent response. But the claim that "p21 upregulation alone cannot explain the G1 arrest in FAM53C-deficient cells" (lines 138-139) is misleading. A p53-independent p21 response could still be highly relevant. The authors could test if FAM53C knockdown inhibits proliferation after p21 knockdown or p21 deletion in RPE1 cells.

      The Reviewer raises a great point. Our initial statement needed to be clarified and also needed more experimental support. We have performed experiments where we knocked down FAM53C and p21 individually, as well as in combination, in RPE-1 cells. These experiments show that p21 knock-down is not sufficient to negate the cell cycle arrest resulting from the FAM53C knockdown in RPE-1 cells (Figure 4B,C and Figure S4C,D).

      We have now extended these experiments to conditions where we inhibited DYRK1A, and we also compared these data to experiments in p53-null RPE-1 cells. Altogether, these experiments point to activation of p53 downstream of DYRK1A activation upon FAM53C knock-down, and indicate that p21 is not the only critical p53 target in the cell cycle arrest observed in FAM53C knock-down cells (Figure 4 and Figure S4).

      (2) The authors do not convincingly show that FAM53C acts as a DYRK1A inhibitor in cells. Figures 4B+C and S4B+C show extremely faint P-CycD1 bands, and tiny differences in ratios. The P values are hovering around 0.05, so n=3 is clearly underpowered here. Total CycD1 levels also correlate with FAM53C levels, which seems to affect the ratios more than the tiny pCycD1 bands. Why is there still a pCycD1 band visible in 4B in the GFP + BTZ + DYRK1Ai condition? And if I look at the data points I honestly don't understand how the authors can conclude from S4C that FAM53C knockdown causes (DYRK1A-dependent) increases in pCycD1 (relative to total CycD1). In figure 5C, no blot scans are even shown, and again the differences look tiny. So the authors should either find a way to make these assays more robust, or alter their claims appropriately.

      We appreciate these comments from the Reviewer and have significantly revised the manuscript to address them.

      The analysis of Cyclin D phosphorylation and stability is complicated by the upregulation of p21 upon FAM53C knock-down, in particular because p21 can be part of Cyclin D complexes, which may affect its protein levels in cells (as was nicely shown in a previous study from the lab of Tobias Meyer – Chen et al., Mol Cell, 2013). Instead of focusing on Cyclin D levels and stability, we refocused the manuscript on RB and p53 downstream of FAM53C loss.

      We removed the previous panel 4B from the revised manuscript. For panels 4E and S4B (now panels S3J and S3K), we used a true “immunoassay” (as indicated in the legend – not an immunoblot), which is much more quantitative and avoids error-prone steps in standard immunoblots (“Western blots”). Briefly, this system was developed by ProteinSimple. It uses capillary transfer of proteins and ELISA-like quantification with up to 6 logs of dynamic range (see their web site https://www.proteinsimple.com/wes.html). The “bands” we show are just a representation of the luminescence signals in capillaries. We made sure to further clarify the figure legends in the revised manuscript.

      The representative Western blot images for 5C-D (now 5F-G) in the original submission are shown in Figure 5E; we apologize if this was not clear. The differences are small, which we acknowledge in the revised manuscript. Note that several factors can affect Cyclin D levels in cells, including the growth rate and the stage of the cell cycle. Our FACS analysis shows that normal organoids have ~63% of cells in G1 and ~13% in S phase; the overall lower proportion of S-phase cells in organoids may make the immunoblot difference appear smaller, with fewer cycling cells resulting in decreased Cyclin D phosphorylation.

      Nevertheless, the Reviewer brings up a good point, and comments from this Reviewer and the others made us rethink how best to interpret our results. As discussed above, we carefully re-read the Meyer paper and think that FAM53C’s role and DYRK1A activity in cells may be understood when considering levels of both CycD and p21 at the same time in a continuum. While our genetic and biochemical data support a role for FAM53C in DYRK1A inhibition, it is likely that the regulation of cell cycle progression by FAM53C is not exclusively due to this inhibition. As discussed above and below, we noted an upregulation of p21 upon FAM53C knock-down, and activation of p53 and its targets likely contributes significantly to the phenotypes observed. We added new experiments to support this more complex model (Figure 4 and Figure S4, with new model in S4L).

      (3) The experiments to test if DYRK1A inhibition could rescue the G1 arrest observed upon FAM53C knockdown are not entirely convincing either. It would be much more convincing if they also perform cell counting experiments as they have done in Figures 1F and 1G, to complement the flow cytometry assays. I suggest that the authors do these cell counting experiments in RPE1 +/- P53 cells as well as HCT116 cells. In addition, did the authors test if P21 is induced by DYRK1Ai in HCT116 cells? 

      We repeated the experiments with the DYRK1A inhibitor and counted the cells. In p53-null RPE1 cells, we found that cell numbers do not increase in these conditions where we had observed a cell cycle re-entry (Fig. 4E), which was accompanied by apoptotic cell death (Fig. S4I). Thus, cells re-enter the cell cycle but die as they progress through S-phase and G2/M. We note that inhibition of DYRK1A has been shown to decrease expression of G2/M regulators (PMID: 38839871), which may contribute to the inability of cells treated with DYRK1Ai to divide. Because our data in RPE-1 cells showed that p21 knock-down was not sufficient to allow the FAM53C knock-down cells to re-enter the cell cycle, we did not further analyze p21 in HCT-116 cells.

      (4) The data in Figure 5C and 5D are identical, although they are supposed to represent either pCycD1 ratios or p21 levels. This is a problem because at least one of the two cannot be true. Please provide the proper data and show (representative) images of both data types.

      We apologize for these duplicated panels in the original submission. We now replaced the wrong panel with the correct data (Fig. 5F,G). 

      (5) Line 246: "Fam53c knockout mice display developmental and behavioral defects." I don't agree with this claim. The mutant mice are born at almost the expected Mendelian ratios, and body weight development is not consistently altered. But more importantly, no differences in adult survival or microscopic pathology were seen. The authors put strong emphasis on the IMPC behavioral analysis, but they should be more cautious. The IMPC mouse cohorts are tested for many other phenotypes related to behavior and neurological symptoms and apparently none of these other traits were changed in the IMPC Fam53c-/- cohort. Thus, the decreased exploration in a new environment could very well be a chance finding. The authors need to take away claims about developmental and behavioral defects from the abstract, results and discussion sections; the data are just too weak to justify this.

      We agree with the Reviewer that, although we observed significant p-values, this original statement may not be appropriate in the biological sense. We made sure in the revised manuscript to carefully present these data.

      Minor comments: 

      (6) Can the authors provide a rationale for each of the proteins they chose to generate the list of the 38 proteins in the DepMap analysis? I looked at the list and it seems to me that they do not all have described functions in the G1/S transition. The analysis may thus be biased. 

      To address this point, we updated Table S1 (2nd tab) to provide a better rationale for the 38 factors chosen. Our focus was on the canonical RB pathway and we included RB binding proteins whose function had suggested they may also be playing a role in the G1/S transition. We do agree that there is some bias in this selection (e.g., there are more RB binding factors described) but we hope the Reviewer will agree with us that this list and the subsequent analysis identified expected factors, including FAM53C. Future studies using this approach and others will certainly identify new regulators of cell cycle progression.

      (7) Figure 1B is confusing to me. Are these just some (arbitrarily) chosen examples? Consider leaving this heatmap out altogether, or explain in more detail.

      We agree with the Reviewer that this panel was not necessarily useful and possibly in the wrong place, and we removed it from the manuscript. We replaced it with a cartoon of top hits in the screen.

      (8) The y-axes in Figures 2C, 2D, 2E, and 4D are misleading because they do not start at 0. Please let the axis start at 0, or make axis breaks. 

      We re-graphed these panels.

      (9) Line 229: " Consequences ... brain development." This subheader is misleading, because the in vitro cortical organoid system is a rather simplistic model for brain development, and far away from physiological brain development. Please alter the header. 

      We changed the header to “Consequences of FAM53C inactivation in human cortical organoids in culture”.

      (10) Figure S5F: the gating strategy is not clear to me. In particular, how do the authors know the difference between subG1 and G1 DAPI signals? Do they interpret the subG1 as apoptotic cells? If yes, why are there so many? Are the culturing or harvesting conditions of these organoids suboptimal? Perhaps the authors could consider doing IF stainings on EdU or BrdU on paraffin sections of organoids to obtain cleaner data?

      Thank you for your feedback. The subG1 population in the original Figure S5F represents cells that died during the dissociation step of the organoids for FACS analysis. To address this point, we performed live/dead staining to exclude dead cells and provide clearer data. We refined the gating strategy for better clarity in the new S5F panel.

      (11) Figure S6A; the labeling seems incorrect. I would think that red is heterozygous here, and grey mutant. 

      We fixed this mistake, thank you. 

      Reviewer #1 (Significance): 

      The finding that the poorly studied gene FAM53C controls the G1/S transition in cell lines is novel and interesting for the cell cycle field. However, the lack of phenotypes in Fam53c-/- mice makes this finding less interesting for a broader audience. Furthermore, the mechanisms are incompletely dissected. The importance of a p53-independent induction of p21 is not ruled out. And while the direct inhibitory interaction between FAM53C and DYRK1A is convincing (and also reported by others; PMID: 37802655), the authors do not (yet) convincingly show that DYRK1A inhibition can rescue a cell proliferation defect in FAM53C-deficient cells.

      Altogether, this study can be of interest to basic researchers in the cell cycle field. 

      I am a cell biologist studying cell cycle fate decisions, and adaptation of cancer cells & stem cells to (drug-induced) stress. My technical expertise aligns well with the work presented throughout this paper, although I am not familiar with biolayer interferometry. 

      Reviewer #2 (Evidence, reproducibility and clarity): 

      Summary 

      In this study Hammond et al. investigated the role of Dual-specificity Tyrosine Phosphorylation regulated Kinase 1A (DYRK1) in G1/S transition. By exploiting the Dependency Map portal, they identified a previously unexplored protein, FAM53C, as a potential regulator of G1/S transition. Using RNAi, they confirmed that depletion of FAM53C suppressed proliferation of human RPE1 cells and that this phenotype was dependent on the presence of the RB protein. In addition, they noted increased levels of CDKN1A transcript and p21 protein that could explain the G1 arrest of FAM53C-depleted cells but, surprisingly, they did not observe activation of other p53 target genes. Proteomic analysis identified DYRK1 as one of the main interactors of FAM53C and the interaction was confirmed in vitro. Further, they showed that purified FAM53C blocked the ability of DYRK1 to phosphorylate cyclin D in vitro although the activity of DYRK1 was likely not inhibited (judging from the modification of FAM53C itself). Instead, it seems more likely that FAM53C competes with cyclin D in this assay. The authors claim that the G1 arrest caused by depletion of FAM53C was rescued by inhibition of DYRK1 but this was true only in cells lacking functional p53. This is quite confusing as DYRK1 inhibition reduced the fraction of G1 cells in p53 wild type cells as well as in p53 knock-outs, suggesting that FAM53C may not be required for regulation of DYRK1 function. Instead of focusing on the impact of FAM53C on cell cycle progression, the authors moved towards investigating its potential (and perhaps more complex) roles in differentiation of IPSCs into cortical organoids and in mice. They observed a lower level of proliferating cells in the organoids but whether that reflects an increased activity of DYRK1 or is just an off-target effect of the genetic manipulation remains unclear. Even less clear is the phenotype in FAM53C knock-out mice. The authors did not observe any significant changes in survival or in organ development but they noted some behavioral differences. Whether and how these are connected to the rate of cellular proliferation was not explored. In summary, the study identified a previously unknown role of FAM53C in proliferation but failed to explain the mechanism and its physiological relevance at the level of tissues and organism. Although some of the data might be of interest, in its current form the data is too preliminary to justify publication.

      Major points 

      (1) The whole study is based on one siRNA to FAM53C and its specificity was not validated. The level of knock-down was shown only in the first figure and not in the other experiments. The observed phenotypes in the cell cycle progression may be affected by variable knock-down efficiency and/or potential off target effects.

      We thank the Reviewer for raising this important point. First, we need to clarify that our experiments were performed with a pool of siRNAs (not one siRNA). Second, commercial antibodies against FAM53C are not of the best quality and it has been challenging to detect FAM53C using these antibodies in our hands – the results are often variable. In addition, to better address the Reviewer’s point and control for the phenotypes we have observed, we performed two additional series of experiments: first, we have confirmed G1 arrest in RPE-1 cells with individual siRNAs, providing more confidence for the specificity of this arrest (Fig. S1B); second, we have new data indicating that other cell lines arrest in G1 upon FAM53C knock-down (Fig. S1E,F and Fig. 4F).

      (2) Experiments focusing on the cell cycle progression were done in a single cell line RPE1 that showed a strong sensitivity to FAM53C depletion. In contrast, phenotypes in IPSCs and in mice were only mild suggesting that there might be large differences across various cell types in the expression and function of FAM53C. Therefore, it is important to reproduce the observations in other cell types. 

      As mentioned above, we have new data indicating that other cell lines arrest in G1 upon FAM53C knock-down (three cancer cell lines) (Fig. S1E,F and Fig. 4F).

      (3) Authors state that FAM53C is a direct inhibitor of DYRK1A kinase activity (Line 203), however this model is not supported by the data in Fig 4A. FAM53C seems to be a good substrate of DYRK1 even at high concentrations when phosphorylation of cyclin D is reduced. It rather suggests that DYRK1 is not inhibited by FAM53C but perhaps FAM53C competes with cyclin D. Further, authors should address if the phosphorylation of cyclin D is responsible for the observed cell cycle phenotype. Is this Cyclin D-Thr286 phosphorylation, or are there other sites involved?

      We revised the text of the manuscript to include the possibility that FAM53C could act as a competitive substrate and/or an inhibitor.

      We removed most of the Cyclin D phosphorylation/stability data from the revised manuscript. As the Reviewers pointed out, some of these data were statistically significant but the biological effects were small. As discussed above in our response to Reviewer #1, the analysis of Cyclin D phosphorylation and stability is complicated by the upregulation of p21 upon FAM53C knockdown, in particular because p21 can be part of Cyclin D complexes, which may affect its protein levels in cells (as was nicely shown in a previous study from the lab of Tobias Meyer – Chen et al., Mol Cell, 2013). Instead of focusing on Cyclin D levels and stability, we refocused the manuscript on RB and p53 downstream of FAM53C loss.

      We note, however, that we used specific Thr286 phospho-antibodies, which have been used extensively in the field. Our data in Figure 1 with palbociclib place FAM53C upstream of Cyclin D/CDK4,6. We performed Cyclin D overexpression experiments but RPE-1 cells did not tolerate high expression of Cyclin D1 (T286A mutant) and we have not been able to conduct more ‘genetic’ studies. 

      (4) At many places, information on statistical tests is missing and SDs are not shown in the plots. For instance, what statistics was used in Fig 4C? Impact of FAM53C on cyclin D phosphorylation does not seem to be significant. In the same experiment, does DYRK1 inhibitor prevent modification of cyclin D? 

      As discussed above, we removed some of these data and re-focused the manuscript on p53-p21 as a second pathway activated by loss of FAM53C.

      (5) Validation of SM13797 compound in terms of specificity to DYRK1 was not performed. 

      This is an important point. We had cited an abstract from the company (Biosplice) but we agree that providing data is critical. We have now revised the manuscript with a new analysis of the compound’s specificity using kinase assays. These data are shown in Fig. S3F-H.

      (6) A fraction of cells in G1 is a very easy readout but it does not measure progression through the G1 phase. Extension of the S phase or G2 delay would indirectly also result in reduction of the G1 fraction. Instead, authors could measure the dynamics of entry to S phase in cells released from a G1 block or from mitotic shake off. 

      The Reviewer made a good point. As discussed in our response to Reviewer #1, with p53-null RPE-1 cells, we found that cell numbers do not increase in these conditions where we had observed a cell cycle re-entry (Fig. 4E), which was accompanied by apoptotic cell death (Fig. S4I). Thus, cells re-enter the cell cycle but die as they progress through S-phase and G2/M. We note that inhibition of DYRK1A has been shown to decrease expression of G2/M regulators (PMID: 38839871), which may contribute to the inability of cells treated with DYRK1Ai to divide.

      Because our data in RPE-1 cells showed that p21 knock-down was not sufficient to allow the FAM53C knock-down cells to re-enter the cell cycle, we did not further analyze p21 in HCT-116 cells. These data indicate that G1 entry by flow cytometry will not always translate into proliferation.

      Other points:

      (7) Fig. 2C, 2D, 2E graphs should begin with 0 

      We remade these graphs.

      (8) Fig. 5D shows that the difference in p21 levels is not significant in FAM53C-KO cells but difference is mentioned in the text. 

      We replaced the panel with the correct one; we apologize for this error.

      (9) Fig. 6D comparison of datasets of extremely different sizes does not seem to be appropriate

      We agree and revised the text. We hope that the Reviewer will agree with us that it is worth showing these data, which are clearly preliminary but provide evidence of a possible role for FAM53C in the brain.

      (10) Could there be alternative splicing in mice generating a partially functional protein without exon 4? Did authors confirm that the animal model does not express FAM53C? 

      We performed RNA sequencing of mouse embryonic fibroblasts derived from control and mutant mice. We clearly identified fewer reads in exon 4 in the knockout cells, and no other obvious change in the transcript (data not shown). However, immunoblot with mouse cells for FAM53C never worked well in our hands. We made sure to add this caveat to the revised manuscript.

      Reviewer #2 (Significance): 

      Main problem of this study is that the advanced experimental models in IPSCs and mice did not confirm the observations in the cell lines and thus the whole manuscript does not hold together. Although I acknowledge the effort the authors invested in these experiments, the data do not contribute to the main conclusion of the paper that FAM53C/DYRK1 regulates G1/S transition. 

      Reviewer #3 (Evidence, reproducibility and clarity):

      This paper identifies FAM53C as a novel regulator of cell cycle progression, particularly at the G1/S transition, by inhibiting DYRK1A. Using data from the Cancer Dependency Map, the authors suggest that FAM53C acts upstream of the Cyclin D-CDK4/6-RB axis by inhibiting DYRK1A.  Specifically, their experiments suggest that FAM53C Knockdown induces G1 arrest in cells, reducing proliferation without triggering apoptosis. DYRK1A Inhibition rescues G1 arrest in P53KO cells, suggesting FAM53C normally suppresses DYRK1A activity. Mass Spectrometry and biochemical assays confirm that FAM53C directly interacts with and inhibits DYRK1A. FAM53C Knockout in Human Cortical Organoids and Mice leads to cell cycle defects, growth impairments, and behavioral changes, reinforcing its biological importance. 

      Strength of the paper: 

      The study introduces a novel cell cycle control signalling module upstream of CDK4/6 in G1/S regulation which could have significant impact. The identification of FAM53C using a depmap correlation analysis is a nice example of the power of this dataset. The experiments are carried out mostly in a convincing manner and support the conclusions of the manuscript. 

      Critique: 

      (1) The experiments rely heavily on siRNA transfections without the appropriate controls. There are so many cases of off-target effects of siRNA in the literature, and specifically for a strong phenotype on S-phase as described here, I would expect to see solid results by additional experiments. This is especially important since the ko mice do not show any significant developmental cell cycle phenotypes. Moreover, FAM53C does not show a strong fitness effect in the depmap dataset, suggesting that it is largely non-essential in most cancer cell lines. For this paper to reach publication in a high-standard journal, I would expect that the authors show a rescue of the S-phase phenotype using an siRNA-resistant cDNA, and show similar S-phase defects using an acute knock out approach with lentiviral gRNA/Cas9 delivery. 

      We thank the Reviewer for this comment. Please refer to the initial response to the three Reviewers, where we discuss our use of single siRNAs and our results in multiple cell lines. Briefly, we can recapitulate the G1 arrest upon FAM53C knock-down using two independent siRNAs in RPE-1 cells. We also observe the same G1 arrest in p53 knockout cells, suggesting it is not due to a non-specific stress response. In addition, the arrest is dependent on RB, which fits with the genetic and biochemical data placing FAM53C upstream of RB, further supporting a specific phenotype. Human cancer cell lines also arrest in G1 upon FAM53C knock-down, not just RPE-1 cells. Finally, we hope the Reviewer will agree with us that compensatory mechanisms are very common in the cell cycle – which may explain the lack of phenotypes in vivo or upon long-term knockout of FAM53C.

      (2) The S-phase phenotype following FAM53C should be demonstrated in a larger variety of TP53WT and mutant cell lines. Given that this paper introduces a new G1/S control element, I think this is important for credibility. Ideally, this should be done with acute gRNA/Cas9 gene deletion using a lentiviral delivery system; but if the siRNA rescue experiments work and validate an on-target effect, siRNA would be an appropriate alternative. 

      We now show data with three cancer cell lines (U2OS, A549, and HCT-116 – Fig. S1E,F and Fig. 4F), in addition to our results in RPE-1 cells and in human cortical organoids. We note that the knock-down experiments are complemented by overexpression data (Fig. 1G-I), by genetic data (our original DepMap screen), and our biochemical data (showing direct binding of FAM53C to DYRK1A).

      (3) The western blot images shown in the MS appear heavily over-processed and saturated (See for example S4B, 4A, B, and E). Perhaps the authors should provide the original un-processed data of the entire gels? 

      For several of our panels (e.g., 4E and S4B, now panels S3J and S3K), we used a true “immunoassay” (as indicated in the legend – not an immunoblot), which is much more quantitative and avoids error-prone steps in standard immunoblots (“Western blots”). Briefly, this system was developed by ProteinSimple. It uses capillary transfer of proteins and ELISA-like quantification with up to 6 logs of dynamic range (see their web site https://www.proteinsimple.com/wes.html). The “bands” we show are just a representation of the luminescence signals in capillaries. We made sure to further clarify the figure legends in the revised manuscript.

      Data in 4A are also not a western blot but a radiograph.

      For immunoblots, we will provide all the source data with uncropped blots with the final submission.

      (4) A critical experiment for the proposed mechanism is the rescue of the FAM53C S-phase reduction using DYRK1A inhibition shown in Figure 4. The legend here states that the data were extracted from BrdU incorporation assays, but in Figure S4D only the PI histograms are shown, and the S-phase population is not quantified. The authors should show the BrdU scatterplot and quantify the phenotype using the S-phase population in these plots. G1 measurements from PI histograms are not precise enough to allow for conclusions. Also, why are the intensities of the PI peaks so variable in these plots? Compare, for example, the HCT116 upper and lower panels where the siRNA appears to have caused an increase in ploidy. 

      We apologize for the confusion and have fixed these errors. For most of the analyses, we used PI to measure G1 and S-phase entry. We added relevant flow cytometry plots to supplemental figures (Fig. S1G, H, I, as well as Fig. S4E and S4K, and Fig. S5F).

      (5) There's an apparent contradiction in how RB deletion rescues the G1 arrest (Figure 2) while p21 seems to maintain the arrest even when DYRK1A is inhibited. Is p21 not induced when FAM53C is depleted in RB ko cells? This should be measured and discussed. 

      This comment and comments from the two other Reviewers made us reconsider our model. We carefully re-read the Meyer paper and think that DYRK1A activity may be understood when considering levels of both CycD and p21 at the same time in a continuum (as was nicely shown in a previous study from the lab of Tobias Meyer – Chen et al., Mol Cell, 2013). While our genetic and biochemical data support a role for FAM53C in DYRK1A inhibition, it is obvious that the regulation of cell cycle progression by FAM53C is not exclusively due to this inhibition. As discussed above and below, we noted an upregulation of p21 upon FAM53C knock-down, and activation of p53 and its targets likely contributes significantly to the phenotypes observed. We added new experiments to support this more complex model (Figure 4 and Figure S4, with new model in S4L).

      Reviewer #3 (Significance): 

      In conclusion, I believe that this MS could potentially be important for the cell cycle field and also provide a new target pathway that could be relevant for cancer therapy. However, the paper has quite a few gaps and inconsistencies that need to be addressed with further experiments. My main worry is that the acute depletion phenotypes appear so strong, while the gene is nonessential in mice and shows only a minor fitness effect in the depmap screens. More convincing controls are necessary to rule out experimental artefacts that misguide the interpretation of the results.

      We appreciate this comment and hope that the Reviewer will agree it is still important to share our data with the field, even if the phenotypes in mice are modest.

    1. Author response:

      Reviewer #1 (Public review): 

      Summary: 

      Cotton et al. investigated the role of tusB in antibiotic tolerance in Yersinia pseudotuberculosis. They used the IP2226 strain and introduced appropriate mutations and complementation constructs. Assays were performed to measure growth rates, antibiotic tolerance, tRNA modification, gene expression and proteomic profiles. In addition, experiments to measure ribosome pausing and bioinformatic analysis of codon usage in ribosomal proteins provided in-depth mechanistic support for the conclusions. 

      Strengths: 

      The findings are consistent with the authors having uncovered new mechanistic insights into bacterial antibiotic tolerance mediated by reducing ribosomal protein abundance. 

      Weaknesses: 

      Since the WT strain grows faster than the tusB mutant, there is a question of how growth rate, per se, impacts some of the analysis done. The authors should address this issue. In addition, it may not be essential, but would analysis of another slow-growing mutant (in some other antibiotic tolerance pathway if available) serve as a good control in this context? 

      We would like to thank the reviewer for their time spent reviewing our manuscript and for their positive review. We plan to address their comment as to how growth rate impacts the analyses and plan to incorporate another slow-growing mutant in the revised version of the manuscript.

      Reviewer #2 (Public review): 

      Summary: 

      This study addresses a critical clinical challenge, bacterial antibiotic tolerance (a key driver of treatment failure distinct from genetic resistance), by uncovering a novel regulatory role of the conserved s2U tRNA modification in Yersinia pseudotuberculosis. Its strengths are notable and lay a solid foundation for understanding phenotypic drug tolerance. The study is the first to link s2U tRNA modification loss to antibiotic tolerance, specifically targeting translation/transcription-inhibiting antibiotics (doxycycline, gentamicin, rifampicin). By establishing a causal chain (s2U deficiency → codon-specific ribosome pausing at AAA/CAA/GAA → reduced ribosomal protein translation → global translational suppression → tolerance), it expands the functional landscape of tRNA modifications beyond canonical translation fidelity, filling a gap in how RNA epigenetics shapes bacterial stress adaptation.

      Strengths: 

      This study makes a valuable contribution to understanding tRNA modification-mediated antibiotic tolerance. 

      Weaknesses: 

      There are several limitations that weaken the robustness of the study's mechanistic conclusions. Addressing these gaps would significantly enhance its impact and translational potential. 

      We would like to thank the reviewer for their time spent reviewing our manuscript, and for both their positive comments about the significance and novelty of this work as well as their critiques. We plan to address their specific recommendations in the revised manuscript by focusing on the contribution of specific ribosomal proteins (i.e. the 30S subunit protein, S13) through overexpression, codon replacement, and stability experiments. We also plan to design experiments to assess in vivo relevance and assess possible impacts on other pathways involved in antibiotic tolerance.

      Reviewer #3 (Public review): 

      Summary: 

      In the manuscript of Cotten et al., the authors study the 2-thiolation of tRNA in bacterial antibiotic resistance. The wildtype organism, Yersinia pseudotuberculosis, downregulates 2-thiolation as a response to antibiotics targeting the ribosome. In this manuscript, the authors show that a knockout of tusB causes slower translation. They provide evidence on the mechanisms of the slowing by determining transcription and translation, ribosome profiling and performing codon-usage analysis. They successfully determined that 2 codons are drivers of the translation slowdown, and the data is highly conclusive. Technically, I have nothing to criticize. 

      Strengths: 

      All in all, the study is very well made, and the writing is clear and concise. It covers a wide array of state-of-the-art analyses to unravel the interplay of tRNA modifications in translation. 

      Weaknesses: 

      The only question that remains to be asked is why the slowed translation leads to a better survival of the bacteria under antibiotic stress. In my opinion, the mechanism itself remains unclear. Thus, the statement that "We expect that this reduction in ribosomal proteins is globally reducing the translational capacity of the cell and is responsible for inducing tolerance to ribosome and RNA polymerase-targeting antibiotics" does not truly emphasize the remaining open question of why slowed translation favors survival. Therefore, I would recommend a minor text revision. 

      We would like to thank the reviewer for their time spent reviewing our manuscript and for their positive review of the technical aspects, experimental design, and writing. We will incorporate their suggested text revision into the revised manuscript, and will add to this statement if additional planned experiments shed light on this remaining question.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment:

      This valuable study examines how mammals descend effectively and securely along vertical substrates. The conclusions from comparative analyses based on behavioral data and morphological measurements collected from 21 species across a wide range of taxa are convincing, making the work of interest to all biologists studying animal locomotion.

      We would like to greatly thank the two reviewers for their time in reviewing this work, and for their valuable comments and suggestions that will help to improve this manuscript.

      Overall, we agree with the weaknesses raised, which are mainly areas for consideration in future studies: to study more species, and in a natural habitat context.

      We will nevertheless add a few modifications to improve the manuscript, notably by making certain figures more readable, and adding definitions and bibliography in the main text concerning gait characteristics.

      We also provide brief comments on each point of weakness raised by the reviewers below, in blue.

      Reviewer #1 (Public review):

      Summary:

      This unique study reports original and extensive behavioral data collected by the authors on 21 living mammal taxa in zoo conditions (primates, tree shrew, rodents, carnivorans, and marsupials) on how descent along a vertical substrate can be done effectively and securely using gait variables. Ten morphological variables reflecting head size and limb proportions are examined in relationship to vertical descent strategies and then applied to reconstruct modes of vertical descent in fossil mammals.

      Strengths:

      This is a broad and data-rich comparative study, which requires a good understanding of the mammal groups being compared and how they are interrelated, the kinematic variables that underlie the locomotion used by the animals during vertical descent, and the morphological variables that are associated with vertical descent styles. Thankfully, the study presents data in a cogent way with clear hypotheses at the beginning, followed by results and a discussion that addresses each of those hypotheses using the relevant behavioral and morphological variables, always keeping in mind the relationships of the mammal groups under investigation. As pointed out in the study, there is a clear phylogenetic signal associated with vertical descent style. Strepsirrhine primates much prefer descending tail first, platyrrhine primates descend sideways when given a choice, whereas all other mammals (with the exception of the raccoon) descend head first. Not surprisingly, all mammals descending a vertical substrate do so in a more deliberate way, by reducing speed, and by keeping the limbs in contact for a longer period (i.e., higher duty factors).

      Weaknesses:

      The different gait patterns used by mammals during vertical descent are a bit more difficult to interpret. It is somewhat paradoxical that asymmetrical gaits such as bounds, half bounds, and gallops are more common during descent since they are associated with higher speeds and lower duty factors. Also, the arguments about the limb support polygons provided by DSDC vs. LSDC gaits apply for horizontal substrates, but perhaps not as much for vertical substrates.

      We analyzed gait patterns using methods commonly found in the literature and discussed our results accordingly. However, the study of limb support polygons was indeed developed specifically for studying locomotion on horizontal supports, and may not be applicable for studying vertical locomotion, which is in fact a type of locomotion shared by all arboreal species. In the future, it would be interesting to consider new methods for analyzing vertical gaits.

      The importance of body mass cannot be overemphasized as it affects all aspects of an animal's biology. In this case, larger mammals with larger heads avoid descending head-first. Variation in trunk/tail and limb proportions also covaries with different vertical descent strategies. For example, a lower intermembral index is associated with tail-first descent. That said, the authors are quick to acknowledge that the five lemur species of their sample are driving this correlation. There is a wide range of intermembral indices among primates, and this simple measure of forelimb over hindlimb has vital functional implications for locomotion: primates with relatively long hindlimbs tend to emphasize leaping, primates with more even limb proportions are typically pronograde quadrupeds, and primates with relatively long forelimbs tend to emphasize suspensory locomotion and brachiation. Equally important is the fact that the intermembral index has been shown to increase with body mass in many primate families as a way to keep functional equivalence for (ascending) climbing behavior (see Jungers, 1985). Therefore, the manner in which a primate descends a vertical substrate may just be a by-product of limb proportions that evolved for different locomotor purposes. Clearly, more vertical descent data within a wider array of primate intermembral indices would clarify these relationships. Similarly, vertical descent data for other primate groups with longer tails, such as arboreal cercopithecoids, and particularly atelines with very long and prehensile tails, should provide more insights into the relationship between longer tail length and tail-first descent observed in the five lemurs. The relatively longer hallux of lemurs correlates with tail-first descent, whereas the more evenly grasping autopods of platyrrhines allow for all four limbs to be used for sideways descent. In that context, the pygmy loris offers a striking contrast. Here is a small primate equipped with four pincer-like, highly grasping autopods and a tail reduced to a short stub. Interestingly, this primate is unique within the sample in showing the strongest preference for head-first descent, just like other non-primate mammals. Again, a wider sample of primates should go a long way in clarifying the morphological and behavioral relationships reported in this study.

      We agree with this statement. In the future, we plan to study other species, particularly large-bodied ones with varied intermembral indexes.

      Reconstruction of the ancient lifestyles, including preferred locomotor behaviors, is a formidable task that requires careful documentation of strong form-function relationships from extant species that can be used as analogs to infer behavior in extinct species. The fossil record offers challenges of its own, as complete and undistorted skulls and postcranial skeletons are rare occurrences. When more complete remains are available, the entire evidence should be considered to reconstruct the adaptive profile of a fossil species rather than a single ("magic") trait.

      We completely agree with this, and we would like to emphasize that our intention here was simply to conduct a modest inference test, the purpose of which is to provide food for thought for future studies, and whose results should be considered in light of a comprehensive evolutionary model.

      Reviewer #2 (Public review):

      Summary:

      This paper contains kinematic analyses of a large comparative sample of small to medium-sized arboreal mammals (n = 21 species) traveling on near-vertical arboreal supports of varying diameter. These data are paired with morphological measures from the extant sample to reconstruct potential behaviors in a selection of fossil euarchontoglires. This research is valuable to anyone working in mammal locomotion and primate evolution.

      Strengths:

      The experimental data collection methods align with best research practices in this field and are presented with enough detail to allow for reproducibility of the study as well as comparison with similar datasets. The four predictions in the introduction are well aligned with the design of the study to allow for hypothesis testing. Behaviors are well described and documented, and Figure 1 does an excellent job in conveying the variety of locomotor behaviors observed in this sample. I think the authors took an interesting and unique angle by considering the influence of encephalization quotient on descent and the experience of forward pitch in animals with very large heads.

      Weaknesses:

      The authors acknowledge the challenges that are inherent with working with captive animals in enclosures and how that might influence observed behaviors compared to these species' wild counterparts. The number of individuals per species in this sample is low; however, this is consistent with the majority of experimental papers in this area of research because of the difficulties in attaining larger sample sizes.

      Yes, that is indeed the main cost/benefit trade-off with this type of study. Working with captive animals allows for large comparative studies, but there is a risk of variations in locomotor behavior among individuals in the natural environment, as well as few individuals per species in the dataset. That is why we plan and encourage colleagues to conduct studies in the natural environment to compare with these results. However, this type of study is very time-consuming and requires focusing on a single species at a time, which limits the comparative aspect.

      Figure 2 is difficult to interpret because of the large amount of information it is trying to convey.

      We agree that this figure is dense. One possible solution would be to combine species by phylogenetic groups to reduce the amount of information, as we did with Fig. 3 on the dataset relating to gaits. However, we believe that this would be unfortunate in the case of speed and duty factor because we would have to provide the complete figure in SI anyway, as the species-level information is valuable. We therefore prefer to keep this comprehensive figure here and we will enlarge the data points to improve their visibility, and provide the figure with a sufficiently high resolution to allow zooming in on the details.

      Reviewer #1 (Recommendations for the authors):

      As indicated in the first section above, this is a strong comparative study that addresses important questions, relative to the evolution of arboreal locomotion in primates and close mammal relatives. My recommendations should be taken in the context of improving a manuscript that is already generally acceptable.

      (1) The terms symmetrical and asymmetrical gaits should be briefly defined in the main text (not just in the Methods section) by citing work done by Hildebrand and other relevant studies. To that effect, the statement on lines 96-97 about the convergence of symmetrical gaits is unclear. What does "Symmetrical gaits have evolved convergently in rodents, scandentians, carnivorans, and marsupials" mean? Symmetrical gaits such as the walk, run, trot, etc., are pretty much the norm in most mammals and were likely found in metatherians and basal eutherians. This needs clarification. On line 239, the term "ambling" is used in the context of related asymmetrical gaits. To be clear, the amble is a type of running gait involving no whole-body aerial phase and is therefore a symmetrical gait (see Schmitt et al., 2006).

      We have added a definition of the terms symmetrical and asymmetrical gaits and added references in the introduction such as: “Symmetrical gaits are defined as locomotor patterns in which the footfalls of a girdle (a pair of fore- or hindlimbs) are evenly spaced in time, with the right and left limbs of a pair of limbs being approximately 50% out of phase with each other (Hildebrand, 1966, 1967). Symmetrical gaits can be further divided into two types: diagonal-sequence gaits, in which a hindlimb footfall is followed by that of the contralateral forelimb, and lateral-sequence gaits, in which a hindlimb footfall is followed by that of the ipsilateral forelimb (Hildebrand, 1967; Shapiro and Raichlen, 2005; Cartmill et al., 2007b). In contrast, asymmetrical gaits are characterized by unevenly spaced footfalls within a girdle, with the right and left limbs moving in near synchrony (Hildebrand, 1977).” Now found in lines 87-94.

      We corrected the sentence to read: “Symmetrical gaits are also common in rodents, scandentians, etc..” Now found in line 107.

      Thank you for pointing this out. We indeed used the wrong term to refer to related asymmetrical gaits with increased duty factors. We removed the term "ambling" and the associated reference here. Now found in line 256.

      (2) Correlations are used in the paper to examine how brain mass scales with body mass. It is correct to assume that a correlation significantly different from 0 is indicative of allometry (in this case, positive). That said, lines are used in Figure S2 that go through the bivariate scatter plot. The vast majority of scaling studies rely on regression techniques to calculate and compare slopes, which are different statistically from correlations. In this case, a slope not significantly different from 1.0 would support the hypothesis of isometry based on geometric similarity (as brain mass and body mass are two volumes). The authors could refer to the work of Bob Martin and the 1985 edited book by Jungers and contributions therein. These studies should also be cited in the paper.

      Thank you for recommending this better-suited method. We replaced the correlations with major axis orthogonal regressions, as recommended by Martin and Barbour 1989. Across all species we found a positive slope (0.36) that is significantly different from 1, indicating negative allometry (we realized we had misused the allometry terminology, initially reporting a “positive allometry” where we meant a positive correlation).

      We corrected the Results and Methods sections of the manuscript accordingly and cited Martin and Barbour 1989, as follows:

      “To ensure that the EQs of the different species studied are comparable and meaningful, we tested the allometry between the brain and body masses in our dataset following [84] and found a significant and positive slope for all species (major axis orthogonal regression on log transformed values: slope = 0.36, r<sup>2</sup> = 0.92, p = 5.0 × 10<sup>-12</sup>), indicating a negative allometry (r = 0.97, df = 19, p = 2.0 × 10<sup>-13</sup>), and similar allometric coefficients when restricting the analysis to phylogenetic groups (Fig. S2).” Now found in lines 289-298.

      - “To control that brain allometry is homogeneous among all phylogenetic groups, to be able to compare EQ between species, we computed major axis orthogonal regressions, following the recommendation of Martin and Barbour [84], between the Log transformed brain and body masses, over all species and by phylogenetic group using the sma package in R (Fig. S2).” Now found in lines 336-338.

      We also changed Figure S2 in Supplementary Information accordingly.
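
      A minimal sketch of the major axis (orthogonal) regression described above, written in Python rather than the R workflow the authors used. The body and brain masses are made-up values generated from an exact power law with exponent 0.36 (the slope reported in the response), and the function name `major_axis_slope` is hypothetical. The slope uses the standard closed-form major axis expression built from the sample variances and covariance.

```python
import math


def major_axis_slope(x, y):
    """Return (slope, intercept) of the major axis line through (x, y).

    The major axis minimizes perpendicular (orthogonal) distances to the
    line, unlike ordinary least squares, which minimizes vertical distances.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
    s_yy = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    if s_xy == 0:
        raise ValueError("covariance is zero; the major axis slope is undefined")
    slope = (s_yy - s_xx + math.sqrt((s_yy - s_xx) ** 2 + 4 * s_xy ** 2)) / (2 * s_xy)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: brain mass follows an exact power law of body mass.
body_mass_g = [50.0, 200.0, 1000.0, 5000.0, 20000.0]
brain_mass_g = [0.1 * m ** 0.36 for m in body_mass_g]

# Regress on log-transformed values, as in the quoted Methods text.
log_body = [math.log10(m) for m in body_mass_g]
log_brain = [math.log10(m) for m in brain_mass_g]

slope, intercept = major_axis_slope(log_body, log_brain)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

      A slope below 1 on log-log axes corresponds to negative allometry. Deciding whether the slope differs significantly from 1 requires a confidence interval, which this sketch does not compute.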

      (3) Trunk length is used as the denominator for many of the indices used in the study. In this way, trunk length is considered to be a proxy for body size. There should be a demonstration that trunk length scales isometrically with body mass in all of the mammals compared. If not the case, some of the indices may not be directly comparable.

      We did not use trunk length as a proxy for body mass, but to compute geometric body proportions in order to test whether intrinsic body proportions could be related to vertical descent behaviors, namely the length of the tail and of the fore- and hindlimbs relative to the animal. We chose those indices to quantify the capability of limbs to act as levers or counterweights to rotate the animals for this specific question of vertical descent behavior. We therefore do not think that body mass allometry with respect to trunk length is relevant to compare these indices across species here. Also, we don’t expect that trunk length (which is a single dimension) would scale isometrically with body mass, which scales more as a volume.

      (4) Given the numerous comparisons done in this study, a Bonferroni correction method should be considered to mitigate type I error (accepting a false positive).

      We had already corrected all our statistical tests using the Benjamini-Hochberg method to control for false positives; see the SuppTables Excel file for the complete results of the statistical analyses. We chose this method over the Bonferroni correction because the more modern and balanced Benjamini-Hochberg procedure is better suited for analyses involving a large number of hypotheses.
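
      To illustrate the difference between the two corrections mentioned above, here is a small sketch in plain Python. The p-values are invented for illustration and are not taken from the study, and the function names are hypothetical. Bonferroni multiplies every p-value by the number of tests, whereas Benjamini-Hochberg scales each p-value by the number of tests divided by its rank and then enforces monotonicity.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values (controls the family-wise error rate)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented p-values, for illustration only.
raw_p = [0.001, 0.008, 0.012, 0.03, 0.20]
alpha = 0.05

bonf = bonferroni(raw_p)
bh = benjamini_hochberg(raw_p)

n_bonf = sum(p < alpha for p in bonf)
n_bh = sum(p < alpha for p in bh)
print(f"significant after Bonferroni: {n_bonf}; after Benjamini-Hochberg: {n_bh}")
```

      With these invented values, Benjamini-Hochberg retains more tests as significant than Bonferroni, which reflects its less conservative control of the false discovery rate when many hypotheses are tested.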

      (5) The terms "arm" and "leg" used in the main text and Table 1 are anatomically incorrect. Instead, the terms "forelimb" and "hindlimb" should be used as they include the length sum of the stylopod, zeugopod, and autopod.

      Indeed, thank you for pointing that out. We have corrected this error in the manuscript as well as in Figures 4 and S3.

      (6) On p. 14, the authors make the statement that the postcranial anatomy of Adapis and Notharctus remains undescribed. The authors should consult the work of Dagosto, Covert, Godinot and others.

      We did not state that the postcranial remains of Adapis and Notharctus have not been described. However, we were unfortunately unable to find published illustrations of the known postcranial elements that could be reliably used in this study. To avoid any misunderstanding, we revised the sentence to read: “However, we could not find suitable illustrations of the known postcranial elements of these species in the literature that could be reliably incorporated into this study. Thus, we only included their reconstructed body mass and EQ,..”. Now found in lines 393-397.

      Reviewer #2 (Recommendations for the authors):

      (1) Line 65/69 - Perchalski et al. 2021 is a single-author publication, so no et al. or w/ colleagues.

      Indeed. This has been corrected in the manuscript, now found in lines 65 and 70.

      (2) Lines 96-98 - Is it appropriate to say that the use of symmetrical gaits is an example of convergent evolution? There's less burden of evidence to state that these are shared behaviors, rather than suggesting they independently evolved across all those groups.

      We agree with this and corrected the sentence to read: “Symmetrical gaits are also common in rodents, scandentians, etc..” Now found in line 107.

      (3) Line 198 - I am confused by how to interpret (-16,36 %) compared to how other numbers are presented in the rest of the paragraph.

      To avoid confusion, we rephrased this sentence as follows: “In contrast, primates did not significantly reduce their speed compared to ascents when descending sideways or tail-first (Fig. 2A, SuppTables B).” Now found in lines 207-209.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      In this study, the authors aim to understand how Rhino, a chromatin protein essential for small RNA production in fruit flies, is initially recruited to specific regions of the genome. They propose that asymmetric arginine methylation of histones, particularly mediated by the enzyme DART4, plays a key role in defining the first genomic sites of Rhino localization. Using a combination of inducible expression systems, chromatin immunoprecipitation, and genetic knockdowns, the authors identify a new class of Rhino-bound loci, termed DART4 clusters, that may represent nascent or transitional piRNA clusters.

      Strengths:

      One of the main strengths of this work lies in its comprehensive use of genomic data to reveal a correlation between ADMA histones and Rhino enrichment at the border of known piRNA clusters. The use of both cultured cells and ovaries adds robustness to this observation. The knockdown of DART4 supports a role for H3R17me2a in shaping Rhino binding at a subset of genomic regions.

      Weaknesses:

      However, Rhino binding at, and piRNA production from, canonical piRNA clusters appears largely unaffected by DART4 depletion, and spreading of Rhino from ADMA-rich boundaries was not directly demonstrated. Therefore, while the correlation is clearly documented, further investigation would be needed to determine the functional requirement of these histone marks in piRNA cluster specification.

      The study identifies piRNA cluster-like regions called DART4 clusters. While the model proposes that DART4 clusters represent evolutionary precursors of mature piRNA clusters, the functional output of these clusters remains limited. Additional experiments could help clarify whether low-level piRNA production from these loci is sufficient to guide Piwi-dependent silencing.

      In summary, the authors present a well-executed study that raises intriguing hypotheses about the early chromatin context of piRNA cluster formation. The work will be of interest to researchers studying genome regulation, small RNA pathways, and the chromatin mechanisms of transposon control. It provides useful resources and new candidate loci for follow-up studies, while also highlighting the need for further functional validation to fully support the proposed model.

      We sincerely thank Reviewer #1 for the thoughtful and constructive summary of our work. We appreciate the reviewer’s recognition that our study provides a comprehensive analysis of the relationship between ADMA-histones and Rhino localization, and that it raises intriguing hypotheses about the early chromatin context of piRNA cluster formation.

      We fully agree with the reviewer that our data primarily demonstrate correlation between ADMA-histones and Rhino localization, rather than direct causation. In response, we have carefully revised the text throughout the manuscript to avoid overstatements implying causality (details provided below).

      We also acknowledge the reviewer’s important point that the functional requirement of ADMA-histones for piRNA cluster specification remains to be further established. We have now added a discussion of our experimental limitations (page 18).

      Overall, we have revised the manuscript to present our findings more cautiously and transparently, emphasizing that our data reveal a correlation between ADMA-histone marks and the initial localization of Rhino, rather than proving a direct mechanistic requirement. We thank the reviewer again for highlighting these important distinctions.

      Reviewer #2 (Public review):

      This study seeks to understand how the Rhino factor knows how to localize to specific transposon loci and to specific piRNA clusters to direct the correct formation of specialized heterochromatin that promotes piRNA biogenesis in the fly germline. In particular, these dual-strand piRNA clusters with names like 42AB, 38C, 80F, and 102F generate the bulk of ovarian piRNAs in the nurse cells of the fly ovary, but the evolutionary significance of these dual-strand piRNA clusters remains mysterious since triple null mutants of these dual-strand piRNA clusters still allow fly ovaries to develop and remain fertile. Nevertheless, mutants of Rhino and its interactors Deadlock, Cutoff, Kipferl and Moonshiner, etc., cause more piRNA loss beyond these dual-strand clusters and exhibit the phenotype of major female infertility, so the impact of proper assembly of Rhino, the RDC, Kipferl, etc., onto proper piRNA chromatin is an important and interesting biological question that is not fully understood.

      This study tries to first test ectopic expression of Rhino via engineering a Dox-inducible Rhino transgene in the OSC line that only expresses the primary Piwi pathway that reflects the natural single pathway expression in the follicle cells and is quite distinct from the nurse cell germline piRNA pathway that is promoted by Rhino, Moonshiner, etc. The authors present some compelling evidence that this ectopic Rhino expression in OSCs may reveal how Rhino can initiate de novo binding via ADMA histone marks, a feat that would be much more challenging to demonstrate in the germline where this epigenetically naïve state cannot be modeled since germ cell collapse would likely ensue. In the OSC, the authors have tested the knockdown of four of the 11 known Drosophila PRMTs (DARTs), and comparing to ectopic Rhino foci that they observe in HP1a knockdown (KD), they conclude DART1 and DART4 are the prime factors to study further in looking for disruption of ADMA histone marks. The authors also test KD of DART8 and CG17726 in OSCs, but in the fly, the authors test germline KD (GLKD) of DART4 only. They do not explain why these other DARTs are not tested in GLKD; the UAS-RNAi resources in Drosophila strain repositories should be very complete and should make reagents for these knockdowns accessible.

      The authors only characterize some particular ADMA marks of H3R17me2a as showing strong decrease after DART4 GLKD, and then they see some small subset of piRNA clusters go down in piRNA production as shown in Figure 6B and Figure 6F and Supplementary Figure 7. This small subset of DART4-dependent piRNA clusters does lose Rhino and Kipferl recruitment, which is an interesting result.

      However, the biggest issue with this study is the mystery that the most prominent dual-strand piRNA clusters, 42AB, 38C, 80F, and 102F, are the prime genomic loci subjected to Rhino regulation, yet they do not show any change in piRNA production in the GLKD of DART4. The authors bury this surprising negative result in Supplementary Figure 5E, but this is also evident in no decrease (actually an n.s. increase) in Rhino association in Figure 5D. Since these main piRNA clusters involve the RDC, Kipferl, Moonshiner, etc., and show no change in ADMA status and no piRNA loss after DART4 GLKD, this poses a problem with the model in Figure 7C. In this study, there is only a GLKD of DART4 and no GLKD of the other DARTs in fly ovaries.

      One way the authors rationalize this peculiar exception is the argument that DART4 is only acting on evolutionarily "young" piRNA clusters like the bx, CG14629, and CG31612, but the lack of any change on the majority of other piRNA clusters in Figure 6F leaves open the unsatisfying concern that there is much functional redundancy remaining with other DARTs not being tested by GLKD in the fly that would have a bigger impact on the other main dual-strand piRNA clusters being regulated by Rhino and ADMA-histone marks.

      Also, the current data does not provide convincing enough support for the model Figure 7C and the paper title of ADMA-histones being the key determinant in the fly ovary for Rhino recognition of the dual-strand piRNA clusters. Although much of this study's data is well constructed and presented, there remains a large gap that no other DARTs were tested in GLKD that would show a big loss of piRNAs from the main dual-strand piRNA clusters of 42AB, 38C, 80F, and 102F, where Rhino has prominent spreading in these regions.

      As the manuscript currently stands, I do not think the authors present enough data to conclude that "ADMA-histones [As a Major new histone mark class] does play a crucial role in the initial recognition of dual-strand piRNA cluster regions by Rhino" because the data here mainly just show a small subset of evolutionarily young piRNA clusters have a strong effect from GLKD of DART4. The authors could extensively revise the study to be much more specific in the title and conclusion that they have uncovered this very unique niche of a small subset of DART4-dependent piRNA clusters, but this niche finding may dampen the impact and significance of this study since other major dual-strand piRNA clusters do not change during DART4 GLKD, and the authors do not show data from GLKD of any other DARTs. The niche finding of just a small subset of DART4-dependent piRNA clusters might make another specialized genetics forum a more appropriate venue.

      We are deeply grateful to Reviewer #2 for the detailed and insightful review that carefully situates our study in the broader context of Rhino-mediated piRNA cluster regulation. We appreciate the reviewer’s recognition that our inducible Rhino expression system in OSCs provides a valuable model to explore de novo Rhino recruitment under a simplified chromatin environment.

      At the same time, we agree that the current data mainly support a role for DART4 in regulating a subset of evolutionarily young piRNA clusters, and do not demonstrate a requirement for ADMA-histones at the major dual-strand piRNA clusters such as 42AB or 38C. We have therefore revised the title and main conclusions to more accurately reflect the scope of our findings.

      We agree with the reviewer that functional redundancy among DARTs may explain why major dual-strand piRNA clusters are unaffected by DART4 GLKD. Indeed, we attempted germline knockdown of DART1, whose knockdown in OSCs causes Rhino foci to collapse. For DART1 GLKD, two approaches were possible:

      (1) Crossing the BDSC UAS-RNAi line (ID: 36891) with nos-GAL4.

      (2) Crossing the VDRC UAS-RNAi line (ID: 110391) with nos-GAL4 and UAS-Dcr2.

      The first approach was not feasible because the UAS-RNAi line consistently arrived dead and could not be maintained in our laboratory. The second approach did not yield effective and stable knockdown (as follows).

      DART8 and CG17726 did not alter Rhino foci in OSC knockdown experiments; therefore, we did not attempt germline knockdown (GLKD) of these DARTs in the ovary.  We agree with the reviewer’s opinion that there are piRNA source loci where Rhino localization depends on DART1, and that simultaneous depletion of multiple DARTs may indeed reveal additional positive results because ADMA-histones such as H3R8me2a may be completely eliminated by the knockdown of multiple DARTs. At the same time, we note that many evolutionarily conserved piRNA clusters show a loss of ADMA accumulation compared with evolutionarily young piRNA clusters, with levels that are comparable to the background input in ChIP-seq reads. Therefore, conserved clusters such as 42AB and 38C may no longer be regulated by ADMA. Even if multiple DARTs function redundantly to regulate ADMA, it may be difficult to disrupt Rhino localization at such conserved piRNA clusters by depletion of DARTs. While disruption of Rhino localization at conserved clusters like 42AB and 38C may be challenging, we cannot exclude the possibility that DART depletion affects Rhino binding at less conserved piRNA clusters, where ADMA modification remains detectable. We added clarifications in the Discussion to acknowledge the potential redundancy with other DARTs and to note that further knockdown experiments in the germline will be necessary to test this model comprehensively (page 18).

      We appreciate the reviewer’s critical feedback, which has helped us refine the message and strengthen the interpretative balance of the paper.

      Reviewer #1 (Recommendations for the authors):

      In multiple places, the link between ADMA histones and Rhino recruitment is presented in terms that imply causality. Please revise these statements to reflect that, in most cases, the evidence supports correlation rather than direct functional necessity. Similarly, statements suggesting that ADMA histones promote Rhino spreading should be revised unless supported by direct evidence.

      We sincerely thank the reviewer for the insightful comments. We recognize that these suggestions are crucial for improving the manuscript, and we have revised it accordingly to address the concerns. The specific revisions we made are detailed below.

      (1) Page 1, line 14: The original sentence “in establishing the sites” was changed to “may establish the potential sites.”

      (2) Page 4, lines 11-12: The original sentence “genomic regions where Rhino binds at the ends and propagates in the areas in a DART4-dependent manner, but not stably anchored” was changed to “genomic regions that have ADMA-histones at their ends and exhibit broad Rhino spreading across their internal regions in a DART4-dependent manner”

      (3) Page 4, lines 12-15: The original sentence “Kipferl is present at the regions but not sufficient to stabilize Rhino-genomic binding after Rhino propagates.” was changed to “In contrast to authentic piRNA clusters, Kipferl was lost together with Rhino upon DART4 depletion in these regions, suggesting that Kipferl by itself is not sufficient to stabilize Rhino binding; rather, their localization depends on DART4.”

      (4) Page 4, lines 17-18: The original sentence “are considered to be primitive clusters” was changed to “might be nascent dual-strand piRNA source loci”.

      (5) Page 8, line 7: The original sentence “Involvement of ADMA-histones in the genomic localization of Rhino was implicated.” was changed to “Correlation of ADMA-histones in the genomic localization of Rhino was implicated.”

      (6) Page 8, lines 19-21: The original sentence “These results suggest that ADMA-histones, together with H3K9me3, contribute significantly and specifically to the recruitment of Rhino to the ends of dual-strand clusters in OSCs.” was changed to “These results raise the possibility that ADMA-histones, together with H3K9me3, may contribute specifically to the recruitment of Rhino to the ends of dual-strand clusters in OSCs.”

      (7) Page 10, lines 11-13: The original sentence “These results suggest that DART1 and DART4 are involved in Rhino recruitment at distinct genomic sites through the decreases in ADMA-histones in each of their KD conditions (H4R3me2a and H3R17me2a, respectively).” was changed to “These results suggest that DART1 and DART4 could contribute to Rhino recruitment at distinct genomic sites through the decreases in ADMA-histones in each of their KD conditions (H4R3me2a and H3R17me2a, respectively).”

      (8) Page 13, line 2: The original sentence “Genomic regions where Rhino spreads in a DART4-dependent manner, but not stably anchored, produce some piRNAs” was changed to “Genomic regions where Rhino binds broadly in a DART4-dependent manner, but not stably anchored, produce some piRNAs”

      (9) Page 13, lines 21-22: The original sentence “These results support the hypothesis that ADMA-histones are involved in the genomic binding of Rhino both before and after Rhino spreading, resulting in stable genome binding.” was changed to “These results raise the possibility that a subset of Rhino localized to genomic regions correlating with ADMA-histones may serve as origins of spreading.”

      (10) Page 16, lines 6-8: The original sentence “In this study, we took advantage of cultured OSCs for our analysis and found that chromatin marks (i.e., ADMA-histones) play a crucial role in the loading of Rhino onto the genome.” was changed to “In this study, we took advantage of cultured OSCs for our analysis and found that chromatin marks (i.e., bivalent nucleosomes containing H3K9me3 and ADMA-histones) appear to contribute to the initial loading of Rhino onto the genome.”

      (11) Page 16, line 12: The original sentence “We propose that the process of piRNA cluster formation begins with the initial loading of Rhino onto bivalent nucleosomes containing H3K9me3 and ADMA-histones (Fig. 7C). In OSCs, the absence of Kipferl and other necessary factors means that Rhino loading into the genome does not proceed to the next step.” was removed.

      Major points

      (1) Clarify the limited colocalization between Rhino and H3K9me3 in OSCs. The observation that FLAG-Rhino foci show minimal overlap with H3K9me3 in OSCs appears inconsistent with the proposed model by the authors in the discussion, in which Rhino is initially recruited to bivalent nucleosomes bearing both H3K9me3 and ADMA marks. This discrepancy should be addressed.

      We thank the reviewer for the insightful comments. Indeed, ChIP-seq shows that Rhino partially overlaps with H3K9me3 (Fig. 1F), but immunofluorescence did not reveal any detectable overlap (Fig. 1A). We interpret this discrepancy as arising from the fact that immunofluorescence primarily visualizes H3K9me3 foci that are localized as broad domains in the genome, such as those at centromeres, pericentromeres, or telomeres (named chromocenters), whereas the sharp and interspersed H3K9me3 signals along chromosome arms are difficult to detect by immunofluorescence. We have now added this explanation to the revised text (page 6).

      (2) Please indicate whether the FLAG-Rhino used in OSCs has been tested for functionality in vivo, for example, by rescuing Rhino mutant phenotypes. This is particularly relevant given that no spreading is observed with this construct.

      We thank the reviewer for raising this important point. We have not directly tested the functionality of the FLAG-Rhino construct used in OSCs in living flies; i.e., it has not been used to rescue Rhino mutant phenotypes in flies. We acknowledge that FLAG-Rhino has not previously been expressed in OSCs, and that its localization pattern in OSCs differs from that observed in ovaries, where Rhino is endogenously expressed. However, several lines of evidence suggest that the addition of the N-terminal FLAG tag is unlikely to compromise Rhino function:

      (1) In previous studies, N-terminally tagged Rhino (e.g., 3xFLAG-V5-Precision-GFP-Rhino) was expressed in a living Drosophila ovary and was shown to localize properly to piRNA clusters, indicating that the tag does not prevent Rhino from binding its genomic targets (Baumgartner et al., 2022; eLife. Fig. 3 supplement 1G).

      (2) In Drosophila S2 cells, a FLAG-tagged tandem Rhino chromodomain construct was shown to bind H3K9me3/H3K27me3 bivalent chromatin, demonstrating that the FLAG tag does not impair this fundamental chromatin interaction (Akkouche et al., 2025; Nat Struct Mol Biol. Fig. 4b).

      (3) GFP-tagged Rhino has been demonstrated to rescue the transposon derepression phenotype of Rhino mutant flies, further supporting that the addition of tags does not abolish its in vivo function (Parhad et al., 2017; Dev Cell. Fig. 1D).

      Therefore, we interpret the partial localization of FLAG-Rhino in OSCs as reflecting the specific chromatin environment and regulatory context of OSCs rather than functional impairment due to the FLAG tag.

      (3) Given the low levels of piRNA production and the absence of measurable effects on transposon expression or fertility upon DART4 knockdown, the rationale for classifying these regions as piRNA clusters should be clearly stated. Additional experiments could help clarify whether low-level piRNA production from these loci is sufficient to guide Piwi-dependent silencing. The authors should also consider and discuss the possibility that some of these differences may reflect background-specific genomic variation rather than DART4-dependent regulation per se.

      We thank the reviewer for the insightful comments. As noted, DART4 knockdown did not measurably affect transposon expression or fertility. piRNAs generated from DART4-associated clusters associate with Piwi but are insufficient for target repression. Although loss of DART4 largely eliminated piRNAs from these clusters, the cluster-derived transcripts themselves were unchanged. To clarify this point, we now refer to these regions as DART4-dependent piRNA-source loci (DART4 piSLs) in the revised text. We also acknowledge that some observed differences may reflect strain-specific genomic variation and have added this caveat on page 16.

      (4) The authors should describe the genomic context of DART4 clusters in more detail. Specifically, it would be helpful to indicate whether these regions overlap with known transposable elements, gene bodies, or intergenic regions, and to report the typical size range of the clusters. Are any of the piRNAs produced from these clusters predicted to target known transcripts?

      We thank the reviewer for the insightful comments. The overlap of DART4 piSL with transposable elements, gene bodies, and intergenic regions is shown in the right panel of Supplementary Fig. 6E (denoted as “Rhino reduced regions in DART4 GLKD” in the figure). The typical size range of these clusters is presented in Supplementary Fig. 6G. The annotation of piRNA reads derived from these piSL is shown in the right panel of Supplementary Fig. 6F, indicating that most of them appear to target host genes. The specific genes and transposons matched by the piRNAs produced from DART4 piSL are listed in Supplementary Table 8.

      (5) While correlations between Rhino and ADMA histone marks (especially H3R8me2a, H3R17me2a, H4R3me2a) are robust, many ADMA-enriched regions do not recruit Rhino. Please discuss this observation and consider the possible involvement of additional factors.

      We thank the reviewer for the insightful comments. As pointed out, not all ADMA-enriched regions recruit Rhino; rather, Rhino is recruited only at sites where ADMAs overlap with H3K9me3. Furthermore, the combination of H3K9me3 and ADMAs alone does not fully account for the specificity of Rhino recruitment, suggesting the involvement of additional co-factors (for example, other ADMA marks such as H3R42me2a, or chromatin-interacting proteins). In addition, since histone modifications, including arginine methylation, may be secondary consequences of modifications on other proteins rather than primary regulatory events, it is possible that DART1/4 contribute to Rhino recruitment not only through histone methylation but also via arginine methylation of non-histone chromatin-interacting factors. However, methylation of HP1a does not appear to be involved (Supplementary Fig. 3G). We have added new sentences about these points in the Discussion section (page 18).

      (6) The manuscript states that Kipferl is present at DART4 clusters but does not stabilize Rhino binding. Please specify which experimental results support this conclusion and explain.

      We apologize for the lack of clarity regarding Kipferl data. Supplementary Fig. 7A and 7B show that Kipferl localizes at major DART4 piSL. This Kipferl localization is lost together with Rhino upon DART4 GLKD, indicating that Rhino localization at DART4 piSL depends on DART4 rather than on Kipferl. From these results, we infer that, unlike at authentic piRNA clusters, Kipferl may not be sufficient to stabilize the association of Rhino with the genome at DART4 piSL. We have added this interpretation on page 14.

      Minor points

      (1) Figure 1D: Please specify which piRNA clusters are included in the metaplot - all clusters, or only the major producers? 

      We thank the reviewer for the question. The metaplot was not generated from a predefined list of “all” piRNA clusters or only the “major producers.” Instead, it was constructed from Rhino ChIP–seq peaks (“Rhino domains”) that are ≥1.5 kb in length. These Rhino domains mainly correspond to the subregions within major dual-strand clusters (e.g., 42AB, 38C) as well as additional clusters such as 80F, 102F, and eyeless, among others. We have provided the full list of domains and their corresponding piRNA clusters (with genomic coordinates) in Supplementary Table 9 and added the additional explanation to the Fig. 1D legend.
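
      The length criterion described above can be sketched as a simple interval filter. This is only an illustration in Python under stated assumptions: the peak names and coordinates are invented, the intervals are assumed to be BED-style (0-based start, exclusive end), and `select_rhino_domains` is a hypothetical helper, not the authors' pipeline.

```python
MIN_DOMAIN_LENGTH_BP = 1500  # the ">= 1.5 kb" threshold stated in the response

# Invented peaks: (chromosome, start, end, name), BED-style coordinates.
peaks = [
    ("chr2R", 10_000, 13_200, "peak_a"),  # 3,200 bp
    ("chr2R", 20_000, 20_900, "peak_b"),  # 900 bp
    ("chr2L", 50_000, 51_500, "peak_c"),  # exactly 1,500 bp
    ("chr3L", 70_000, 71_499, "peak_d"),  # 1,499 bp
]


def select_rhino_domains(peaks, min_length=MIN_DOMAIN_LENGTH_BP):
    """Keep only peaks whose length (end - start) meets the minimum."""
    return [peak for peak in peaks if peak[2] - peak[1] >= min_length]


domains = select_rhino_domains(peaks)
domain_names = [name for _, _, _, name in domains]
print(domain_names)
```

      Only the retained intervals would then be passed on to the metaplot step; how the signal is aggregated across domains is not specified in the response and is not modeled here.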

      (2) Supplemental Figure 5E is referred to as 5D in the main text.

      We corrected the figure citations on pages 11-12: the reference to Supplementary Fig. 5E has been changed to 5D, and the reference to Supplementary Fig. 5F has been changed to 5E.

      (3) Supplemental Figure 7C: The color legend does not match the pie chart, which may confuse readers.

      We thank the reviewer for the helpful comment. We are afraid we were not entirely sure what specific aspect of the legend was confusing, but to avoid any possible misunderstanding, we revised Supplemental Fig. 7C so that the color boxes in the legend now exactly match the corresponding colors in the pie chart. We hope this modification improves clarity.

      (4) Since the manuscript focuses on the roles of DART1 and DART4, including their expression profiles in OSCs and ovaries would help contextualize the observed phenotypes. Please consider adding this information if available.

      We thank the reviewer for the suggestion. We have now included a scatter plot comparing RNA-seq expression in OSCs and ovaries (Supplementary Fig. 3H). In these datasets, DART1 is strongly expressed in both tissues, whereas DART4 shows no detectable reads. Notably, ref. 28 reports strong expression of both DART1 and DART4 in ovaries by western blot and northern blot. In our own qPCR analysis in OSCs, DART4 expression is about 3% of DART1, which, although low, may still be sufficient for functional roles such as modification of H3R17me2a (Fig. 3C, Supplementary Fig. 3F and 3I). We have added these new data and additional explanation in the revised manuscript (page 11).

      (5) Several of the genome browser snapshots, particularly scale and genome coordinates, are difficult to read. 

      We apologize for the difficulty in reading several of the genome browser snapshots in the original submission. We have re-generated the relevant figures using IGV, which provides clearer visualization of scale and genome coordinates. The previous images have been replaced with the improved versions in the revised manuscript.

      Reviewer #2 (Recommendations for the authors):

      (1) The authors need to elaborate on what this sentence means, as it is very unclear what they are describing about Rhino residency: "The results show that Rhino in OSCs tends to reside in the genome where Rhino binds locally in the ovary (Fig. 1C)." 

      We apologize for the lack of clarity in the original sentence. The text has been revised as follows:

      “Rhino expressed in OSCs bound predominantly to genomic sites exhibiting sharp and interspersed Rhino localization patterns in the ovary, while showing little localization within broad Rhino domains, including major piRNA clusters.”

      In addition, to clarify the behavior of Rhino at broad domains, we have added the phrase “the terminal regions of broad domains, such as major piRNA clusters” to the subsequent sentence.

      (2) The red correlation line is very confusing in Figure 5F. What sort of line does this mean in this scatter plot? 

      We apologize for the lack of clarity regarding the red line in Fig. 5F. The red line represents the least-squares linear regression fit to the data points, calculated using the lm() function in R, and was added with abline() to illustrate the correlation between ctrl GLKD and DART4 GLKD values. In the revised figure, we have clarified this in the legend by specifying that it is a regression line.
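
      As an illustration of what such a regression line represents, the ordinary least-squares fit can be sketched as below. This is a minimal Python sketch, not the authors' R code (which, as stated above, used lm() and abline()); the function name and the toy values are hypothetical and are not data from the study.

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares fit y = slope * x + intercept."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2 points)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical values standing in for ctrl GLKD (x) and DART4 GLKD (y) signals.
ctrl = [1.0, 2.0, 3.0, 4.0]
dart4 = [1.5, 2.5, 3.5, 4.5]
slope, intercept = least_squares_line(ctrl, dart4)
```

      A slope close to 1 with a small intercept would indicate that the two conditions scale together; the line summarizes the trend and does not by itself quantify the strength of the correlation.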

      (3) There is no confirmation of the successful knockdown of the various DARTs in the OSCs.

      We thank the reviewer for the comment. The knockdown efficiency of the various DARTs in OSCs was confirmed by RT–qPCR. The data are now shown in Supplementary Fig. 3J. 

      (4) What is the purpose of an unnumbered "Method Figure" in the supplementary data file? Why not just give it a number and mention it properly in the text? 

      We thank the reviewer for the suggestion. We have now assigned a number to the previously unnumbered "Method Figure" and have included it as Supplementary Fig. 9.

      The figure is now properly cited in the Methods section.

      (5) For Figure 5A, those fly strain numbers in the labels are better reserved in the Methods, and a more appropriate label is to describe the GAL4 driver and the UAS-RNAi construct by their conventional names.

      We thank the reviewer for the suggestion. The labels in Fig. 5A have been updated to use the conventional names of the GAL4 drivers and UAS-RNAi constructs. Specifically, they now read Ctrl GLKD (nos-GAL4 > UAS-emp) and DART4 GLKD (nos-GAL4 > UAS-DART4). The original fly strain numbers are listed in the Methods section.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This useful study presents the potentially interesting concept that LRRK2 regulates cellular BMP levels and their release via extracellular vesicles, with GCase activity further modulating this process in mutant LRRK2-expressing cells. However, the evidence supporting the conclusions remains incomplete, and certain statistical analyses are inadequate. This work would be of interest to cell biologists working on Parkinson's disease.

      Reviewer #1 (Public review):

      Summary:

      Even though mutations in LRRK2 and GBA1 (which encodes the protein GCase) increase the risk of developing Parkinson's disease (PD), the specific mechanisms driving neurodegeneration remain unclear. Given their known roles in lysosomal function, the authors investigate how LRRK2 and GCase activity influence the exocytosis of the lysosomal lipid BMP via extracellular vesicles (EVs). They use fibroblasts carrying the PD-associated LRRK2-R1441G mutation and pharmacologically modulate LRRK2 and GCase activity.

      Strengths:

      The authors examine both proteins at endogenous levels, using MEFs instead of cancer cells. The study's scope is potentially interesting and could yield relevant insights into PD disease mechanisms.

      Weaknesses:

      Many of the authors' conclusions are overstated and not sufficiently supported by the data. Several statistical errors undermine their claims. Pharmacological treatment is very long, leading to potential off-target effects. Additionally, the authors should be more rigorous when using EV markers.

      We thank the reviewer for these valuable observations. In the revised manuscript, we have addressed each of these points as follows:

      (1) Conclusions and data support – We carefully revised our text throughout the manuscript to ensure that all conclusions are better supported by the presented data. For instance, we now explicitly state that while pharmacological modulation supports the regulatory role of LRRK2 activity in EV-mediated BMP release, we have softened our conclusions concerning the contribution of GCase in this model (see revised Results and Discussion sections).

      (2) Statistical analyses – We reanalyzed experiments involving more than two groups and replaced simple t-tests with non-parametric Kruskal-Wallis tests followed by Dunn’s post hoc comparisons. This approach, described in the updated figure legends (e.g., Figure 2D-F and H-J), provides a more rigorous statistical framework that accounts for small sample sizes and variability typical of EV quantifications.

      (3) Pharmacological treatment duration – Prolonged MLi-2 treatments have been extensively used in the field without evidence of significant off-target effects. Several studies, including Fell et al. (2015, J Pharmacol Exp Ther 355:397-409), De Wit et al. (2019, Mol Neurobiol 56:5273-5286), Ho et al. (2022, NPJ Parkinson’s Dis 8:115), Tengberg et al. (2024, Neurobiol Dis 202:106728), and Jaimon et al. (2025, Sci Signal 18:eads5761), have applied long-term (24-48 h) MLi-2 treatments at comparable concentrations without detecting toxicity or off-target alterations, including in MEFs (Ho et al., 2022; Dhekne et al., 2018, eLife 7:e40202). In our study, 48-hour incubations were necessary to sustain full LRRK2 inhibition throughout the extracellular vesicle (EV) collection period. EV biogenesis, BMP biosynthesis, and packaging into EVs are time-dependent processes; therefore, extended incubation and collection periods (48 h) were required to allow downstream effects of LRRK2 inhibition on BMP production and release to manifest, and to obtain sufficient EV material for biochemical and lipidomic analyses. This experimental design also reflects our and others’ previous observations in humans and non-human primates, where urinary BMP changes are associated with chronic or subchronic LRRK2 inhibitor treatment (Baptista MAS, Merchant K, et al. Sci Transl Med. 2020, 12:eaav0820; Jennings D, et al. Sci Transl Med. 2022, 14:eabj2658; Maloney MT, et al. Mol Neurodegener. 2025, 20:89). Importantly, under these conditions, we did not observe significant changes in cell viability or morphology, indicating that the treatment was well tolerated. We have clarified this rationale in the revised Methods section to emphasize that the prolonged incubation reflects the experimental design for EV isolation rather than a requirement for achieving LRRK2 inhibition.

      (4) EV markers – We and others have reported enrichment of Flotillin-1 and LAMP proteins in isolated small EV fractions (Kowal et al., 2016; Lu et al., 2018; Mathieu et al., 2021; Ferreira et al., 2022). Moreover, LAMP proteins have been reported to be more enriched in EVs of endolysosomal origin (Mathieu et al., 2021). To further strengthen this point, we performed new experiments using a CD63-pHluorin sensor combined with TIRF microscopy, which allowed real-time visualization of CD63-positive exosome release. These new data (now presented in Figure 7, Panels G-I; Videos 1 and 2) confirm increased CD63-positive EV release in LRRK2 mutant fibroblasts, which was reversed by LRRK2 inhibition with MLi-2. The CD63-positive compartment was also largely BMP-positive (new Figure 7D, F, G), reinforcing our conclusions and providing additional rigor in EV marker validation.
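
      To make the statistical reanalysis described in point (2) more concrete, the Kruskal-Wallis H statistic for three or more groups can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' analysis pipeline: the function names are hypothetical, the p-value helper applies only to exactly three groups (chi-square approximation with 2 degrees of freedom, which is rough for small samples), and Dunn's post hoc comparisons are not implemented here.

```python
import math


def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Return the tie-corrected Kruskal-Wallis H statistic for a list of groups."""
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    weighted = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    # Tie correction: 1 - sum(t^3 - t) / (N^3 - N), with t the size of each tie block.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(c ** 3 - c for c in counts.values())
    correction = 1 - tie_term / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction


def p_value_three_groups(h):
    """Chi-square approximation with 2 degrees of freedom (valid for three groups only)."""
    return math.exp(-h / 2)


# Hypothetical measurements for three conditions (not data from the study).
h_stat = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
p_approx = p_value_three_groups(h_stat)
```

      Because the test operates on ranks rather than raw values, it does not assume normally distributed measurements, which is the property that motivates its use over repeated t-tests for small, variable samples.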

      Reviewer #2 (Public review):

      Summary:

      In this paper, the authors used MEFs expressing the R1441G mutant of leucine-rich repeat kinase 2 (LRRK2), a mutant associated with the early onset of Parkinson's disease. They report that in these cells LAMP2 fluorescence is higher but BMP fluorescence is lower, MVE size is reduced, and that MVEs contain fewer ILVs. They also report that LAMP2-positive EVs are increased in mutant cells in a process sensitive to LRRK2 kinase inhibition but are further increased by glucocerebrosidase (GCase) inhibition, and that total di-22:6-BMP and total di-18:1-BMP are increased in mutant LRRK2 MEFs compared to WT cells by mass spectrometry. They also report that LRRK2 kinase inhibition partially restores cellular BMP levels, and that GCase inhibition further increases BMP levels, and that in EVs from the LRRK2 mutant, LRRK2 inhibition decreases BMP while GCase inhibition has the opposite effect. Moreover, they report that the BMP increase is not due to increased BMP synthesis, although the authors observe that CLN5 is increased in LRRK2 mutant cells. Finally, they report that GW4869 decreases EV release and exosomal BMP, while bafilomycin A1 increases EV release. They conclude that LRRK2 regulates BMP levels (in cells) and release (via EVs). They also conclude that the process is modulated by GCase in LRRK2 mutant cells, and that these studies may contribute to the use of BMP-positive EVs as a biomarker for Parkinson's disease and associated treatments.

      Strengths:

      This is an interesting paper, which provides novel insights into the biogenesis of exosomes with exciting biomedical potential. However, I have comments that authors need to address to clarify some aspects of their study.

      Weaknesses:

      (1) The intensity of LAMP2 staining is increased significantly in cells expressing the R1441G mutant of LRRK2 when compared to WT cells (Figure 1C). Yet mutant cells contain significantly smaller MVEs with fewer ILVs, and the MVE surface area is reduced (Figure 1D-F). This is quite surprising since LAMP2 is a major component of the limiting membrane of late endosomes. Are other proteins of endo-lysosomes (e.g., LAMP1, CD63, RAB7) or markers (LysoTracker) also decreased (see also below)?

      As referenced in our original manuscript, several previous studies have reported endolysosomal morphological and homeostatic defects in cells harboring pathogenic LRRK2 mutations. LAMP2 can be upregulated as part of a lysosomal biogenesis or stress response (e.g., via MiT/TFE transcription factors such as TFEB; Sardiello et al., Science 2009, 325:473-477), whereas ILV biogenesis is primarily controlled by ESCRT- and SMPD3-dependent pathways that are regulated independently of MiT/TFE-driven transcriptional programs. Indeed, Stuffers et al. (Traffic 2009, 10:925-937) demonstrated that depletion of key ESCRT subunits markedly inhibited ILV formation while concomitantly increasing LAMP2 expression, highlighting the mechanistic dissociation between LAMP2 abundance and ILV number. In our study, we observed a similar pattern in R1441G LRRK2 MEFs, in which elevated LAMP2 staining and protein levels occurred despite a reduction in MVE size and ILV number. We interpret this as a compensatory lysosomal biogenesis response.

      Our revised manuscript now includes new immunofluorescence data for BMP, LAMP1 and CD63 (New Figure 7, Panels A-F) together with biochemical analysis of CD63 protein levels (New Supplemental Figure 4, Panel B) in human skin fibroblasts derived from healthy donors and LRRK2 G2019S PD patients. Quantitative analysis of these experiments revealed no statistically significant differences in total cellular levels of either LAMP1 or CD63 between groups. However, we observed a consistent decrease in BMP immunostaining intensity (New Figure 7, Panel A and B), in agreement with our findings in mouse fibroblasts. We therefore propose that the elevated LAMP2 expression observed in the engineered MEF clone expressing R1441G may reflect a cell type-specific effect, potentially linked to differential penetrance of LRRK2 signaling on the lysosomal biogenesis response. We have updated the Results and Discussion section of the manuscript to incorporate and clarify these findings.

      (2) LRRK2 has been reported to interact with endolysosomal membranes. Does the R1441G mutant bind LAMP2- and/or BMP-positive membranes? 

      We agree that LRRK2 has been reported to associate dynamically with endolysosomal membranes, particularly under conditions of endolysosomal stress or damage (Eguchi T, et al. PNAS 2018, 115:E9115-E9124; Bonet-Ponce L, et al. Sci Adv. 2020, 6:eabb2454; Wang X, et al. Elife. 2023, 12:e87255).

      Nevertheless, to explore whether LRRK2 associates with BMP-positive endolysosomes, we performed subcellular fractionation followed by biochemical analysis of endolysosomal fractions, since our available LRRK2 antibodies did not provide reliable immunofluorescence signals. These experiments were carried out using human skin fibroblasts derived from both healthy controls and Parkinson’s disease patients carrying the LRRK2-G2019S mutation. In both control and mutant fibroblasts, a pool of LRRK2 was detected in fractions positive for the BMP synthase CLN5 and the endolysosomal marker CD63 (New Supplementary Figure 4, Panel A), supporting the localization of LRRK2 to endolysosomal membranes that are likely BMP-enriched. Our manuscript’s Results and Methods sections have been updated accordingly.

      Does the mutant affect endolysosomes?

      As referenced in our original manuscript, several studies have reported that pathogenic LRRK2 mutations can lead to endolysosomal defects. Consistent with these reports, we also observed morphological alterations in endolysosomes of cells expressing mutant LRRK2, including reduced MVE size and fewer ILVs, as shown in Figure 1D–F. These observations are in agreement with previously described phenotypes associated with pathogenic LRRK2 variants. Furthermore, in mutant LRRK2 MEFs, and now in human-derived fibroblasts (see new Figure 7, Panel A and B), we observed a decrease in BMP immunostaining signal.

      (3) Immunofluorescence data indicate that BMP is decreased in mutant LRRK2-expressing cells compared to WT (Figure 1A-B), but mass spec data indicate that di-22:6-BMP and di-18:1-BMP are increased (Figure 3). Authors conclude that the BMP pool detected by mass spec in mutant cells is less antibody-accessible than that present in wt cells, or that the anti-BMP antibody is less specific and that it detects other analytes. This is an awkward conclusion, since the IF signal with the antibody is lower (not higher): why would the antibody be less specific? Could it be that the antibody does not see all BMP isoforms equally well? Moreover, the observations that mutant cells contain smaller MVEs (Figure 1D-F) with fewer ILVs are consistent with the IF data and reduced BMP amounts. This needs to be clarified.

      As previously reported by us (Lu et al., J Cell Biol 2022;221:e202105060) and others (Berg AL, et al. Cancer Lett. 2023, 557:216090), discrepancies can occur between BMP levels detected by immunofluorescence and those quantified by mass spectrometry. This is because immunostaining reflects the pool of antibody-accessible BMP, whereas lipidomics measures the total cellular content of all BMP molecular species, irrespective of their distribution or accessibility.

      We agree that the anti-BMP antibody may not detect all BMP isoforms equally well. Differences in acyl chain composition (such as the degree of saturation or chain length) can alter the stereochemistry of BMP and, consequently, epitope accessibility to antibody binding.

      In addition, in a personal communication with Monther Abu-Remaileh (Stanford University), we were informed that the antibody may also cross-react with other lipid species in endolysosomes. Nevertheless, since there is no formal evidence supporting this, we have removed the sentence in the Discussion section stating “Alternatively, the antibody may also detect non-BMP analytes” to avoid any potential misinterpretations. In its place, we have added a short statement noting that “not all BMP isoforms may be detected equally well”.

      Mass spectrometry data are only shown for two BMP species (di-22:6, di-18:1). What are the major BMP isoforms in WT cells? The authors should show the complete analysis for all BMP species if they wish to draw quantitative conclusions about the amounts of BMP in wt and mutant cells. Finally, BMP and PG are isobaric lipids. Fragmentation of BMPs or PGs results in characteristic fingerprints, but the presence of each daughter ion is not absolutely specific for either lipid. This should be clarified, e.g., were BMP and PG separated before mass spec analysis? Was PG affected? The authors should also compare the BMP data with mass spec data obtained with a control lipid, e.g., PC.

      Regarding BMP isoforms, our targeted UPLC-MS/MS analyses revealed that 2,2′-di-22:6-BMP (sn2/sn2′) and 2,2′-di-18:1-BMP (sn2/sn2′) are the predominant BMP isoforms in MEF cells, consistent with previous reports showing docosahexaenoyl (22:6; DHA) and oleoyl (18:1) BMP as the most abundant isoforms. Across diverse mammalian cells and tissues, BMP typically exhibits a fatty acid composition dominated by oleoyl, with polyunsaturated fatty acids (particularly DHA) also contributing substantially. Enrichment of DHA-containing BMP species has been observed in multiple systems, including rat uterine stromal cells, PC12 cells, THP-1 and RAW macrophages, as well as in rat and human liver. This consistent presence of oleoyl- and docosahexaenoyl-containing BMP species across tissues indicates that these acyl chains are conserved features influencing the lipid’s structural and functional characteristics (Kobayashi et al. J Biol Chem, 2002; Hullin-Matsuda et al. Prostaglandins Leukotrienes Essent Fatty Acids, 2009; Thompson et al. Int J Toxicol. 2012; Delton-Vandenbroucke et al. J Lipid Res, 2019).

      Nevertheless, we have included a Table (Panel H in updated Supplemental Figure 1) showing other BMP species that were also detected in our lipidomics analysis. Overall, dioleoyl (18:1)- and di-docosahexaenoyl (22:6)-BMP species were the most abundant in MEF cells, whereas di-arachidonoyl (20:4)- and di-linoleoyl (18:2)-BMP isoforms were present at lower levels. Consistently, R1441G LRRK2 MEFs displayed higher levels of dioleoyl- and di-docosahexaenoyl-BMP compared with WT cells, and these elevations were reduced following LRRK2 kinase inhibition with MLi-2. Data from three independent representative experiments are shown, and the manuscript has been revised accordingly to include these results.

      Regarding the separation of BMP and PG species, we confirm that BMP and PG were chromatographically resolved prior to MS/MS detection using a validated UPLC-MS/MS method developed by Nextcea, Inc. PG exhibits a substantially longer LC retention time than BMP, ensuring complete baseline separation. This approach (established by Nextcea nearly two decades ago and later validated through a multi-year collaboration with the U.S. FDA to clinically qualify di-22:6-BMP as a biomarker) prevents any ambiguity arising from the isobaric nature of BMP and PG species. No changes in PG levels were detected under any experimental conditions.

      Finally, we employed isotope-labeled BMP as an internal standard to ensure robust normalization across samples. These additional details and references cited above have been included in the revised Methods and References sections to further clarify the analytical rigor of our lipidomics workflow.

      (4) It is quite surprising that the amounts of labeled BMP continue to increase for up to 24h after a short 25min pulse with heavy BMP precursors (Figure 4B).

      In these isotope-labeling experiments, it is important to note (as described in our original manuscript) that two distinct pools of metabolically labeled BMP species were detected: semi-labeled BMP (with only one heavy isotope-labeled fatty acyl chain) and fully-labeled BMP (with both fatty acyl chains labeled). We consider the fully-labeled BMP pool to provide the most reliable readout for BMP turnover, as it showed a rapid decline after a 1h chase (decreasing by more than 50% within 8 h in all conditions), reaching its lowest levels at the end of the 48-h chase period.

      The apparent increase in semi-labeled BMP species over time may be explained by continued incorporation of labeled precursors following the initial pulse. Specifically, once existing semi-labeled and fully-labeled BMP molecules are degraded by PLA2G15 (Nyame K, et al. Nature 2025, 642:474-483), the resulting isotope-labeled lysophosphatidylglycerol (LPG) and fatty acids could be recycled and re-enter a new round of BMP biosynthesis, leading to a gradual accumulation of semi-labeled BMP such as di-18:1-BMP. Why would this reasoning not also apply to the fully-labeled species? Once the pulse is completed, newly incorporated non-labeled fatty acyl chains present in the cellular pool can compete with labeled ones during subsequent rounds of lipid remodeling or synthesis. As a result, the probability of generating semi-labeled BMP molecules becomes higher than that of forming fully-labeled species. Consistent with this, our data show an increase in only semi-labeled BMP species (but not in fully-labeled ones) up to 24 hours after the pulse. We have added a clarification regarding this point in the revised manuscript.
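
      The probabilistic argument above can be illustrated with a deliberately simple model. As an assumption made only for this sketch (it is not a model stated or fitted in the study), suppose each of the two acyl chains of a newly made BMP molecule is drawn independently from a fatty acid pool in which a fraction p is isotope-labeled. The function name and the example values of p are hypothetical.

```python
def bmp_labeling_probabilities(p_labeled):
    """Return labeling-class probabilities for a two-chain lipid.

    Assumes each acyl chain is drawn independently from a pool in which a
    fraction `p_labeled` of the chains carries the heavy isotope label.
    """
    if not 0.0 <= p_labeled <= 1.0:
        raise ValueError("p_labeled must lie between 0 and 1")
    p = p_labeled
    return {
        "fully_labeled": p * p,            # both chains labeled
        "semi_labeled": 2 * p * (1 - p),   # exactly one chain labeled
        "unlabeled": (1 - p) * (1 - p),    # neither chain labeled
    }


# After the pulse, unlabeled chains dilute the pool, so p falls over time.
during_pulse = bmp_labeling_probabilities(0.5)
late_chase = bmp_labeling_probabilities(0.1)
```

      Under this assumption the semi-labeled class outnumbers the fully-labeled class whenever p is below 2/3, and the ratio between the two grows as p decreases, which is the qualitative behavior invoked to explain why only semi-labeled species accumulate during the chase.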

      (5) It is argued that upregulation of CLN5 may be due to an overall upregulation of lysosomal enzymes, as LAMP2 levels were also increased (Figure 2A, C, E). Again, this is not consistent with the observed decrease in MVE size and number (Figure 1D-F). As mentioned above, other independent markers of endo-lysosomes should be analyzed (e.g., LAMP1, CD63, RAB7), and/or other lysosomal enzymes (e.g., cathepsin D).

      Our revised manuscript now includes new immunofluorescence data for BMP, LAMP1 and CD63 (New Figure 7, Panels A-F) together with biochemical analysis of CD63 protein levels (New Supplemental Figure 4, Panel B) in human skin fibroblasts derived from healthy controls and LRRK2 G2019S PD patients. Quantitative analysis of these experiments revealed no statistically significant differences in total cellular levels of either LAMP1 or CD63 between groups. However, our results consistently show increased CLN5 protein levels in both mouse and human fibroblast cell lines harboring pathogenic LRRK2 mutations. Upregulation of CLN5 may reflect a compensatory effect from loss of BMP via EV exocytosis. As discussed above, the elevated LAMP2 signal observed in the engineered MEF clone expressing R1441G could represent a cell type-specific effect, potentially linked to differential penetrance of LRRK2 signaling on the lysosomal biogenesis response. Our Results and Discussion sections have been updated accordingly.

      (6) The authors report that the increase in BMP is not due to an increase in BMP synthesis (Figure 4), although they observe a significant increase in CLN5 (Figure 5A) in LRRK2 mutant cells. Some clarification is needed.

      In our original manuscript, we proposed that although CLN5 protein levels are increased in R1441G LRRK2 MEFs, the absence of significant changes in BMP synthesis rates (Figure 4B, C) may reflect either limited substrate availability or that CLN5 is already operating near its maximal enzymatic capacity. Our new subcellular fractionation data (new Figure 7, Panel A) further indicate that, despite a relative increase in total CLN5 levels in G2019S LRRK2 human fibroblasts, the amount of CLN5 associated with endolysosomes remains comparable between mutant LRRK2 and control cells. This suggests that a considerable fraction of upregulated CLN5 may not localize to endolysosomes, potentially accumulating in the endoplasmic reticulum due to enhanced translation or impaired trafficking. Unfortunately, the available anti-CLN5 antibody did not yield reliable immunofluorescence signals, preventing us from directly confirming this possibility. Nevertheless, in light of our new data (new Supplemental Figure 4A), we have included a clarification in the revised manuscript discussing this possibility as well.

      (7) Authors observe that both LAMP2 and BMP are decreased in EVs by GW4869 and increased by bafilomycin (Figure 6). Given my comments above on Figure 1, it would also be nice to illustrate/quantify the effects of these compounds on cells by immunofluorescence.

      We appreciate the reviewer’s suggestion. We have previously published immunofluorescence data showing increased BMP accumulation in endolysosomes following treatment with bafilomycin A1 (Lu A, et al. J Cell Biol. 2009, 184:863-879). However, in the present study, our lipidomics analyses revealed a decrease in both di-22:6-BMP and di-18:1-BMP species in cells treated with this compound. As discussed above, this apparent discrepancy likely reflects methodological differences between immunofluorescence, which detects only antibody-accessible BMP pools, and lipidomics, which quantifies total cellular BMP content.

      Moreover, in a recent study (Andreu Z, et al. Nanotheranostics 2023, 7:1-21), BMP levels were analyzed by immunofluorescence in cells treated with spiroepoxide, a potent and selective irreversible inhibitor of nSMase (different from GW4869) known to block EV release. Spiroepoxide-treated cells showed decreased BMP immunostaining, a result that, again, does not align with mass spectrometry data revealing increased cellular BMP levels upon GW4869 treatment. Notably, in that study, spiroepoxide was used instead of GW4869 because the intrinsic autofluorescence of GW4869 could potentially interfere with the immunofluorescence BMP signal.

      We therefore consider lipidomics measurements to provide a more reliable and quantitative representation of BMP dynamics under these conditions.

      Reviewer #1 (Recommendations for the authors):

      Major concerns:

      (1) 48 h for MLi2 treatment seems too long. LRRK2 kinase activity is inhibited with much shorter incubation times. The longer the incubation, the more likely off-target effects are. The authors should repeat these experiments with 1-2 h of MLi2.

      We thank the reviewer for this valuable comment. We acknowledge that MLi-2 is a potent and selective LRRK2 kinase inhibitor that achieves near-complete target engagement within a few hours of treatment. However, prolonged exposure has been widely used in the field without evidence of significant off-target effects. Several studies, including Fell et al. (2015, J Pharmacol Exp Ther 355:397-409), De Wit et al. (2019, Mol Neurobiol 56:5273-5286), Ho et al. (2022, NPJ Parkinson’s Dis 8:115), Tengberg et al. (2024, Neurobiol Dis 202:106728), and Jaimon et al. (2025, Sci Signal 18:eads5761), have employed long-term (24-48 h) MLi-2 treatments at comparable concentrations without detecting toxicity or off-target alterations, including in MEFs (Ho et al., 2022; Dhekne et al., 2018, eLife 7:e40202).

      In our study, 48-hour incubations were necessary to sustain full LRRK2 inhibition throughout the extracellular vesicle (EV) collection period. EV biogenesis, BMP biosynthesis, and packaging into EVs are time-dependent processes; therefore, extended incubation and collection periods (48 h) were required to allow downstream effects of LRRK2 inhibition on BMP production and release to manifest, and to obtain sufficient EV material for biochemical and lipidomic analyses. This experimental design also reflects our and others’ previous observations in humans and non-human primates, where urinary BMP changes are associated with chronic or subchronic LRRK2 inhibitor treatment (Baptista MAS, Merchant K, et al. Sci Transl Med. 2020, 12:eaav0820; Jennings D, et al. Sci Transl Med. 2022, 14:eabj2658; Maloney MT, et al. Mol Neurodegener. 2025, 20:89). Importantly, under these conditions, we did not observe significant changes in cell viability or morphology, supporting that the treatment was well tolerated.

      We have clarified this rationale in the revised Methods section to emphasize that the prolonged incubation reflects the experimental design for EV isolation rather than a requirement for achieving LRRK2 inhibition.

      (2) Is there a reason why the authors don't include CD81, CD63, and Syntenin-1 in their study as EV markers? Using solely Flotillin-1 does not seem to be enough to justify their claims.

      We actually used not only Flotillin-1 but also LAMP2 as EV markers in our study. While both Flotillin-1 and LAMP2 detection on EVs may vary depending on the cell type, we and others have reported enrichment of Flotillin-1 and LAMP proteins in isolated small EV fractions (Kowal et al., 2016; Lu et al., 2018; Mathieu et al., 2021; Ferreira et al., 2022). In particular, one of these studies reported that “LAMP1-positive subpopulations of EVs represent MVB/lysosome-derived exosomes, which also contain syntenin-1.” Therefore, our choice of EV markers (LAMP2 and Flotillin-1) is consistent with those previously and reliably used to characterize small EVs.

      Nevertheless, to further address the reviewer’s concern, we performed additional experiments using a CD63-based fluorescence sensor (CD63-pHluorin), which, combined with TIRF microscopy, enables real-time visualization of CD63-positive exosome release. These experiments were conducted in control and LRRK2-mutant fibroblasts, and the data are presented in new Figure 7 (Panels G-I; Videos 1 and 2). We have also included all relevant references and clarified this point in the revised manuscript.

      (3) Indeed, to quantify the amount of certain proteins in EVs, the authors should normalize them by CD63 or CD81.

      Protein normalization in isolated EV fractions is indeed challenging. Although tetraspanins such as CD63 and CD81 are commonly enriched in EVs, their abundance can vary considerably across EV subpopulations, cell types, and experimental conditions, making them unreliable as universal normalization markers (Théry et al., J Extracell Vesicles, 2018; Margolis & Sadovsky, Nat Rev Mol Cell Biol, 2019). Current guidelines from the International Society for Extracellular Vesicles (ISEV), as described in the Minimal Information for Studies of Extracellular Vesicles 2018 (MISEV2018; Théry C, et al. J Extracell Vesicles. 2018, 7:1535750) and updated in MISEV2024 (Welsh JA, et al. J Extracell Vesicles. 2024, 13:e12404), recommend reporting multiple EV markers rather than relying on a single protein for normalization. They also suggest ensuring comparable experimental conditions by using the same number of cells at the start of the experiment and normalizing EV data to cell number or whole-cell lysate protein content at the end of the experiment, among other approaches.

      In our study, we normalized EV data to whole-cell lysate (WCL) protein content, as this approach accounts for differences in EV production due to variations in cell number or treatment conditions and is commonly used in the field (Kowal et al., PNAS, 2016; Mathieu et al., Nat Commun, 2021). We also included Flotillin-1 and LAMP2 as EV markers, both of which have been validated as molecular markers of small EV subpopulations.
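The normalization described in this response can be sketched in a few lines. This is a minimal illustration, not the authors' analysis script: the function name `normalize_to_wcl`, the sample labels, and all numbers are hypothetical, and the only assumption is that each EV band intensity is divided by the protein content of the matching whole-cell lysate.

```python
def normalize_to_wcl(ev_signal, wcl_protein_ug):
    """Express an EV band intensity per microgram of whole-cell lysate protein."""
    if wcl_protein_ug <= 0:
        raise ValueError("WCL protein content must be positive")
    return ev_signal / wcl_protein_ug


# Hypothetical values for illustration only (not data from the study).
samples = {
    "condition_a": {"ev_signal": 1200.0, "wcl_ug": 400.0},
    "condition_b": {"ev_signal": 2700.0, "wcl_ug": 450.0},
}

# EV signal per microgram of WCL protein for each condition.
normalized = {
    name: normalize_to_wcl(s["ev_signal"], s["wcl_ug"])
    for name, s in samples.items()
}
```

Dividing by WCL protein means that a condition yielding more cells does not appear to release more EV-associated protein simply because more cells were present at harvest.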

      (4) Hyper normalization in WB quantification in Figure 2E-G is statistically incorrect, as it assumes that one group (in this case, R1441G ctrl) has no variability at all, which is not biologically possible. The authors should repeat the quantification without hypernormalizing one of their groups. This issue is prevalent across the whole manuscript.

      We understand the concern regarding “hyper-normalization” (i.e., expressing all values relative to one condition set to 1), which may mask variability in the reference group. However, it is standard practice in immunoblotting analysis to express data relative to a control condition for comparison, as variations in membrane transfer, exposure time, and signal development can differ across blots. In our case, the data are expressed as relative levels (arbitrary units) rather than absolute quantitative values. To facilitate comparison between datasets and account for inter-experimental variation, we continued to express values relative to the mutant LRRK2 MEF condition.

      On the other hand, in lipidomics experiments, despite using the same number of seeded cells and identical extraction and analysis protocols, minor biological and technical variability was observed across independent replicates. This variability is inherent to the experimental system and is now explicitly represented in the new table included in Supplemental Figure 1F, which compiles three independent representative lipidomics experiments showing quantitative BMP levels across different conditions.
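The point under discussion in comment (4), expressing all values relative to one condition set to 1, can be made concrete with a short sketch. The numbers and the two helper functions are hypothetical and serve only to show how the choice of reference affects the spread of the reference group; neither is presented as the quantification used in the manuscript.

```python
from statistics import mean, stdev

# Hypothetical raw band intensities from three independent blots.
raw = {
    "reference": [100.0, 150.0, 80.0],
    "treated": [50.0, 90.0, 32.0],
}


def relative_to_reference_per_blot(raw, ref="reference"):
    """Divide every value by the reference value from the same blot."""
    return {g: [v / r for v, r in zip(vals, raw[ref])] for g, vals in raw.items()}


def relative_to_reference_mean(raw, ref="reference"):
    """Divide every value by the mean of the reference group across blots."""
    ref_mean = mean(raw[ref])
    return {g: [v / ref_mean for v in vals] for g, vals in raw.items()}


per_blot = relative_to_reference_per_blot(raw)
pooled = relative_to_reference_mean(raw)

# Per-blot scaling fixes every reference value at exactly 1 (zero spread).
spread_per_blot = stdev(per_blot["reference"])
# Scaling to the reference mean keeps the reference group's spread.
spread_pooled = stdev(pooled["reference"])
```

The trade-off is visible in the two results: per-blot scaling removes blot-to-blot differences in transfer and exposure but leaves the reference group with no variance, whereas scaling to the reference mean keeps that variance but does not correct for blot-specific effects.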

      (5) The authors perform a t-test in Figure 2E-G when comparing more than 2 groups, which is wrong. The authors should use a two-way ANOVA as they are comparing genotype and treatment.

      We appreciate the reviewer’s comment and agree with this observation. The MLi-2 and CBE experiments were performed independently and in separate experimental runs; therefore, we have reanalyzed these datasets separately rather than combining them in a two-way ANOVA. To properly compare more than two groups within each dataset, we have now applied a Kruskal-Wallis test followed by an uncorrected Dunn’s post hoc test (Figure 2 D-F and H-J). This non-parametric approach is more appropriate for our data structure, as EV experiments are usually subject to high variability and immunoblot quantifications involving small sample sizes (n≈6) do not always meet the assumptions of normality or equal variance. The Kruskal-Wallis test does not assume normality or equal variances, making it more robust for small, variable biological datasets. The statistical analyses and figure legend have been updated in the revised manuscript accordingly.
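For readers unfamiliar with the procedure named above, the following is a minimal standard-library sketch of a Kruskal-Wallis test followed by uncorrected Dunn's pairwise comparisons. It is not the authors' analysis code; the group names and values are hypothetical, `kruskal_dunn` is an illustrative helper, and the closed-form p-value shown at the end is valid only for three groups (two degrees of freedom). "Uncorrected" here means that the pairwise p-values are not adjusted for multiple comparisons.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_dunn(groups):
    """Return (H, pairwise) where pairwise maps (a, b) to (z, two-sided p)."""
    names = list(groups)
    pooled = [v for name in names for v in groups[name]]
    n_total = len(pooled)
    ranks = average_ranks(pooled)

    rank_sums, sizes, start = {}, {}, 0
    for name in names:
        sizes[name] = len(groups[name])
        rank_sums[name] = sum(ranks[start:start + sizes[name]])
        start += sizes[name]

    # Tie term: sum of (t^3 - t) over groups of tied values.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_sum = sum(t ** 3 - t for t in counts.values())
    tie_factor = 1 - tie_sum / (n_total ** 3 - n_total)
    if tie_factor == 0:
        raise ValueError("all values are identical; the test is undefined")

    h = 12 / (n_total * (n_total + 1)) * sum(
        rank_sums[n] ** 2 / sizes[n] for n in names
    ) - 3 * (n_total + 1)
    h /= tie_factor

    # Dunn's z statistic on mean ranks, with the tie-corrected variance.
    var_base = n_total * (n_total + 1) / 12 - tie_sum / (12 * (n_total - 1))
    pairwise = {}
    for a, b in combinations(names, 2):
        diff = rank_sums[a] / sizes[a] - rank_sums[b] / sizes[b]
        z = diff / math.sqrt(var_base * (1 / sizes[a] + 1 / sizes[b]))
        pairwise[(a, b)] = (z, math.erfc(abs(z) / math.sqrt(2)))
    return h, pairwise


# Hypothetical measurements for three groups (illustration only).
groups = {
    "group_a": [0.8, 1.0, 1.1],
    "group_b": [1.3, 1.5, 1.7],
    "group_c": [2.0, 2.4, 2.9],
}
h_stat, pairwise = kruskal_dunn(groups)
# With three groups, H is compared to chi-square with 2 degrees of freedom,
# whose survival function has the closed form exp(-H / 2).
p_overall = math.exp(-h_stat / 2)
```

In practice a statistics package would normally be used; the hand-rolled version only makes the rank-based logic explicit, which is why the test does not depend on normality or equal variances.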

      In addition, since our CBE treatments yielded statistically non-significant data, we have softened our conclusions throughout the manuscript concerning the contribution of GCase activity to EV-mediated BMP release modulation.

      (6) There is a very strong reduction in flotillin-1 in R1441G cells vs WT (Figure 2G) in the EV fraction. That reduction is further exacerbated with MLi2, which likely means it is not kinase activity dependent. Can the authors comment on that?

      We agree with the reviewer that Flotillin-1 showed a different behavior compared with LAMP2 in these experiments. As recommended by the MISEV guidelines (Théry C, et al. J Extracell Vesicles. 2018, 7:1535750; Welsh JA, et al. J Extracell Vesicles. 2024, 13:e12404), it is important to analyze more than one EV-associated protein marker. We examined LAMP2, which, together with LAMP1, has been reported to be specifically enriched in EVs of endolysosomal origin (exosomes; Mathieu et al., Nat Commun. 2021, 12:4389). In contrast, Flotillin-1 is also associated with small EVs but may represent a distinct EV subpopulation from those positive for LAMP proteins (Kowal J, et al. PNAS 2016, 113:E968-E977).

      Nevertheless, the biochemical analysis of isolated EV fractions was complemented by our lipidomics data and, in the revised version, by TIRF microscopy analysis of exosome release in control and G2019S LRRK2 human fibroblasts (new Figure 7, Panels G-I; Videos 1 and 2). In this analysis, we confirmed increased exocytosis of CD63-pHluorin–positive endolysosomes in G2019S LRRK2 human fibroblasts compared to controls, an effect that was reversed by MLi-2 treatment. The CD63-pHluorin–positive compartment of these cells was also largely positive for BMP (new Figure 7G). Collectively, these findings further support the regulatory role of LRRK2 activity in EV-mediated BMP secretion.

      (7) In Figure 2C, the authors should express that the LAMP2-EV and flotillin-1 EV fractions from the WB are highly exposed. As presently presented, it is slightly misleading.

      We thank the reviewer for this comment. In EV preparations, the amount of protein recovered is typically very low. Therefore, although we loaded all the EV protein obtained from each sample, the immunoblots for LAMP2 and Flotillin-1 in EV fractions required longer exposure times to visualize clear signals across all conditions. We have now indicated in the corresponding figure legend that these EV blots are long-exposure blots to facilitate signal detection and avoid any potential misunderstanding.

      (8) If Figure 2C and D are from two different experiments, they should not be plotted together in Figure 2E-G. You cannot compare the effect of MLi2 vs CBE if done in completely different experiments.

      We appreciate the reviewer’s comment and agree with this observation. The MLi-2 and CBE experiments were performed independently and in separate experimental runs; therefore, we have reanalyzed these datasets separately rather than combining them in a two-way ANOVA. To properly compare more than two groups within each dataset, we have now applied a Kruskal-Wallis test followed by an uncorrected Dunn’s post hoc test (Figure 2 D-F and H-J). This non-parametric approach is more appropriate for our data structure, as EV experiments are usually subject to high variability and immunoblot quantifications involving small sample sizes (n≈6) do not always meet the assumptions of normality or equal variance. The Kruskal-Wallis test does not assume normality or equal variances, making it more robust for small, variable biological datasets. The revised statistical analyses and figure legends have been updated accordingly in the manuscript.

      (9) The authors state that "For the R1441G MEF cells, MLi-2 decreased EV concentration while CBE increased EV particles per ml, in agreement with the effects observed in our biochemical analysis." As Figure S1D shows no statistical significance, the authors don't have sufficient evidence to make this claim.

      We apologize for this overstatement. We have revised the text to clarify that, although the differences did not reach statistical significance, a consistent trend toward decreased EV concentration upon MLi-2 treatment and increased EV release following CBE treatment was observed in R1441G MEF cells.

      (10) "Altogether, given that BMP is specifically enriched in ILVs (which become exosomes upon release), the data presented above support our biochemical analysis (Figure 2C, D, F) and suggest a role for LRRK2 and GCase in modulating BMP release in association with LAMP2-positive exosomes from MEF cells." As Figure 3E shows no statistical difference of BMP on EVs upon CBE treatment, this sentence is not accurate and should be reframed. Furthermore, the authors claim an increase in EV-LAMP2 in R1441G cells compared to WT, however, the amount of BMP in EVs of R1441G cells vs WT is unchanged with a non-significant reduction. This contradiction does not support the authors' conclusions and really puts into question their whole model.

      We thank the reviewer for this observation. After reanalyzing our biochemical data from isolated EV fractions (see new Panels D-F and H-J) using an improved statistical approach, we found that although EV-associated LAMP2 levels were consistently elevated in untreated R1441G LRRK2 MEFs compared to WT cells, CBE treatment only produced a non-significant trend toward increased EV-associated LAMP2 compared to untreated R1441G LRRK2 cells. Accordingly, we have revised the sentence to read as follows:

      “Altogether, given that BMP is specifically enriched in ILVs (which become exosomes upon release), the data presented above support our biochemical analysis (Figure 2C, E, G, I) and suggest that LRRK2 activity regulates BMP release in association with LAMP2-positive exosomes, whereas GCase activity appears to have a more variable effect under the tested conditions.”

      We also agree with the reviewer that, in our MEF model, the amount of BMP in EVs of R1441G cells vs WT is unchanged with a non-significant reduction. However, pharmacological modulation supports our conclusion that BMP release is modulated by LRRK2 activity. Specifically, treatment with the LRRK2 inhibitor MLi-2 decreased EV-associated BMP and LAMP2 levels in R1441G LRRK2 MEFs, and our new data (new Figure 7, Panel G-I; Videos 1 and 2) show increased exocytosis of CD63-pHluorin–positive endolysosomes in G2019S LRRK2 human fibroblasts compared to controls, an effect that was reversed by MLi-2 treatment. The CD63-pHluorin–positive compartment of these cells was also largely positive for BMP (new Figure 7G).

      In light of the reviewer’s comment about CBE treatment, we have softened our conclusions throughout the manuscript concerning the contribution of GCase activity in this model.

      (11) In Figure 5, 16 h of MLi2 treatment is too long and can lead to off-target effects. I would advise reducing it to 1-4 h.

      Prolonged MLi-2 treatments have been extensively used in the field without evidence of significant off-target effects. Several studies, including Fell et al. (2015, J Pharmacol Exp Ther 355:397-409), De Wit et al. (2019, Mol Neurobiol 56:5273-5286), Ho et al. (2022, NPJ Parkinson’s Dis 8:115), Tengberg et al. (2024, Neurobiol Dis 202:106728), and Jaimon et al. (2025, Sci Signal 18:eads5761), have applied long-term (24-48 h) MLi-2 treatments at comparable concentrations without detecting toxicity or off-target alterations, including in MEFs (Ho et al., 2022; Dhekne et al., 2018, eLife 7:e40202). Moreover, the data presented in Figure 5 demonstrate a reduction in CLN5 protein levels in both MEFs and human fibroblasts following MLi-2 treatment, confirming the specificity of the observed effects in LRRK2 mutant cells.

      (12) "Our data suggest that BMP is exocytosed in association with EVs and that LRRK2 and GCase activities modulate BMP secretion." Again, cells carrying the R1441G mutation have the same amount of BMP in EVs as WT. This sentence is not factually accurate. Accordingly, CBE did not change the amount of BMP in EVs.

      We thank the reviewer for this observation and agree that, in our MEF model, the amount of BMP in EVs from R1441G LRRK2 cells is comparable to that observed in WT cells. However, pharmacological modulation supports our conclusion that BMP release is modulated by LRRK2 activity. Specifically, treatment with the LRRK2 inhibitor MLi-2 decreased EV-associated BMP levels in R1441G LRRK2 MEFs, and our new data (new Figure 7G-I; Videos 1 and 2) show increased exocytosis of CD63-pHluorin–positive endolysosomes in G2019S LRRK2 human fibroblasts compared to controls, an effect that was reversed by MLi-2 treatment. The CD63-pHluorin–positive compartment of these cells was also largely positive for BMP (new Figure 7G). These findings further support the regulatory role of LRRK2 activity in EV-mediated BMP secretion. In addition, in light of the reviewer’s comment about CBE treatment, we have softened our conclusions throughout the paper concerning the contribution of GCase activity in this model.

      (13) Figure 6; EV release should have been monitored by more accurate markers such as CD63 and CD81.

      We thank the reviewer for this comment. We and others (Kowal et al., 2016; Lu et al., 2018; Mathieu et al., 2021; Ferreira et al., 2022) have reported enrichment of Flotillin-1 and LAMP proteins in isolated small EV fractions. In particular, one of these studies (Mathieu et al., Nat Commun. 2021), in which bafilomycin A1 was also used (to boost exosome release), reported that “LAMP1-positive subpopulations of EVs represent MVB/lysosome-derived exosomes, which also contain syntenin-1.” Altogether, our choice of EV markers (LAMP2 and Flotillin-1) is consistent with those previously and accurately used to characterize EVs. We have now included all relevant references in the revised manuscript to further clarify this point.

      (14) Figure 6 suggests that exosomal BMP is controlled by EV release. I would think that is rather obvious.

      We agree that the finding that exosomal BMP release is influenced by EV secretion may appear “obvious.” However, our intention in Figure 6 was to provide direct experimental evidence confirming this relationship using pharmacological modulators of EV release. Specifically, inhibition of EV secretion with GW4869 reduced exosomal BMP levels, whereas stimulation with bafilomycin A1 increased them. These data were important to establish a causal link between EV trafficking and BMP export, thereby validating our model and supporting the interpretation that LRRK2 regulates BMP homeostasis through EV-mediated exocytosis, which is further modulated, to some extent, by GCase activity. 

      Minor concerns:

      (1) Figure 1: Change colors to be color blind friendly.

      We thank the reviewer for this helpful suggestion. We have adjusted the colors in Figure 1 to be color-blind friendly. In addition, we have applied the same color-blind friendly palette to the new immunofluorescence data presented in new Figure 7, Panel A and D.

      (2) More consistency on "Xmin" vs "X min" would be appreciated.

      We thank the reviewer for this observation. We have revised the manuscript to ensure consistent formatting of time indications throughout the text and figures, using the standardized format “X min.”

      Reviewer #2 (Recommendations for the authors):

      (1) Figure 2C-D. Were equal amounts of protein loaded in each lane?

      Equal protein amounts were loaded in lanes corresponding to whole-cell lysate (WCL) fractions and normalized based on α-Tubulin levels.

      For the extracellular vesicle (EV) fractions, all protein recovered from EV pellets after isolation was loaded. In all EV-related experiments, we seeded the same number of EV-producing cells per condition, and the resulting EV-derived data (from both immunoblotting and lipidomics analyses) were normalized to the corresponding whole-cell lysate (WCL) protein content to ensure comparability across conditions.

      All these technical details have been included in the Materials section of our revised manuscript.

      (2) The authors refer to the papers of Medoh et al (ref 43) and Singh et al. (44) for the key role of CLN5 in the BMP biosynthetic pathway. However, Medoh et al reported that CLN5 is the lysosomal BMP synthase. In contrast, Singh et al. reported that PLD3 and PLD4 mediate the synthesis of SS-BMP, and did not find any role for CLN5. 

      To avoid any confusion or misinterpretation of our findings regarding CLN5 and given that we do not analyze PLD3 or PLD4 in our study, we have decided to replace the reference to Singh et al. with Bulfon D. et al. (Nat. Commun. 2024, 15:9937) instead. This last work, conducted by an independent group distinct from the one that originally described CLN5, also validated CLN5 as the sole BMP synthase in cells.

      Also, authors mention that bafilomycin A1 (B-A1) dramatically boosts EV exocytosis, referring to Kowal et al., 2016 (ref 35) and Lu et al., 2018 (ref 45). However, this is not shown in Kowal et al.

      We thank the reviewer for pointing out this mistake. We apologize for the incorrect citation and have now corrected the reference. The statement regarding the effect of bafilomycin A1 on EV exocytosis now appropriately refers to Mathieu et al., 2021 and Lu et al., 2018.

      (3) Page 7, it is stated that "No statistically significant differences in intracellular BMP levels were observed in WT LRRK2 MEFs upon LRRK2 or GCase inhibition (Supplemental Figure 1D, E)". The authors probably mean "Supplemental Figure 1F, G"

      We thank the reviewer for noting this error. We have corrected the text to refer to panels F and G of Supplemental Figure 1, which correspond to the relevant data. We have also revised the reference to panel I of Supplemental Figure 1 accordingly.

    1. Author response:

      eLife Assessment

      This useful study raises interesting questions but provides inadequate evidence of an association between atovaquone-proguanil use (as well as toxoplasmosis seropositivity) and reduced Alzheimer's dementia risk. The findings are intriguing but they are correlative and hypothesis-generating with the strong possibility of residual confounding.

      We thank the editors and reviewers for characterizing our work as useful and for the opportunity to publish a Reviewed Preprint with a corresponding response. However, the statements in the Assessment characterizing the evidence as ‘inadequate’ and asserting a ‘strong possibility of residual confounding’ are factually incorrect as applied to our data and incompatible with the empirical findings presented in the manuscript. We have notified the editors of this factual inaccuracy. As the Assessment will be published as originally written, we provide clarification here to ensure an accurate scientific record for readers of the Reviewed Preprint.

      Our study shows that the association between atovaquone–proguanil (A/P) exposure and reduced dementia risk, first identified in a rigorously matched national cohort in Israel, is robustly reproduced across three independently constructed age-stratified cohorts in the U.S. TriNetX network (with exposure at ages 50–59, 60–69, and 70–79). In each cohort, individuals exposed to A/P were compared with rigorously matched individuals who received another medication at the same age and were then followed over a decade for incident dementia. Cases and controls were matched on all major established dementia risk factors: age, sex, race/ethnicity, diabetes, hypertension, obesity, and smoking status.

      Across all three strata, each containing more than 10,000 exposed individuals with an equal number of matched controls, we observed substantial and consistent reductions in cumulative dementia incidence (HR 0.34–0.51), extremely low P-values (10⁻¹⁶ to 10⁻⁴⁰), and continuously widening divergence of Kaplan–Meier curves over the follow-up period. To more rigorously exclude the possibility of unmeasured baseline differences in health status, we additionally performed, for the purpose of this response, comparative analyses of key indicators of frailty and clinical utilization, including emergency and inpatient encounters, as well as the prevalence of mild cognitive impairment prior to medication exposure (values provided below in response to Reviewer #2, Weakness 1). These analyses provide clear evidence showing no pattern suggestive of exposed individuals being medically or cognitively healthier at baseline.
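As background for the Kaplan–Meier curves and cumulative incidence mentioned above, the estimator itself can be sketched briefly. This is a generic textbook version on hypothetical follow-up data, not the TriNetX implementation and not a reanalysis of the cohorts; the function name `kaplan_meier` and the toy inputs are assumptions made for illustration.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times:  follow-up time for each individual
    events: 1 if the outcome was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve


# Hypothetical follow-up (years) and outcome indicators for five individuals.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)

# Cumulative incidence at each event time is 1 - S(t).
cumulative_incidence = [(t, 1 - s) for t, s in curve]
```

Censored individuals leave the risk set without lowering the survival estimate, which is what allows cohorts with unequal follow-up to be compared on the same curve.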

      Taken together, these findings constitute a rigorously matched and independently replicated association across two national health systems, using TriNetX, the most widely cited real-world evidence platform in published cohort studies. Replication across three age strata, each with >10,000 exposed individuals, followed for a decade, and matched on all major known risk factors for dementia, meets the accepted epidemiologic definition of strong and reproducible evidence.

      Although we disagree with elements of the editorial Assessment that appear inconsistent with the empirical findings, we will proceed with publication of the current manuscript as a Reviewed Preprint in order to ensure timely dissemination of findings with meaningful implications for public health and dementia prevention. In this initial public version, the point-by-point responses below provide concise explanations addressing the critiques underlying the Assessment. A revised manuscript, incorporating expanded baseline comparisons across each TriNetX age stratum, additional stringent exclusions, and an expanded discussion that will address the remarks presented in this review, will be submitted shortly.

      Reviewer #1 (Public review):

      Summary:

      This useful study provides incomplete evidence of an association between atovaquone-proguanil use (as well as toxoplasmosis seropositivity) and reduced Alzheimer's dementia risk. The study reinforces findings that VZ vaccine lowers AD risk and suggests that this vaccine may be an effect modifier of A-P's protective effect. Strengths of the study include two extremely large cohorts, including a massive validation cohort in the US. Statistical analyses are sound, and the effect sizes are significant and meaningful. The CI curves are certainly impressive.

      Weaknesses include the inability to control for potentially important confounding variables. In my view, the findings are intriguing but remain correlative / hypothesis generating rather than causative. Significant mechanistic work needs to be done to link interventions which limit the impact of Toxoplasmosis and VZV reactivation on AD.

      We thank the reviewer for describing our study as useful and for highlighting several of its strengths, including the very large cohorts, sound statistical analyses, meaningful effect sizes, and the impressive CI curves. We also appreciate the reviewer’s recognition that our findings reinforce prior evidence linking VZV vaccination to reduced AD risk.

      Regarding the statement that the evidence remains incomplete due to “inability to control for potentially important confounding variables,” we refer to our introductory explanation above. As noted there, our analyses meet the accepted criteria for reproducible epidemiological evidence, and the assumption of uncontrolled confounding is contradicted by rigorous matching and by additional baseline evaluations. We fully agree that mechanistic work is warranted, and our epidemiologic findings strongly motivate such efforts.

      We address the reviewer’s specific comments in detail below.

      (1) Most of the individuals in the study received A-P for malaria prophylaxis as it is not first line for Toxo treatment. Many (probably most) of these individuals were likely to be Toxo negative (~15% seropositive in the US), thereby eliminating a potential benefit of the drug in most people in the cohort. Finally, A-P is not a first line treatment for Toxo because of lower efficacy.

      We agree that individuals in our cohort received Atovaquone-Proguanil (A-P) for malaria prophylaxis rather than for treatment of toxoplasmosis. However, this does not contradict our interpretation. Because latent CNS colonization by T. gondii is not currently considered clinically actionable, asymptomatic carriers are not offered treatment, and therefore would only receive an anti-Toxoplasma regimen unintentionally, through a medication prescribed for another indication such as malaria prophylaxis. Importantly, atovaquone is an established therapy for toxoplasmosis, including CNS disease, with documented efficacy and CNS penetration in current treatment guidelines. It is therefore reasonable to assume that, during the multi-week course typically administered for malaria prophylaxis, A-P would exert significant anti-Toxoplasma activity in individuals with latent CNS infection, potentially reducing or eliminating parasite burden even though the medication was not prescribed for that purpose.

      The reviewer notes that only ~15% of individuals in the U.S. are Toxoplasma-seropositive, based on surveys performed primarily in young adults of reproductive age (serologic testing is most commonly obtained in women during prenatal care). However, seropositivity increases cumulatively over the lifespan, and few reliable estimates exist for the age groups in which Alzheimer’s disease and dementia occur. Even if we accept the lower estimate of ~15% latent colonization in older adults, this proportion is still smaller than the lifetime cumulative incidence of dementia in the general population.

      Therefore, if latent toxoplasmosis contributes causally to dementia risk, and A-P is capable of eliminating latent Toxoplasma in the subset of individuals who harbor it, then a multi-week course of treatment—such as the one routinely taken for malaria prophylaxis—would be expected to produce a substantial reduction in dementia incidence at the population level, of the same order of magnitude reported here. A protective effect concentrated in a minority of exposed individuals is fully compatible with, and can mechanistically explain, the large overall reduction in risk that we observe.
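The population-level reasoning in this response can be written out as a small calculation. The sketch below is an illustrative model and not an analysis from the study. It assumes that a fraction of the population carries latent infection, that carriers have a dementia risk elevated by some relative risk over non-carriers, that treatment returns carriers to the non-carrier risk, and that a risk ratio can stand in for the reported hazard ratio. Both function names are hypothetical.

```python
def population_risk_ratio(carrier_fraction, carrier_relative_risk):
    """Risk ratio (treated vs untreated population) if treatment removes
    all excess risk in carriers and has no effect in non-carriers."""
    untreated = carrier_fraction * carrier_relative_risk + (1.0 - carrier_fraction)
    treated = 1.0  # everyone is at the non-carrier baseline after treatment
    return treated / untreated


def required_carrier_relative_risk(carrier_fraction, population_ratio):
    """Relative risk carriers would need for the model to yield the given
    population-level risk ratio."""
    return (1.0 / population_ratio - (1.0 - carrier_fraction)) / carrier_fraction


# Assumed inputs: 15% carriers (the figure quoted by the reviewer) and a
# population-level ratio of 0.5, which lies within the reported HR range.
rr_needed = required_carrier_relative_risk(0.15, 0.5)
ratio_back = population_risk_ratio(0.15, rr_needed)
```

Under these assumptions, a population-level ratio of 0.5 with 15% carriers corresponds to carriers having roughly 7.7 times the baseline risk; a higher carrier fraction in older adults would lower the relative risk required.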

      Finally, the reviewer notes that A-P is not a first-line treatment for toxoplasmosis due to assumed lower efficacy. This point does not undermine our results. Even a second-line agent, when administered over several weeks—as is routinely done for malaria prophylaxis—is expected to exert substantial anti-Toxoplasma activity. The long duration of exposure in large populations receiving A-P for travel provides a unique natural experiment that does not exist for other anti-Toxoplasma medications, which, when prescribed for their non-Toxoplasma indications, are not taken more than a few days. Thus, the widespread use of A-P for malaria prophylaxis allows a unique opportunity to evaluate long-term outcomes following inadvertent anti-Toxoplasma treatment.

      Moreover, “first line” recommendations in clinical guidelines refer to treatment of acute toxoplasmosis in immunosuppressed individuals, where tachyzoites are actively replicating. These guidelines do not consider efficacy against latent CNS colonization, which is dominated by bradyzoites, a biologically distinct form, in immunocompetent individuals. Therefore, the guideline hierarchy is not informative regarding which medication is more effective at clearing latent brain infection, the stage we consider most relevant to dementia risk.

      (2) A-P exposure may be a marker of subtle demographic features not captured in the dataset such as wealth allowing for global travel and/or genetic predisposition to AD. This raises my suspicion of correlative rather than causal relationships between A-P exposure and AD reduction. The size of the cohort does not eliminate this issue, but rather narrows confidence intervals around potentially misleading odds ratios which have not been adjusted for the multitude of other variables driving incident AD.

      We agree that prior to matching, A-P exposure may be associated with demographic features such as health or the propensity to travel internationally. However, this does not apply after matching. In all age-stratified analyses, exposed and control individuals were rigorously matched on all major risk factors known to influence dementia risk, including age, sex, race/ethnicity, smoking status, hypertension, diabetes, and obesity. Owing to the extremely large pool of individuals in TriNetX (~120M), our matching was performed stringently, producing exposed and unexposed cohorts that are near-identical with respect to the established determinants of dementia risk.

      The reviewer correctly identifies that large cohorts alone do not eliminate confounding; however, confounding must still be biologically and epidemiologically plausible. Any hypothetical confounder capable of producing a 50–70% reduction in dementia incidence over a decade would need to: (1) produce a very large protective effect against dementia; (2) be strongly associated with A-P exposure; and (3) remain entirely uncorrelated with age, sex, race/ethnicity, smoking, diabetes, hypertension and obesity, which have been rigorously matched. No such factor has been proposed. The suggestion that an unspecified ‘subtle demographic feature’ could produce effects of this magnitude remains hypothetical, and no such factor has been described in the dementia risk literature.

      If a specific evidence-supported confounder is proposed that meets these criteria, we would be pleased to test it empirically in our cohorts. In the absence of such a proposal, the interpretation that the association is merely “correlative rather than causal” remains speculative and does not negate the strength of a replicated, rigorously matched, long-term association across large cohorts in two national health systems.

      (3) The relationship between herpes virus reactivation and Toxo reactivation seems speculative.

      We respectfully disagree with the characterization of the herpesvirus–Toxoplasma interaction as speculative. The mechanism we describe is biologically valid, based on established virology and parasitology literature showing that latent T. gondii infection can reactivate from its bradyzoite state under inflammatory or immune-modifying conditions, including viral triggers. A published clinical report has documented CNS co-reactivation of T. gondii and a herpesvirus, explicitly noting that HHV-6 reactivation can promote Toxoplasma reactivation in neural tissue (Chaupis et al., Int J Infect Dis, 2016).

      Moreover, this mechanism is the only currently evidence-supported explanation that simultaneously and parsimoniously accounts for all of the epidemiologic observations in our study:

      (1) Substantially higher cumulative incidence of dementia in individuals with positive Toxoplasma serology, indicating that latent infection is a risk factor for subsequent cognitive decline;

      (2) Strong protective association following A-P exposure, a medication with established activity against Toxoplasma gondii, including in the CNS;

      (3) Independent protection conferred by VZV vaccination, observed consistently for two vaccines with distinct formulations (one live attenuated, one recombinant protein), whose only shared property is suppression of VZV reactivation;

      (4) Greater protective effect of A-P among individuals who were not vaccinated against VZV, consistent with a model in which dementia risk requires both herpesvirus reactivation and persistent latent Toxoplasma infection—such that reducing either factor alone (via VZV vaccination or anti-Toxoplasma suppression) substantially lowers risk.

      Taken together, these observations are difficult to reconcile under any alternative hypothesis.  

      To date, we are unaware of any other biologically coherent mechanism that can explain all four findings simultaneously. We would welcome any alternative explanation capable of accounting for these converging epidemiologic signals, as such a proposal could meaningfully advance the scientific discussion. In the absence of a competing explanation, the interaction between latent toxoplasmosis and herpesvirus reactivation remains the most parsimonious hypothesis supported by current knowledge.

      Finally, while observational studies are inherently limited in their ability to provide causal inference, the mechanism we propose is biologically grounded and experimentally testable. Our results provide a strong rationale for mechanistic studies and clinical trials, and warrant publication precisely because they generate a verifiable hypothesis that can now be evaluated directly.

      (4) A direct effect of A-P on AD lesions independent of infection is not considered as a hypothesis. Given the limitations above and effects on metabolic pathways, it probably should be. The Toxo hypothesis would be more convincing if the authors could demonstrate an enhanced effect of the drug in Toxo positive individuals with no effect in Toxo negative individuals.

      A direct effect of A-P on established AD lesions is indeed possible, and this hypothesis would be of significant therapeutic interest. However, we did not consider it within the scope of our epidemiologic analyses because all cohorts explicitly excluded individuals with existing dementia. Under these conditions, proposing a disease-modifying effect on established Alzheimer’s lesions based on our data would itself be speculative. Such a mechanism would be better evaluated by mechanistic or interventional studies than by inference from populations without baseline disease.

      We also agree that demonstrating a stronger protective effect among Toxoplasma-positive individuals would be informative. Unfortunately, this “natural experiment” cannot be performed using the available data: Toxoplasma serology is rarely ordered in older adults, and A-P exposure is itself uncommon, resulting in a cohort overlap far too small to yield valid statistical inference (n≈25 in TriNetX).

      Thus, while both proposed hypotheses are scientifically attractive and merit further study, neither can be resolved using currently available real-world clinical data. Our findings provide the rationale to investigate both hypotheses experimentally, and we hope our report will motivate such studies.

      Reviewer #2 (Public review):

      Summary:

      This manuscript examines the association between atovaquone/proguanil use, zoster vaccination, toxoplasmosis serostatus and Alzheimer's Disease, using 2 databases of claims data. The manuscript is well written and concise. The major concerns about the manuscript center around the indications of atovaquone/proguanil use, which would not typically be active against toxoplasmosis at doses given, and the lack of control for potential confounders in the analysis.

      Strengths:

      (1) Use of 2 databases of claims data.

      (2) Unbiased review of medications associated with AD, which identified zoster vaccination associated with decreased risk of AD, replicating findings from other studies.

      We thank the reviewer for the thoughtful assessment and for noting key strengths of our work, including (1) the use of two large national databases, and (2) the unbiased discovery approach that replicated the widely reported association between zoster vaccination and reduced Alzheimer’s disease (AD) risk. We agree that these features highlight the validity and reproducibility of the analytic framework.

      Below we respond to the reviewer’s perceived weaknesses.

      Weaknesses:

      (1) Given that atovaquone/proguanil is likely to be given to a healthy population who is able to travel, concern that there are unmeasured confounders driving the association.

      We agree that, prior to matching, A-P exposure may correlate with demographic or health-related differences (e.g., ability to travel). However, this potential bias was explicitly controlled for in the study design. Across all three age-stratified TriNetX cohorts, exposed and unexposed individuals were rigorously matched on all major established dementia risk factors: age, sex, race/ethnicity, smoking status, obesity, diabetes mellitus, and hypertension. Comparative analyses confirm that these risk factors are equivalently distributed at baseline.

      As noted in our response to Reviewer #1, for any hypothetical unmeasured confounder to explain the results, it would need to satisfy three conditions simultaneously:

      (1) Be capable of producing a 50–70% reduction in dementia incidence sustained over a decade and across three distinct age strata (ages 50–79);

      (2) Be strongly associated with likelihood of receiving A-P;

      (3) Remain entirely uncorrelated with age, sex, race/ethnicity, smoking, diabetes, hypertension, or obesity, all of which were rigorously matched and balanced at baseline.

      No such factor has been proposed in the literature or by the reviewer. Thus, the concern remains hypothetical and unsupported by any measurable demographic or biological mechanism.

      Importantly, empirical evidence contradicts the notion of a “healthy traveler” bias:

      Emergency and inpatient encounter rates prior to exposure were comparable between A-P users and controls. Across the three age-stratified cohorts, emergency visits were similar or slightly higher among A-P users (EMER: 19.6% vs 16.4%, 19.9% vs 14.2%, 22.0% vs 14.8%), and inpatient encounters were effectively equivalent (IMP: 14.8% vs 15.2%, 17.7% vs 17.6%, 22.1% vs 22.2%). These patterns directly contradict the suggestion that A-P users were a healthier or less medically burdened population at baseline.

      Prevalence of mild cognitive impairment was not lower among A-P users and was, in fact, slightly higher in the oldest cohort. Across the three age groups, baseline diagnoses of mild cognitive impairment (MCI) were comparable or slightly higher among exposed individuals (0.1% vs 0.1%, 0.3% vs 0.2%, 1.1% vs 0.6%). These data contradict the suggestion that A-P users had superior baseline cognition.

      The strongest protective association occurred in the youngest stratum (age 50–59; HR 0.34). At this age, when nearly all individuals are sufficiently healthy to travel internationally, health status is least likely to confound A-P uptake. A frailty-based “healthy traveler” hypothesis would instead predict the opposite pattern, with older adults showing the greatest apparent benefit, since health limitations are more likely to restrict travel in later life. In contrast, the protective association weakens with increasing age, empirically contradicting any explanation based on differential travel capacity.

      In conclusion, the empirical evidence directly contradicts the existence of a ‘healthy traveler’ effect.

      (2) The dose of atovaquone in atovaquone/proguanil is unlikely to be adequate suppression of toxo (much less for treatment/elimination of toxo), raising questions about the mechanism.

      A few important points should address the reviewer’s concern:

      In our cohorts, A-P was prescribed for malaria prophylaxis, as correctly noted. In this setting, it is taken for the entire duration of travel, plus several days before and after, typically resulting in many weeks of continuous exposure. This creates an unintentional but scientifically valuable natural experiment, in which a CNS-penetrating anti-Toxoplasma agent is administered for long durations.

      Atovaquone is an established treatment for CNS toxoplasmosis, has strong CNS penetration, and is included in current clinical guidelines for acute toxoplasmosis in immunocompromised patients, although at higher doses. Because latent, asymptomatic CNS colonization is not treated in clinical practice, there are currently no data establishing the dose required to eliminate bradyzoite-stage Toxoplasma in immunocompetent individuals.

      Our observations concern atovaquone–proguanil (A-P), a fixed-dose combination of atovaquone with proguanil, a DHFR inhibitor targeting a key metabolic pathway shared by malaria parasites and T. gondii. The combination has well-established synergistic effects in malaria prophylaxis and the same mechanism would be expected to enhance anti-Toxoplasma activity. This fixed-dose regimen has never been formally evaluated for toxoplasmosis treatment at prolonged durations or against latent bradyzoite infection.

      Our hypothesis does not require or imply complete eradication of Toxoplasma. A clinically meaningful reduction in latent cyst burden among the subset of colonized individuals may be sufficient to alter long-term disease trajectories. Thus, a population-level decrease in dementia incidence does not require universal clearance of infection, but only partial suppression or reduction of parasite load in susceptible individuals, which is entirely compatible with the known pharmacology and duration of A-P exposure.

      (3) Unmeasured bias in the small number of people who had toxoplasma serology in the TriNetX cohort.

      The relatively small number of older adults with Toxoplasma serology stems from current clinical practice: serologic testing is mostly performed in women during reproductive years due to risks in pregnancy, whereas in older adults a positive result has no clinical consequence and therefore testing is rarely ordered.

      Importantly, the seropositive and seronegative groups were drawn from the same underlying population of individuals who underwent serology testing, and the only difference between groups is the test result itself. Because the decision to order a test is made prior to and independent of the result, there is no plausible rationale by which the serology outcome (positive or negative) would introduce a bias favoring either group beyond the result of the test itself.

      Furthermore, the two groups were also rigorously matched on all major dementia risk factors, including age, sex, race/ethnicity, smoking, diabetes, hypertension, and BMI, and these characteristics are similarly distributed between groups. A small sample size does not imply bias; it simply reduces statistical power. Despite this limitation, the observed association (HR = 2.43, p = 0.001) remains strongly significant.

      Finally, this result is consistent with multiple published studies reporting higher rates of Toxoplasma seropositivity among individuals with Alzheimer’s disease, dementia, and even mild cognitive impairment, such that our finding reinforces a broader and independently observed epidemiologic pattern. Importantly, in our cohort the serology testing clearly preceded dementia diagnosis, which supports the plausibility of a causal rather than merely correlative relationship between latent toxoplasmosis and cognitive decline.

      To conclude our provisional response, we thank the editor and reviewers for raising points that will be further addressed and expanded upon in the discussion of the forthcoming revision. We welcome transparent scientific dialogue and acknowledge that, as with all observational research, residual confounding cannot be eliminated with absolute certainty. However, we disagree with the overall Assessment and emphasize that our findings, reproduced independently across two national health systems and three age-stratified cohorts, each rigorously matched on all major determinants of dementia risk, meet and in many respects exceed current standards for high-quality observational evidence.

      Assigning the results to “residual confounding” requires more than speculation: it requires identification of a confounding factor that is (1) anchored in established dementia risk literature, (2) empirically plausible, and (3) quantitatively capable of generating a sustained ~50 percent reduction in dementia incidence over a decade. No such factor has been identified to date. We note that the assertion of “residual confounding” has not been supported by a specific, quantitatively plausible mechanism. A hypothetical bias that is both extremely large in effect and uncorrelated with all major risk factors is not statistically or biologically credible.

      The explanation we propose, reduction in dementia risk through elimination of latent Toxoplasma gondii, is biologically grounded, directly supported by independent epidemiologic literature, and uniquely capable of accounting for all convergent observations in our data. No alternative hypothesis has been put forward that can plausibly explain these findings.

      A revised version of the manuscript will be submitted shortly, incorporating expanded baseline analyses, with the strictest possible exclusion criteria (including congenital, vascular, chromosomal, and neurodegenerative disorders such as Parkinson’s disease), and complete tabulated comparisons. These data will further reinforce that the observed protective associations are not attributable to any measurable confounding. We also plan to enhance the discussion in order to address the points raised by the reviewers.

      In light of the expanded analyses, any reservations expressed in the initial Assessment can now be re-evaluated on the basis of the empirical evidence. The findings reported in our study meet, and in several respects exceed, current epidemiologic standards for high-quality observational research, clearly warrant publication, and provide a robust scientific foundation for future mechanistic and interventional studies to determine whether elimination of latent toxoplasmosis can prevent or treat dementia.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) I have to admit that it took a few hours of intense work to understand this paper and to even figure out where the authors were coming from. The problem setting, nomenclature, and simulation methods presented in this paper do not conform to the notation common in the field, are often contradictory, and are usually hard to understand. Most importantly, the problem that the paper is trying to solve seems to me to be quite specific to the particular memory study in question, and is very different from the normal setting of model-comparative RSA that I (and I think other readers) may be more familiar with.

      We have revised the paper for clarity at all levels: motivation, application, and parameterization. We clarify that there is a large unmet need for using RSA in a trial-wise manner, and that this approach offers benefits to any team interested in decoding trial-wise representational information linked to behavioral responses. As such, it is not a problem specific to a single memory study.

      (2) The definition of "classical RSA" that the authors are using is very narrow. The group around Niko Kriegeskorte has developed RSA over the last 10 years, addressing many of the perceived limitations of the technique. For example, cross-validated distance measures (Walther et al. 2016; Nili et al. 2014; Diedrichsen et al. 2021) effectively deal with an uneven number of trials per condition and unequal amounts of measurement noise across trials. Different RDM comparators (Diedrichsen et al. 2021) and statistical methods for generalization across stimuli (Schütt et al. 2023) have been developed, addressing shortcomings in sensitivity. Finally, both a Bayesian variant of RSA (pattern component modelling; Diedrichsen, Yokoi, and Arbuckle 2018) and an encoding model (Naselaris et al. 2011) can effectively deal with continuous variables or features across time points or trials in a framework that is very related to RSA (Diedrichsen and Kriegeskorte 2017). The authors may not consider these newer developments to be classical, but they are in common use and certainly provide the solution to the problems raised in this paper in the setting of model-comparative RSA in which there is more than one repetition per stimulus.

      We appreciate the summary of relevant literature and have included a revised Introduction to address this bounty of relevant work. While much is owed to these authors, new developments from a diverse array of researchers outside of a single group can aid in new research questions, and should always have a place in our research landscape. We owe much to the work of Kriegeskorte’s group, and in fact, Schütt et al., 2023 served as a very relevant touchpoint in the Discussion and helped to highlight specific needs not addressed by the assessment of the “representational geometry” of an entire presented stimulus set. Principal amongst these needs is the application of trial-wise representational information that can be related to trial-wise behavioral responses and thus used to address specific questions on brain-behavior relationships. We invite the Reviewer to consider the utility of this shift with the following revisions to the Introduction.

      Page 3. “Recently, methodological advancements have addressed many known limitations in cRSA. For example, cross-validated distance measures (e.g., Euclidean distance) have improved the reliability of representational dissimilarities in the presence of noise and trial imbalance (Walther et al., 2016; Nili et al., 2014; Diedrichsen et al., 2021). Bayesian approaches such as pattern component modeling (Diedrichsen, Yokoi, & Arbuckle, 2018) have extended representational approaches to accommodate continuous stimulus features or temporal variation. Further, model comparison RSA strategies (Diedrichsen et al., 2021) and generalization techniques across stimuli (Schütt et al., 2023) have improved sensitivity and inference. Nevertheless, a common feature shared across most of these improvements is that they require stimulus repetition to examine the representational structure. This requirement limits their ability to probe brain-behavior questions at the level of individual events”.

      Page 8. “While several extensions of RSA have addressed key limitations in noise sensitivity, stimulus variance, and modeling (e.g., Diedrichsen et al., 2021; Schütt et al., 2023), our tRSA approach introduces a new methodological step by estimating representational strength at the trial level. This accounts for the multi-level variance structure in the data, affords generalizability beyond the fixed stimulus set, and allows one to test stimulus- or trial-level modulations of neural representations in a straightforward way”.

      Page 44. “Despite such prevalent appreciation for the neurocognitive relevance of stimulus properties, cRSA often does not account for the fact that the same stimulus (e.g., “basketball”) is seen by multiple subjects and produces statistically dependent data, an issue addressed by Schütt et al., 2023, who developed cross validation and bootstrap methods that explicitly model dependence across both subjects and stimulus conditions”.

      (3) The stated problem of the paper is to estimate "representational strength" in different regions or conditions. The authors define this as the correlation of the brain RDM with a model RDM. This metric conflates a number of factors, namely the variances of the stimulus-specific patterns, the variance of the noise, the true differences between different dissimilarities, and the match between the assumed model and the data-generating model. It took me a long time to figure out that the authors are trying to solve a quite different problem in a quite different setting from the model-comparative approach to RSA that I would consider "classical" (Diedrichsen et al. 2021; Diedrichsen and Kriegeskorte 2017). In this approach, one is trying to test whether local activity patterns are better explained by representation model A or model B, and to estimate the degree to which the representation can be fully explained. In this framework, it is common practice to measure each stimulus at least 2 times, to be able to estimate the variance of noise patterns and the variance of signal patterns directly. Using this setting, I would define "representational strength" very differently from the authors. Assume (using LaTeX notation) that the activity patterns $y_{j,n}$ for stimulus j, measurement n, are composed of a true stimulus-related pattern ($u_j$) and a trial-specific noise pattern ($e_{j,n}$). As a measure of the strength of representation (or pattern), I would use an unbiased estimate of the variance of the true stimulus-specific patterns across voxels and stimuli ($\sigma^2_{u}$). This estimator can be obtained by correlating patterns of the same stimuli across repeated measures, or equivalently, by averaging the cross-validated Euclidean distances (or with spatial prewhitening, Mahalanobis distances) across all stimulus pairs.

      In contrast, the current paper addresses a specific problem in a quite specific experimental design in which there is only one repetition per stimulus. This means that the authors have no direct way of distinguishing true stimulus patterns from noise processes. The trick that the authors apply here is to assume that the brain data comes from the assumed model RDM (a somewhat sketchy assumption IMO) and that everything that reduces this correlation must be measurement noise. I can now see why tRSA does make some sense for this particular question in this memory study. However, in the more common model-comparative RSA setting, having only one repetition per stimulus in the experiment would be quite a fatal design flaw. Thus, the paper would do better if the authors could spell out the specific problem addressed by their method right in the beginning, rather than trying to set up tRSA as a general alternative to "classical RSA".
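
      The estimator described in this comment can be illustrated with a minimal simulation. The sketch below assumes zero-mean patterns, exactly two repetitions per stimulus, and independent Gaussian noise; the function names and all parameter values are hypothetical and are not taken from the manuscript or the reviewer.

```python
import random

def simulate(n_stim=50, n_vox=200, sd_signal=1.0, sd_noise=2.0, seed=0):
    """Two repetitions per stimulus: y[j][n] = u[j] + e[j][n] (assumed model)."""
    rng = random.Random(seed)
    rep1, rep2 = [], []
    for _ in range(n_stim):
        u = [rng.gauss(0, sd_signal) for _ in range(n_vox)]    # true pattern u_j
        rep1.append([x + rng.gauss(0, sd_noise) for x in u])   # y_{j,1}
        rep2.append([x + rng.gauss(0, sd_noise) for x in u])   # y_{j,2}
    return rep1, rep2

def signal_variance_crossrep(rep1, rep2):
    """Mean product of the same stimulus/voxel across repetitions.

    The noise is independent across repetitions, so its contribution
    averages out and the expectation equals the signal variance.
    """
    prods = [a * b for p1, p2 in zip(rep1, rep2) for a, b in zip(p1, p2)]
    return sum(prods) / len(prods)

def naive_variance(rep):
    """Variance from a single repetition: signal and noise are confounded."""
    squares = [a * a for pattern in rep for a in pattern]
    return sum(squares) / len(squares)

rep1, rep2 = simulate()
print("cross-repetition estimate:", round(signal_variance_crossrep(rep1, rep2), 2))
print("single-repetition variance:", round(naive_variance(rep1), 2))
```

      With a single repetition only the second quantity is available, which is why signal and noise variance cannot be separated directly in that design.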

      At a general level, our approach rests on the premise that there is meaningful information present in a single presentation of a given stimulus. This assumption may have less utility when the research goals are more focused on estimating the fidelity of signal patterns for RSA, as in designs with multiple repetitions. But it is an exaggeration to state that such a trial-wise approach cannot address the difference between “true” stimulus patterns and noise. This trial-wise approach has explicit utility in relating trial-wise brain information to trial-wise behavior, across multiple cognitions (not only memory studies, as applied here). We have added substantial text to the Introduction distinguishing cRSA, which is widely employed, often in cases with a single repetition per stimulus, and model comparative methods that employ multiple repetitions. We clarify that we do not consider tRSA an alternative to the model comparative approach, and discuss that operational definitions of representational strength are constrained by the study design.

      Page 3. “In this paper, we present an advancement termed trial-level RSA, or tRSA, which addresses these limitations in cRSA (not model comparison approaches) and may be utilized in paradigms with or without repeated stimuli”.

      Page 4. “Representational geometry usually refers to the structure of similarities among repeated presentations of the same stimulus in the neural data (as captured in the brain RSM) and is often estimated utilizing a model comparison approach, whereas representational strength is a derived measure that quantifies how strongly this geometry aligns with a hypothesized model RSM. In other words, geometry characterizes the pattern space itself, while representational strength reflects the degree of correspondence between that space and the theoretical model under test”.

      Finally, we clarified that in our simulation methods we assume a true underlying activity pattern and a random error pattern. The model RSM is computed based on the true pattern, whereas the brain RSM comes from the noisy pattern, not the model RSM itself.

      Page 9. “Then, we generated two sets of noise patterns, which were controlled by parameters σ<sub>A</sub> and σ<sub>B</sub> , respectively, one for each condition”.

      (4) The notation in the paper is often conflicting and should be clarified. The actual true and measured activity patterns should receive a unique notation that is distinct from the variances of these patterns across voxels. I assume that $\sigma_{ijk}$ denotes the noise variances (not standard deviations)? Normally, variances are denoted with $\sigma^2$. Also, if these are variances, they cannot come from a normal distribution as indicated on page 10. Finally, multi-level models are usually defined at the level of means (i.e., patterns) rather than at the level of variances (as they seem to be done here).

      We have added notations for true and measured activity patterns to differentiate them from our notation for variance. We agree that multilevel models are usually defined at the level of means rather than at the level of variances, and we include a Figure (Fig 1D) that describes the model in terms of the means. We clarify that the σ ($\sigma$) terms used in the manuscript were not variances/standard deviations themselves; rather, they were meant to denote components of the actual (multilevel) variance parameter. Each component was sampled from a normal distribution, and the components were summed to give the final variance parameter for each trial. We have changed our notation for each component to the lowercase letter s to minimize confusion. We have also made our R code publicly available on our lab GitHub, which should provide more clarity on the exact simulation process.

      (5) In the first set of simulations, the authors sampled both model and brain RSM by drawing each cell (similarity) of the matrix from an independent bivariate normal distribution. As the authors note themselves, this way of producing RSMs violates the constraint that correlation matrices need to be positive semi-definite. Likely more seriously, it also ignores the fact that the different elements of the upper triangular part of a correlation matrix are not independent from each other (Diedrichsen et al. 2021). Therefore, it is not clear that this simulation is close enough to reality to provide any valuable insight and should be removed from the paper, along with the extensive discussion about why this simulation setting is plainly wrong (page 21). This would shorten and clarify the paper.

      We have added justification of the mixed-effects model given the potential assumption violations. We caution readers to investigate the robustness of their models, and to employ permutation testing that does not make independence assumptions. We have also added checks of the model residuals and an example of permutation testing in the Appendix. Finally, we agree that the first simulation setting does not possess several properties of realistic RDMs/RSMs; however, we believe that there is utility in understanding the mathematical properties of correlations – an essential component of RSA – in a straightforward simulation where the ground truth is known. We have therefore moved the simulation to Appendix 1.

      (6) If I understand the second simulation setting correctly, the true pattern for each stimulus was generated as an NxP matrix of i.i.d. standard normal variables. Thus, there is no condition-specific pattern at all, only condition-specific noise/signal variances. It is not clear how the tRSA would be biased if there were a condition-specific pattern (which, in reality, there usually is). Because of the i.i.d. assumption of the true signal, the correlations between all stimulus pairs within conditions are close to zero (and only differ from it by the fact that you are using a finite number of voxels). If you added a condition-specific pattern, the across-condition RSA would lead to much higher "representational strength" estimates than a within-condition RSA, with obvious problems and biases.

      The Reviewer is correct that the voxel values in the true pattern are drawn from i.i.d. standard normal distributions. We take the Reviewer’s suggestion of “condition-specific pattern” to mean that there could be a condition-voxel interaction in two non-mutually exclusive ways. The first is additive, essentially some common underlying multi-voxel pattern like [6, 34, -52, …, 8] for all condition A trials, and a different such pattern for condition B trials, etc. The second is multiplicative, essentially a vector of scaling factors [x1.5, x0.5, x0.8, …, x2.7] for all condition A trials, and a different such vector for condition B trials, etc. Both possibilities could indeed affect tRSA as much as they would cRSA.
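
      The two cases can be sketched in a few lines. This is an illustration under stated assumptions rather than our simulation code: the trial count, voxel count, pattern scale, and helper names below are hypothetical, and only one condition is simulated.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def mean_pairwise_corr(patterns):
    """Average correlation over all pairs of trial patterns."""
    n = len(patterns)
    vals = [pearson(patterns[i], patterns[j])
            for i in range(n) for j in range(i + 1, n)]
    return sum(vals) / len(vals)

rng = random.Random(0)
n_trials, n_vox = 20, 100   # hypothetical sizes for one condition

# Baseline: i.i.d. standard normal voxel values, as in the second simulation.
base = [[rng.gauss(0, 1) for _ in range(n_vox)] for _ in range(n_trials)]

# Additive case: one common multi-voxel pattern added to every trial.
shared = [rng.gauss(0, 2) for _ in range(n_vox)]
additive = [[v + s for v, s in zip(trial, shared)] for trial in base]

# Multiplicative case: one vector of voxel-wise scaling factors.
scale = [rng.uniform(0.5, 2.5) for _ in range(n_vox)]
multiplicative = [[v * s for v, s in zip(trial, scale)] for trial in base]

for name, data in [("baseline", base), ("additive", additive),
                   ("multiplicative", multiplicative)]:
    print(name, round(mean_pairwise_corr(data), 3))
```

      Comparing the printed within-condition correlations against the baseline shows how each manipulation changes the similarity structure that both cRSA and tRSA operate on.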

      Importantly, if such a strong condition-specific pattern is expected, one can build a condition-specific model RDM using one-hot coding of conditions (see example figure; src: https://www.newbi4fmri.com/tutorial-9-mvpa-rsa), either to capture this interesting phenomenon or to remove it as a confounding factor. This practice has been applied in multiple regression cRSA approaches (e.g., Cichy et al., 2013) and can also be applied to tRSA.
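
      As a minimal sketch of such a condition-specific model RDM, the snippet below codes each trial's condition as a one-hot (indicator) vector and marks same-condition pairs as 0 and different-condition pairs as 1. The function name and the trial labels are hypothetical and chosen only for illustration.

```python
def condition_model_rdm(labels):
    """Binary model RDM: 0 if two trials share a condition, 1 otherwise."""
    conditions = sorted(set(labels))
    # One-hot vector per trial: 1 in the column of its own condition.
    one_hot = [[1 if lab == c else 0 for c in conditions] for lab in labels]
    n = len(labels)
    rdm = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # The dot product of two indicator vectors is 1 only when
            # both trials belong to the same condition.
            same = sum(a * b for a, b in zip(one_hot[i], one_hot[j]))
            rdm[i][j] = 1 - same
    return rdm

labels = ["A", "A", "B", "B", "A"]   # hypothetical trial labels
for row in condition_model_rdm(labels):
    print(row)
```

      The resulting matrix can be entered as an additional model regressor, so that condition membership is either modeled explicitly or treated as a covariate of no interest.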

      (7) The trial-level Spearman correlations between the brain RDM and the model RDM were analyzed using a mixed-effects model. However, given the symmetry of the RDM, the correlations coming from different rows of the matrix are not independent, although independence is an assumption of the mixed-effects model. This does not seem to induce an increase in Type I errors in the conditions studied, but the procedure still needs a clear justification.

      We appreciate this important warning, and now caution readers to investigate the robustness of their models, and consider employing permutation testing that does not make independence assumptions. We have also added checks of the model residuals and an example of permutation testing in the supplement.

      Page 46. “While linear mixed-effects modeling offers a powerful framework for analyzing representational similarity data, it is critical that researchers carefully construct and validate their models. The multilevel structure of RSA data introduces potential dependencies across subjects, stimuli, and trials, which can violate assumptions of independence if not properly modeled. In the present study, we used a model that included random intercepts for both subjects and stimuli, which accounts for variance at these levels and improves the generalizability of fixed-effect estimates. Still, there is a potential for systematic dependence across trials within a subject. To ensure that the model assumptions were satisfied, we conducted a series of diagnostic checks on an exemplar ROI (right LOC; middle occipital gyrus) in the Object Perception dataset, including visual inspection of residual distributions and autocorrelation (Appendix 3, Figure 13). These diagnostics supported the assumptions of normality, homoscedasticity, and conditional independence of residuals. In addition, we conducted permutation-based inference, similar to prior improvements to cRSA (Nili et al. 2014), using a nested model comparison to test whether the mean similarity in this ROI was significantly greater than zero. The observed likelihood ratio test statistic fell in the extreme tail of the null distribution (Appendix 3, Figure 14), providing strong nonparametric evidence for the reliability of the observed effect. We emphasize that this type of model checking and permutation testing is not merely confirmatory but can help validate key assumptions in RSA modeling, especially when applying mixed-effects models to neural similarity data. Researchers are encouraged to adopt similar procedures to ensure the robustness and interpretability of their findings”.

      Exemplar Permutation Testing

      To test whether the mean representational strength in the ROI right LOC (middle occipital gyrus) was significantly greater than zero, we used a permutation-based likelihood ratio test implemented via the permlmer function. This test compares two nested linear mixed-effects models fit using the lmer function from the lme4 package, both including random intercepts for Participant and Stimulus ID to account for between-subject and between-item variability.

      The null model excluded a fixed intercept term, effectively constraining the mean similarity to zero after accounting for random effects:

      ROI ~ 0 + (1 | Participant) + (1 | Stimulus)

      The full model included the same random effects structure but allowed the intercept to be freely estimated:

      ROI ~ 1 + (1 | Participant) + (1 | Stimulus)

      By comparing the fit of these two models, we directly tested whether the average similarity in this ROI was significantly different from zero. Permutation testing (1,000 permutations) was used to generate a nonparametric p-value, providing inference without relying on normality assumptions. The full model, which estimated a nonzero mean similarity in the right LOC (middle occipital gyrus), showed a significantly better fit to the data than the null model that fixed the mean at zero (χ²(1) = 17.60, p = 2.72 × 10⁻⁵). The permutation-based p-value obtained from permlmer confirmed this effect as statistically significant (p = 0.0099), indicating that the mean similarity in this ROI was reliably greater than zero. These results support the conclusion that the right LOC contains representational structure consistent with the HMAXc2 RSM. A density plot of the permuted likelihood ratio tests is plotted along with the observed likelihood ratio test in Appendix 3 Figure 14.
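
      The general logic of a permutation test of a mean against zero can be illustrated with a deliberately simplified sketch. This is not the permlmer procedure described above: it ignores the random effects of Participant and Stimulus, uses the sample mean rather than a likelihood ratio as the test statistic, and builds the null distribution by random sign-flipping. The function name, arguments, and defaults are hypothetical.

```python
import random


def sign_flip_pvalue(values, n_perm=1000, seed=0):
    """One-sided permutation p-value for H0: the mean of `values` is zero.

    Assumes the values are symmetric around zero under H0, so that
    randomly flipping signs yields draws from the null distribution.
    """
    rng = random.Random(seed)
    observed = sum(values) / len(values)
    exceed = 0
    for _ in range(n_perm):
        # Flip the sign of each value with probability 0.5
        flipped = [v if rng.random() < 0.5 else -v for v in values]
        if sum(flipped) / len(flipped) >= observed:
            exceed += 1
    # Count the observed statistic as one permutation so p is never zero
    return (exceed + 1) / (n_perm + 1)
```

      With this correction, the smallest attainable p-value is 1 / (n_perm + 1), so the number of permutations sets the resolution of the test.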

      (8) For the empirical data, it is not clear to me to what degree the "representational strength" of cRSA and tRSA is actually comparable. In cRSA, the Spearman correlation assesses whether the distances in the data RSM are ranked in the same order as in the model. For tRSA, the comparison is made for every row of the RSM, which introduces a larger degree of flexibility (possibly explaining the higher correlations in the first simulation). Thus, could the gains presented in Figure 7D not simply arise from the fact that you are testing different questions? A clearer theoretical analysis of the difference between the average row-wise Spearman correlation and the matrix-wise Spearman correlation is urgently needed. The behavior will likely vary with the structure of the true model RDM/RSM.

      We agree that a clearer comparison between mean row-wise Spearman correlations and the matrix-wise Spearman correlation is needed. We believe that simulations are the best approach for this comparison, since they are more robust than the empirical dataset and have the advantage that the true pattern and noise levels are known. We have expanded our comparison of mean tRSA values and matrix-wise Spearman correlations on page 42.

      Page 42. “Although tRSA and cRSA both aim to quantify representational strength, they differ in how they operationalize this concept. cRSA summarizes the correspondence between RSMs as a single measure, such as the matrix-wise Spearman correlation. In contrast, tRSA computes such correspondence for each trial, enabling estimates at the level of individual observations. This flexibility allows trial-level variability to be modeled directly, but also introduces subtle differences in what is being measured. Nonetheless, our simulations showed that, although numerical differences occasionally emerged—particularly when comparing between-condition tRSA estimates to within-condition cRSA estimates—the magnitude of divergence was small and did not affect the outcome of downstream statistical tests”.
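
      The difference between the two quantities can be made concrete with a small sketch. It assumes that the matrix-wise measure is a single Spearman correlation over the upper triangle of the two RSMs and that the row-wise measure is one Spearman correlation per row with the diagonal excluded; the helper names are hypothetical and the actual tRSA implementation may differ in detail.

```python
def _ranks(x):
    """Average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def _pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def spearman(a, b):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return _pearson(_ranks(a), _ranks(b))


def matrix_wise(brain, model):
    """One correlation over all unique off-diagonal cells."""
    n = len(brain)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return spearman([brain[i][j] for i, j in pairs],
                    [model[i][j] for i, j in pairs])


def row_wise(brain, model):
    """One correlation per row, skipping the diagonal cell."""
    n = len(brain)
    return [spearman([brain[i][j] for j in range(n) if j != i],
                     [model[i][j] for j in range(n) if j != i])
            for i in range(n)]
```

      Because each row is ranked separately, a brain RSM can preserve the model's ordering within every row while changing the ordering of cells that never share a row. In that case every row-wise correlation equals 1 while the matrix-wise correlation is below 1, which is the extra flexibility the reviewer describes.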

      (9) For the real data, there are a number of additional sources of bias that need to be considered for the analysis. What if there are not only condition-specific differences in noise variance, but also a condition-specific pattern? Given that the stimuli were measured in 3 different imaging runs, you cannot assume that all measurement noise is i.i.d. - stimuli from the same run will likely have a higher correlation with each other.

      We recognize the potential of condition-specific patterns and chose to constrain the analyses to those most comparable with cRSA. However, depending on their hypotheses, researchers may consider testing condition RSMs and utilizing a model comparison approach or employ the z-scored approach, as employed in the simulations above. Regarding the potential run confounds, this is always the case in RSA and why we exclude within-run comparisons. We have also added to the Discussion the suggestion to include run as a covariate in their mixed-effects models. However, we do not employ this covariate here as we preferred the most parsimonious model to compare with cRSA.

      Page 46 - 47. “Further, while analyses here were largely employed to be comparable with cRSA, researchers should consider taking advantage of the flexibility of the mixed-effects models and include covariates of non-interest (run, trial order etc.)”.

      (10) The discussion should be rewritten in light of the fact that the setting considered here is very different from the model-comparative RSA in which one usually has multiple measurements per stimulus per subject. In this setting, existing approaches such as RSA or PCM do indeed allow for the full modelling of differences in the "representational strength" - i.e., pattern variance across subjects, conditions, and stimuli.

      We agree that studies advancing designs with multiple repetitions of a given stimulus image are useful in estimating the reliability of concept representations. We would argue however that model comparison in RSA is not restricted to such data. Many extant studies do not in fact have multiple repetitions per stimulus per subject (Wang et al., 2018 https://doi.org/10.1088/1741-2552/abecc3, Gao et al, 2022 https://doi.org/10.1093/cercor/bhac058, Li et al, 2022 https://doi.org/10.1002/hbm.26195, Staples & Graves, 2020 https://doi.org/10.1162/nol_a_00018) that allow for that type of model-comparative approach. While beneficial in terms of noise estimation, having multiple presentations was not a requirement for implementing cRSA (Kriegeskorte, 2008 https://doi.org/10.3389/neuro.06.004.2008). The aim of this manuscript is to introduce the tRSA approach to the broad community of researchers whose research questions and datasets could vary vastly, including but not limited to the number of repeated presentations and the balance of trial counts across conditions.

      (11) Cross-validated distances provide a powerful tool to control for differences in measurement noise variances and possible covariances in measurement noise across trials, which has many distinct advantages and is conceptually very different from the approach taken here.

      We have added language on the value of cross-validation approaches to RSA in the Discussion:

      Page 47. “Additionally, we note that while our proposed tRSA framework provides a flexible and statistically principled approach for modeling trial-level representational strength, we acknowledge that there are alternative methods for addressing trial-level variability in RSA. In particular, the use of cross-validated distance metrics (e.g., crossnobis distance) has become increasingly popular for controlling differences in measurement noise variance and accounting for possible covariance structures across trials (Walther et al., 2016). These metrics offer several advantages, including unbiased estimation of representational dissimilarities under Gaussian noise assumptions and improved generalization to unseen data. However, cross-validated distances are conceptually distinct from the approach taken here: whereas cross-validation aims to correct for noise-related biases in representational dissimilarity matrices, our trial-level RSA method focuses on estimating and modeling the variability in representation strength across individual trials using mixed-effects modeling. Rather than proposing a replacement for cross-validated RSA, tRSA adds a complementary tool to the methodological toolkit—one that supports hypothesis-driven inference about condition effects and trial-level covariates, while leveraging the full structure of the data”.

      (12) One of the main limitations of tRSA is the assumption that the model RDM is actually the true brain RDM, which may not be the case. Thus, in theory, there could be a different model RDM, in which representational strength measures would be very different. These differences should be explained more fully, hopefully leading to a more accessible paper.

      Indeed, the chosen model RSM may not be the true RSM, but as the noise level increases the correlation between RSMs practically becomes zero. In our simulations we assume this to be true as a straightforward way to manipulate the correspondence between the brain data and the model. However, just like cRSA, tRSA is constrained by the model selections the researchers employ. We encourage researchers to have carefully considered theoretically-motivated models and, if their research questions require, consider multiple and potentially competing models. Furthermore, the trial-wise estimates produced by tRSA encourage testing competing models within the multiple regression framework. We have added this language to the Discussion.

      Page 46. “...choose their model RSMs carefully. In our simulations, we designed our model RSM to be the “true” RSM for demonstration purposes. However, researchers should consider if their models and model alternatives...”.

      Pages 45-46. “While a number of studies have addressed the validity of measuring representational geometry using designs with multiple repetitions, a conceptual benefit of the tRSA approach is the reliance on a regression framework that engenders the testing of competing conceptual models of stimulus representation (e.g., taxonomic vs. encyclopedic semantic features, as in Davis et al., 2021)”.

      Reviewer #2 (Public review):

      (1) While I generally welcome the contribution, I take some issue with the accusatory tone of the manuscript in the Introduction. The text there (using words such as 'ignored variances', 'erroneous inferences', 'one must', 'not well-suited', 'misleading') appears aimed at turning cRSA into a 'straw man' with many limitations that other researchers have not recognized but that the new proposed method supposedly resolves. This can be written in a more nuanced, constructive manner without accusing the numerous users of this popular method of ignorance.

      We apologize for the unintended accusatory tone. We have clarified the many robust approaches to RSA and have made our Introduction and Discussion more nuanced throughout (see also 3, 11 and 16).

      (2) The described limitations are also not entirely correct, in my view: for example, statistical inference in cRSA is not always done using classic parametric statistics such as t-tests (cf Figure 1): the rsatoolbox paper by Nili et al. (2014) outlines non-parametric alternatives based on permutation tests, bootstrapping and sign tests, which are commonly used in the field. Nor has RSA ever been conducted at the row/column level (here referred to by the authors as 'trial level'; cf King et al., 2018).

      We agree that numerous methods go beyond cRSA to address these limitations. We have added discussion of them to our manuscript, as well as an example analysis implementing permutation tests on tRSA data (see response to 7). We thank the reviewer for bringing King et al., 2014 and their temporal generalization method to our attention; we have added a reference acknowledging their decoding-based temporal generalization approach.

      Page 8. “It is also important to note that some prior work has examined similarly fine-grained representations in time-resolved neuroimaging data, such as the temporal generalization method introduced by King et al. (see King & Dehaene, 2014). Their approach trains classifiers at each time point and tests them across all others, resulting in a temporal generalization matrix that reflects decoding accuracy over time. While such matrices share some structural similarity with RSMs, they do not involve correlating trial-level pattern vectors with model RSMs nor do their second-level models include trial-wise, subject-wise, and item-wise variability simultaneously”.

      (3) One of the advantages of cRSA is its simplicity. Adding linear mixed effects modeling to RSA introduces a host of additional 'analysis parameters' pertaining to the choice of the model setup (random effects, fixed effects, interactions, what error terms to use) - how should future users of tRSA navigate this?

      We appreciate the opportunity to offer more specific recommendations for those employing the tRSA technique, and have added them to the Discussion:

      Page 46. “While linear mixed-effects modeling offers a powerful framework for analyzing representational similarity data, it is critical that researchers carefully construct and validate their models and choose their model RSMs carefully. In our simulations, we designed our model RSM to be the “true” RSM for demonstration purposes. However, researchers should always consider if their models match the goals of their analysis, including 1) constructing the random effects structure that will converge in their dataset and 2) testing their model fits against alternative structures (Meteyard & Davies, 2020; Park et al., 2020) and 3) considering which effects should be considered random or fixed depending on their research question”.

      (4) Here, only a single real fMRI dataset is used with a quite complicated experimental design for the memory part; it's not clear if there is any benefit of using tRSA on a simpler real dataset. What's the benefit of tRSA in classic RSA datasets (e.g., Kriegeskorte et al., 2008), with fixed stimulus conditions and no behavior?

      To clarify, our empirical approach uses two different tasks: an Object Perception task more akin to the classic RSA datasets employing passive viewing, and a Conceptual Retrieval task that more directly addresses the benefits of the trialwise approach. We felt that our Object Perception dataset is a simpler empirical fMRI dataset without explicit task conditions or a dichotomous behavioral outcome, whereas the Retrieval dataset is more involved (though old/new recognition is the most common form of memory retrieval testing) and dependent on behavioral outcomes. However, we recognize the utility of replication from other research groups and do invite researchers to utilize tRSA on their datasets.

      (5) The cells of an RDM/RSM reflect pairwise comparisons between response patterns (typically a brain but can be any system; cf Sucholutsky et al., 2023). Because the response patterns are repeatedly compared, the cells of this matrix are not independent of one another. Does this raise issues with the validity of the linear mixed effects model? Does it assume the observations are linearly independent?

      We recognize the danger of not meeting model assumptions. Though our simulation results and model checks suggest this is not a fatal flaw in the model design, we caution readers to investigate the robustness of their models and to consider employing permutation testing, which does not make independence assumptions. We have also added checks of the model residuals and an example of permutation testing in the Appendix. See response to R1.

      (6) The manuscript assumes the reader is familiar with technical statistical terms such as Type I/II error, sensitivity, specificity, homoscedasticity assumptions, as well as linear mixed models (fixed effects, random effects, etc). I am concerned that this jargon makes the paper difficult to understand for a broad readership or even researchers currently using cRSA that might be interested in trying tRSA.

      We agree this jargon may make the paper difficult to understand. We have expanded or added definitions of these terms throughout the methods and results sections.

      Page 12. “Given data generated with 𝑠<sub>𝑐𝑜𝑛𝑑,𝐴</sub> = 𝑠<sub>𝑐𝑜𝑛𝑑,B</sub>, the correct inference should be a failure to reject the null hypothesis of b<sub>B-𝐴</sub> = 0; any significant (𝑝 < 0.05) result in either direction was considered a false positive (spurious effect, or Type I error). Given data generated with 𝑠<sub>𝑐𝑜𝑛𝑑,𝐴</sub> < 𝑠<sub>𝑐𝑜𝑛𝑑,B</sub>, the inference was considered correct if it rejected the null hypothesis of b<sub>B-𝐴</sub> = 0 and yielded the expected sign of the estimated contrast (b<sub>B-𝐴</sub> < 0). A significant result with the reverse sign of the estimated contrast (b<sub>B-𝐴</sub> > 0) was considered a Type I error, and a nonsignificant (𝑝 ≥ 0.05) result was considered a false negative (failure to detect a true effect, or Type II error)”.

      Page 2. “Compared to cRSA, the multi-level framework of tRSA was both more theoretically appropriate and significantly more sensitive to (better able to detect) true effects”.

      Page 25. “The performance of cRSA and tRSA was quantified by their specificity (ability to avoid false positives; 1 - Type I error rate) and sensitivity (ability to avoid false negatives; 1 - Type II error rate)”.

      Page 6. “One of the fundamental assumptions of general linear models (step 4 of cRSA; see Figure 1D) is homoscedasticity or homogeneity of variance; that is, all residuals should have equal variance”.
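
      The outcome classification and the two performance measures quoted above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the "correct"/"type_I"/"type_II" labels, and the choice to compute both rates over a single list of simulation outcomes are hypothetical, and the expected sign defaults to negative to match the quoted contrast.

```python
def classify(p, estimate, effect_present, alpha=0.05, expected_sign=-1):
    """Label one simulated test result as correct, Type I, or Type II."""
    if p >= alpha:
        # Nonsignificant: correct under the null, a miss otherwise
        return "type_II" if effect_present else "correct"
    if not effect_present:
        # Significant in either direction when no effect was simulated
        return "type_I"
    # Significant with a true effect: correct only with the expected sign
    return "correct" if estimate * expected_sign > 0 else "type_I"


def specificity_sensitivity(labels):
    """Return (1 - Type I error rate, 1 - Type II error rate)."""
    n = len(labels)
    type_i = sum(label == "type_I" for label in labels) / n
    type_ii = sum(label == "type_II" for label in labels) / n
    return 1 - type_i, 1 - type_ii
```

      For example, `classify(0.01, 0.3, effect_present=True)` returns `"type_I"`, because a significant contrast with the wrong sign is counted as a spurious effect rather than a detection.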

      Page 11. “Specifically, a linear mixed-effects model with a fixed effect of condition (which estimates the average effect across the entire sample, capturing the overall effect of interest) and random effects of both subjects and stimuli (which model variation in responses due to differences between individual subjects and items, allowing generalization beyond the sample) was fitted to tRSA estimates via the `lme4 1.1-35.3` package in R (Bates et al., 2015), and p-values were estimated using Satterthwaite’s method via the `lmerTest 3.1-3` package (Kuznetsova et al., 2017)”.

      (7) I could not find any statement on data availability or code availability. Given that the manuscript reuses prior data and proposes a new method, making data and code/tutorials openly available would greatly enhance the potential impact and utility for the community.

      We thank the reviewer for pointing out our oversight here. We have added our code and data availability statements.

      Page 9. “Data are available upon request to the corresponding author, and our simulations and example tRSA code are available at https://github.com/electricdinolab”.

      Reviewer #1 (Recommendations for the authors):

      (13) Page 4: The limitations of cRSA seem to be based on the assumption that within each different experimental condition, there are different stimuli, which get combined into the condition. The framework of RSA, however, does not dictate whether you calculate a condition x condition RDM or a larger and more complete stimulus x stimulus RDM. Indeed, in practice we often do the latter? Or are you assuming that each stimulus is only shown once overall? It would be useful at this point to spell out these implicit assumptions.

      We agree that stimulus x stimulus RDMs can be constructed and are often used. However, as we mentioned in the Introduction, researchers are often interested in the difference between two (or more) conditions, such as “remembered” vs. “forgotten” (Davis et al., https://doi.org/10.1093/cercor/bhaa269) or “high cognitive load” vs. “low cognitive load” (Beynel et al., https://doi.org/10.1523/JNEUROSCI.0531-20.2020). In those cases, the most common practice with cRSA is to construct condition-specific RDMs, compute cRSA scores separately for each condition, and then compare the scores at the group level. The number of times each stimulus gets presented does not prevent one from creating a model RDM that has the same rows and columns as the brain RDM, either in the same condition (“high load”) or across different conditions.

      (14) Page 5: The difference between condition-level and stimulus-level is not clear. Indeed, this definition seems to be a function of the exact experimental design and is certainly up for interpretation. For example, if I conduct a study looking at the activity patterns for 4 different hand actions, each repeated multiple times, are these actions considered stimuli or conditions?

      We have added clarifying language about what is considered stimuli vs conditions. Indeed, this will depend on the specific research questions being employed and will affect how researchers construct their models. In this specific example, one would most likely consider each different hand action a condition, treating them as fixed effects rather than random effects, given their very limited number and the lack of need to generalize findings to the broader “hand actions” category.

      Page 5. “Critically, the distinction between condition-level and stimulus level is not always clear as researchers may manipulate stimulus-level features themselves. In these cases, what researchers ultimately consider condition-level and stimulus-level will depend on their specific research questions. For example, researchers intending to study generalized object representation may consider object category a stimulus-level feature, while researchers interested in if/how object representation varies by category may consider the same category variable condition-level”.

      (15) Page 5: The fact that different numbers of trials / different levels of measurement noise / noise-covariance of different conditions biases non-cross-validated distances is well known and repeatedly expressed in the literature. We have shown that cross-validation of distances effectively removes such biases - of course, it does not remove the increased estimation variability of these distances (for a formal analysis of estimation noise on condition patterns and variance of the cross-nobis estimator, see (Diedrichsen et al. 2021)).

      We thank the reviewer for drawing our attention to this literature and have added discussions of these methods.

      (16) Page 5: "Most studies present subjects with a fixed set of stimuli, which are supposedly samples representative of some broader category". This may be the case for a certain type of RSA experiments in the visual domain, but it would be unfair to say that this is a feature of RSA studies in general. In most studies I have been involved in, we use a "stimulus" x "stimulus" RDM.

      We have edited this sentence to avoid the “most” characterization. We also added substantial text to the Introduction and Discussion distinguishing cRSA, which is nonetheless widely employed, especially in cases with a single repetition per stimulus (Macklin et al., 2023; Liu et al., 2024), from the model-comparative method, and explicitly stating that we do not consider tRSA an alternative to the model-comparative approach.

      (17) Page 5: I agree that "stimuli" should ideally be considered a random effect if "stimuli" can be thought of as sampled from a larger population and one wants to make inferences about that larger population. Sometimes stimuli/conditions are more appropriately considered a fixed effect (for example, when studying the response to stimulation of the 5 fingers of the right hand). Techniques to consider stimuli/conditions as a random effect have been published by the group of Niko Kriegeskorte (Schütt et al. 2023).

      Indeed, in some cases what may be thought of as “stimuli” would be more appropriately entered into the model as a fixed effect; such questions are increasingly relevant given the focus on item-wise stimulus properties (Bainbridge et al., Westfall & Yarkoni). We have added text on this issue to the Discussion and caution researchers to employ models that most directly answer their research questions.

      Page 46. “However, researchers should always consider if their models match the goals of their analysis, including 1) constructing the random effects structure that will converge in their dataset and 2) testing their model fits against alternative structures (Meteyard & Davies, 2020; Park et al., 2020) and 3) considering which effects should be considered random or fixed depending on their research question. An effect is fixed when the levels represent the specific conditions of theoretical interest (e.g., task condition) and the goal is to estimate and interpret those differences directly. In contrast, an effect is random when the levels are sampled from a broader population (e.g., subjects) and the goal is to account for their variability while generalizing beyond the sample tested. Note that the same variable (e.g., stimuli) may be considered fixed or random depending on the research questions”.

      (18) Page 6: It is correct that the "classical" RSA depends on a categorical assignment of different trials to different stimuli/conditions, such that a stimulus x stimulus RDM can be computed. However, both Pattern Component Modelling (PCM) and Encoding models are ideally set up to deal with variables that vary continuously on a trial-by-trial or moment-by-moment basis. tRSA should be compared to these approaches, or - as it should be clarified - that the problem setting is actually quite a different one.

      We agree that PCM and encoding models offer a flexible approach and handle continuous trial-by-trial variables. We have clarified on page 6 that the problem setting of cRSA is distinct, and we have added discussion of the robustness of encoding models and their limitations to the Discussion.

      Page 6. “While other approaches such as Pattern Component Modeling (PCM) (Diedrichsen et al., 2018) and encoding models (Naselaris et al., 2011) are well-suited to analyzing variables that vary continuously on a trial-by-trial or moment-by-moment basis, these frameworks address different inferential goals. Specifically, PCM and encoding models focus on estimating variance components or predicting activation from features, while cRSA is designed to evaluate representational geometry. Thus, cRSA as well as our proposed approach address a problem setting distinct from PCM and encoding models”.

      (19) Page 8: "Then, we generated two noise patterns, which were controlled by parameters 𝜎<sub>𝐴</sub> and 𝜎<sub>𝐵</sub>, respectively, one for each condition." This makes little sense to me. The noise patterns should be unique to each trial - you should generate n_a + n_b noise patterns, no?

      We clarify that the “noise patterns” here are n_voxel x n_trial in size; in other words, all trial-level noise patterns are generated together and each trial has their own unique noise pattern. We have revised our description as “two sets of noise patterns” for clarity starting on page 9.
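
      The shape of these noise sets can be illustrated with a short sketch. It assumes independent zero-mean Gaussian noise with a single standard deviation per condition; the function name, the dimensions, and the standard deviations in the usage lines are hypothetical and are not the values used in the simulations.

```python
import random


def noise_patterns(n_voxel, n_trial, sd, rng):
    """Return an n_voxel x n_trial set of noise values.

    Each column is the noise pattern of one trial, so every trial
    receives its own pattern even though the set is drawn in one call.
    """
    return [[rng.gauss(0.0, sd) for _ in range(n_trial)]
            for _ in range(n_voxel)]


rng = random.Random(0)
noise_a = noise_patterns(n_voxel=50, n_trial=10, sd=0.5, rng=rng)  # condition A
noise_b = noise_patterns(n_voxel=50, n_trial=12, sd=2.0, rng=rng)  # condition B
```

      Drawing one set per condition therefore yields n_a + n_b distinct trial-level patterns in total, as the reviewer expected.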

      (20) Page 9: First, I assume if this is supposed to be a hierarchical level model, the "noise parameters" here correspond to variances? Or do these \sigma values mean to signify standard deviations? The latter would make little sense. Or is it the noise pattern itself?

      As clarified in 4., the σ values are meant to denote hierarchical components of the composite standard deviation; we have updated our notation to use lower case letter s instead for clarity.

      (21) Page 10: your formula states "𝜎<sub>𝑠𝑢𝑏𝑗</sub>~ 𝙽(0, 0.5^2)". This conflicts with your previous mention that the \sigmas are noise "levels"; are they the noise patterns themselves now? Variances cannot be normally distributed, as they cannot be negative.

      As clarified in 4., the σ values are meant to denote hierarchical components of the composite standard deviation; we have updated our notation to use lower case letter s instead for clarity.

      (22) Page 13: What was the task of the subject in the Memory retrieval task? Old/new judgements relative to encoding of object perception?

      We apologize for the lack of clarity about the Memory Retrieval task and have added that information and clarified that the old/new judgements were relative to a separate encoding phase, the brain data for which has been reported elsewhere.

      Page 14. “Memory Retrieval took place one day after Memory Encoding and involved testing participants’ memory of the objects seen in the Encoding phase. Neural data during the Encoding phase has been reported elsewhere. In the main Memory Retrieval task, participants were presented with 144 labels of real-world objects, of which 114 were labels for previously seen objects and 30 were unrelated novel distractors. Participants performed old/new judgements, as well as their confidence in those judgements on a four-point scale (1 = Definitely New, 2 = Probably New, 3 = Probably Old, 4 = Definitely Old)”.

      (23) Page 13: If "Memory Retrieval consisted of three scanning runs", then some of the stimulus x stimulus correlations for the RSM must have been calculated within a run and some between runs, correct? Given that all within-run estimates share a common baseline, they share some dependence. Was there a systematic difference between the within-run and the between-run correlations?

      We have clarified in this portion of the methods that within-run comparisons were excluded from our analyses. We also double-checked that the within-run exclusion was included in the description of the Neural RSMs.

      Page 14. “Retrieval consisted of three scanning runs, each with 38 trials, lasting approximately 9 minutes and 12 seconds (within-run comparisons were later excluded from RSA analyses)”.

      Page 18. “This was done by vectorizing the voxel-level activation values within each region and calculating their correlations using Pearson’s r, excluding all within-run comparisons.”
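
      The within-run exclusion amounts to keeping only those trial pairs whose members come from different runs. A minimal sketch, with a hypothetical function name and run labels supplied as a plain list:

```python
def between_run_pairs(run_labels):
    """Return index pairs (i, j), i < j, whose trials come from different runs."""
    n = len(run_labels)
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if run_labels[i] != run_labels[j]]


# Toy example: six trials, two per run
pairs = between_run_pairs([1, 1, 2, 2, 3, 3])
```

      For three runs of 38 trials each, as in the quoted design, this keeps 4,332 of the 6,441 unique trial pairs.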

      (24) Page 20: It is not clear why the mean estimate of "representational strength" (i.e., model-brain RSM correlations) is important at all. This comes back to Major point #2, namely that you are trying to solve a very different problem from model-comparative RSA.

      We have clarified that our approach is not an alternative to model-comparative RSA, and that depending on the task constraints researchers may choose to compare models with tRSA or other approaches requiring stimulus repetition (see 3).

      (25) Page 21: I believe the problems of simulating correlation matrices directly in the way that the authors in their first simulation did should be well known and should be moved to an appendix at best. Better yet, the authors could start with the correct simulation right away.

      We agree the paper is more concise with these simulations being moved to the appendix and more briefly discussed. We have implemented these changes (Appendix 1). However, we are not certain that this problem is unknown, and have several anecdotes of researchers inquiring about this “alternative” approach in talks with colleagues, thus we do still discuss the issues with this method.

      (26) Page 26: Is the "underlying continuous noise variable 𝜎𝑡𝑟𝑖𝑎𝑙 that was measured by 𝑣𝑚𝑒𝑎𝑠𝑢𝑟𝑒𝑑 " the variance of the noise pattern or the noise pattern itself? What does it mean it was "measured" - how?

      𝜎<sub>𝑡𝑟𝑖𝑎𝑙</sub> is a vector of standard deviations, one per trial, and its i-th element 𝜎<sub>𝑡𝑟𝑖𝑎𝑙,i</sub> is used to generate the noise pattern for trial i. 𝑣<sub>𝑚𝑒𝑎𝑠𝑢𝑟𝑒𝑑</sub> is a hypothetical measurement of trial-level variability, such as “memorability” or “heartbeat variability”. We have revised our description to clarify our methods.

      Reviewer #2 (Recommendations for the authors):

      (8) It would be helpful to provide more clarity earlier on in the manuscript on what is a 'trial': in my experience, a row or column of the RDM is usually referred to as 'stimulus condition', which is typically estimated on multiple trials (instances or repeats) of that stimulus condition (or exemplars from that stimulus class) being presented to the subject. Here, a 'trial' is both one measurement (i.e., single, individual presentation of a stimulus) and also an entry in the RDM, but is this the most typical scenario for cRSA? There is a section in the Discussion that discusses repetitions, but I would welcome more clarity on this from the get-go.

      We have added discussion of stimulus repetition methods and datasets to the Introduction and clarified our use of the terms.

      Page 8. “Critically, in single-presentation designs, a “trial” refers to one stimulus presentation, and corresponds to a row or column in the RSM. In studies with repeated stimuli, these rows are often called “conditions” and may reflect aggregated patterns across trials. tRSA is compatible with both cases: whether rows represent individual trials or averaged trials that create “conditions”, tRSA estimates are computed at the row level”.

      (9) The quality of the results figures can be improved. For example, axes labels are hard to read in Figure 3A/B, panels 3C/D are hard to read in general. In Figure 7E, it's not possible to identify the 'dark red' brain regions in addition to the light red ones.

      We thank the reviewer for raising these and have edited the figures to be more readable in the manner suggested.

      (10) I would be interested to see a comparison between tRSA and cRSA in other fMRI (or other modality) datasets that have been extensively reported in the literature. These could be the original Kriegeskorte 96 stimulus monkey/fMRI datasets, commonly used open datasets in visual perception (e.g., THINGS, NSD), or the above-mentioned King et al. dataset, which has been analyzed in various papers.

      We recognize the great utility of replication from other research groups and do invite researchers to utilize tRSA on their datasets.

      (11) On P39, the authors suggest 'researchers can confidently replace their existing cRSA analysis with tRSA': Please discuss/comment on how researchers should navigate the choice of modeling parameters in tRSA's linear mixed effects setting.

      We have added discussion of the mixed-effects parameters and encourage researchers to follow best practices for their model selection.

      Page 46. “However, researchers should always consider if their models match the goals of their analysis, including 1) constructing the random effects structure that will converge in their dataset and 2) testing their model fits against alternative structures (Meteyard & Davies, 2020; Park et al., 2020) and 3) considering which effects should be considered random or fixed depending on their research question”.

      (12) The final part of the Results section, demonstrating the tRSA results for the continuous memorability factor in the real fMRI data, could benefit from some substantiation/elaboration. It wasn't clear to me, for example, to what extent the observed significant association between representational strength and item memorability in this dataset is to be 'believed'; the Discussion section (p38). Was there any evidence in the original paper for this association? Or do we just assume this is likely true in the brain, based on prior literature by e.g. Bainbridge et al (who probably did not use tRSA but rather classic methods)?

      Indeed, memorability effects have been replicated in the literature, but not using the tRSA method. We have expanded our discussion to clarify the relationship of our findings and the relevant literature and methods it has employed.

      Page 38. “Critically, memorability is a robust stimulus property that is consistent across participants and paradigms (Bainbridge, 2022). Moreover, object memorability effects have been replicated using a variety of methods aside from tRSA, including univariate analyses and representational analyses of neural activity patterns where trial-level neural activity pattern estimates are correlated directly with object memorability (Slayton et al, 2025).”

      (13) The abstract could benefit from more nuance; I'm not sure if RSA can indeed be said to be 'the principal method', and whether it's about assessing 'quality' of representations (more commonly, the term 'geometry' or 'structure' is used).

      We have edited the abstract to reflect the true nuance in the current approaches.

      Abstract. Neural representation refers to the brain activity that stands in for one’s cognitive experience, and in cognitive neuroscience, a prominent method of studying neural representations is representational similarity analysis (RSA). While there are several recent advances in RSA, the classic RSA (cRSA) approach examines the structure of representations across numerous items by assessing the correspondence between two representational similarity matrices (RSMs): usually one based on a theoretical model of stimulus similarity and the other based on similarity in measured neural data.

      (14) RSA is also not necessarily about models vs. neural data; it can also be between two neural systems (e.g., monkey vs. human as in Kriegeskorte et al., 2008) or model systems (see Sucholutsky et al., 2023). This statement is also repeated in the Introduction paragraph 1 (later on, it is correctly stated that comparing brain vs. model is most likely the 'most common' approach).

      We have added these examples in our introduction to RSA.

      Page 3. “One of the central approaches for evaluating information represented in the brain is representational similarity analysis (RSA), an analytical approach that queries the representational geometry of the brain in terms of its alignment with the representational geometry of some cognitive model (Kriegeskorte et al., 2008; Kriegeskorte & Kievit, 2013), or, in some cases, compares the representational geometry of two neural systems (e.g., Kriegeskorte et al., 2008) or two model systems (Sucholutsky et al., 2023)”.

      (15) 'theoretically appropriate' is an ambiguous statement, appropriate for what theory?

      We apologize for the ambiguous wording, and have corrected the text:

      Page 11. “Critically, tRSA estimates were submitted to a mixed-effects model which is statistically appropriate for modeling the hierarchical structure of the data, where observations are nested within both subjects and stimuli (Baayen et al., 2008; Chen et al., 2021)”.

      (16) I found the statement that cRSA "cannot model representation at the level of individual trials" confusing, as it made me think, what prohibits one from creating an RDM based on single-trial responses? Later on, I understood that what the authors are trying to say here (I think) is that cRSA cannot weigh the contributions of individual rows/columns to the overall representational strength differently.

      We thank the reviewer for their clarifying language and have added it to this section of the manuscript.

      Abstract. “However, because cRSA cannot weigh the contributions of individual trials (RSM rows/columns), it is fundamentally limited in its ability to assess subject-, stimulus-, and trial-level variances that all influence representation”.

      (17) Why use "RSM" instead of "RDM"? If the pairwise comparison metric is distance-based (e.g., 1-correlation as described by the authors), RDM is more appropriate.

      We apologize for the error, and have clarified the Methods text:

      Pages 3-4. “First, brain activity responses to a series of N trials are compared against each other (typically using Pearson’s r) to form an N×N representational similarity matrix”.

      (18) Figure 2: please write 'Correlation estimate' in the y-axis label rather than 'Estimate'.

      We have edited the label in Figure 2.

      (19) Page 6 'leaving uncertain the directionality of any findings' - I do not follow this argument. Obviously one can generate an RDM or RSM from vector v or vector -v. How does that invalidate drawing conclusions where one e.g., partials out the (dis)similarity in e.g., pleasantness ratings out of another RDM/RSM of interest?

      We agree such an approach does not invalidate the partial method; we have clarified what we mean by “directionality”.

      Page 8. “For instance, even though a univariate random variable, such as pleasantness ratings, can be conveniently converted to an RSM using pairwise distance metrics (Weaverdyck et al., 2020), the very same RSM would also be derived from the opposite random variable, leaving uncertain the directionality (or if representation is strongest for pleasant or unpleasant items) of any findings with the RSM (see also Bainbridge & Rissman, 2018)”.
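
      The sign ambiguity described in this passage can be checked numerically. The sketch below is illustrative rather than the authors' procedure: it assumes similarity is defined as the negative absolute pairwise difference, which is one common choice, and the ratings vector `v` is randomly generated placeholder data.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=20)  # hypothetical ratings for 20 items


def rsm_from_univariate(x):
    # Similarity as negative absolute pairwise difference (assumed metric)
    return -np.abs(x[:, None] - x[None, :])


rsm_pos = rsm_from_univariate(v)
rsm_neg = rsm_from_univariate(-v)

# Flipping the sign of the variable leaves every pairwise distance unchanged,
# so the two matrices are identical
same = np.allclose(rsm_pos, rsm_neg)
```

      Because |a - b| equals |(-a) - (-b)| for every pair, any analysis that sees only the RSM cannot tell whether high or low values of the original variable drive an effect.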

      (20) P7 'sampled 19900 pairs of values from a bi-variate normal distribution', but the rows/columns in an RDM are not independent samples - shouldn't this be included in the simulation? I.e., shouldn't you simulate first the n=200 vectors, and then draw samples from those, as in the next analysis?

      This section has been moved to Appendix 1 (see responses to Reviewer 1.13).

      (21) Under data acquisition, please state explicitly that the paper is re-using data from prior experiments, rather than collecting data anew for validating tRSA.

      We have clarified this in the data acquisition section.

      Page 13. “A pre-existing dataset was analyzed to evaluate tRSA. Main study findings have been reported elsewhere (S. Huang, Bogdan, et al., 2024)”.

      (22) Figure 4 could benefit from some more explanation in-text. It wasn't clear to me, for example, how to interpret the asterisks depicted in the right part of the figure.

      We clarified the meaning of the asterisks in the main text in addition to the existent text in the figure caption.

      Page 26. “(see Figure 4, off-diagonal cells in blue; asterisks indicate where tRSA was statistically more sensitive than cRSA)”.

      (23) Page 38 "the outcome of tRSA's improved characterization can be seen in multiple empirical outcomes:" it seems there is one mention of 'outcomes' too many here.

      We have revised this sentence.

      Page 41. “tRSA's improved characterization can be seen in multiple empirical outcomes”.

      (24) Page 38 "model fits became the strongest" it's not clear what aspect of the reported results in the paragraph before this is referring to - the Appendix?

      Yes, the model fits are in the Appendix; we have added this in-text citation.

      Moreover, model-fits became the strongest when the models also incorporated trial-level variables such as fMRI run and reaction time (Appendix 3, Table 6).

      References

      Diedrichsen, J., Berlot, E., Mur, M., Schütt, H. H., Shahbazi, M., & Kriegeskorte, N. (2021). Comparing representational geometries using whitened unbiased-distance-matrix similarity. Neurons, Behavior, Data and Theory, 5(3). https://arxiv.org/abs/2007.02789

      Diedrichsen, J., & Kriegeskorte, N. (2017). Representational models: A common framework for understanding encoding, pattern-component, and representational-similarity analysis. PLoS Computational Biology, 13(4), e1005508.

      Diedrichsen, J., Yokoi, A., & Arbuckle, S. A. (2018). Pattern component modeling: A flexible approach for understanding the representational structure of brain activity patterns. NeuroImage, 180, 119-133.

      Naselaris, T., Kay, K. N., Nishimoto, S., & Gallant, J. L. (2011). Encoding and decoding in fMRI. NeuroImage, 56(2), 400-410.

      Nili, H., Wingfield, C., Walther, A., Su, L., Marslen-Wilson, W., & Kriegeskorte, N. (2014). A toolbox for representational similarity analysis. PLoS Computational Biology, 10(4), e1003553.

      Schütt, H. H., Kipnis, A. D., Diedrichsen, J., & Kriegeskorte, N. (2023). Statistical inference on representational geometries. eLife, 12. https://doi.org/10.7554/eLife.82566

      Walther, A., Nili, H., Ejaz, N., Alink, A., Kriegeskorte, N., & Diedrichsen, J. (2016). Reliability of dissimilarity measures for multi-voxel pattern analysis. NeuroImage, 137, 188-200.

      King, M. L., Groen, I. I., Steel, A., Kravitz, D. J., & Baker, C. I. (2019). Similarity judgments and cortical visual responses reflect different properties of object and scene categories in naturalistic images. NeuroImage, 197, 368-382.

      Kriegeskorte, N., Mur, M., Ruff, D. A., Kiani, R., Bodurka, J., Esteky, H., ... & Bandettini, P. A. (2008). Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron, 60(6), 1126-1141.

      Sucholutsky, I., Muttenthaler, L., Weller, A., Peng, A., Bobu, A., Kim, B., ... & Griffiths, T. L. (2023). Getting aligned on representational alignment. arXiv preprint arXiv:2310.13018.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #3:

      Comments on revised version:

      This revised version is largely improved, and the responses to reviewers' comments are generally relevant. However, the response regarding pre-nodes is not satisfactory. I understand that the authors prefer to avoid further experimentation, but I think this is an important point that needs to be clarified. Exploring stages between E12 and E15 is therefore important. When carefully examining some of the figures (Fig. 1E or 2D), I think that at E15 there may well be pre-node formation prior to myelin deposition, on structures the authors considered to be heminodes. To be convincing, they should use double or triple labeling with, in addition to the nodal proteins (ankG and/or Nav pan), a good myelin marker such as anti-PLP. The rat monoclonal developed by the late Prof. Ikenaka would give a sharper staining than the anti-MAG they used. (I assume the clone must still be available in Okazaki.)

      We appreciate your insightful comment regarding the possible presence of pre-nodal clusters along NM axons and your kind suggestion to use the PLP antibody (clone AA3; Yamamura et al., J Neurochem, 1991). We have obtained this monoclonal antibody from Dr. Kenji Tanaka previously in Okazaki and confirmed that it works well in chicken tissues. However, since this clone recognizes both PLP and DM-20 isoforms, it labels not only myelin-forming oligodendrocytes (MFOLs) but also newly formed oligodendrocytes (NFOLs) (Yokoyama et al., J Neurochem, 2025). Therefore, it is not ideal for determining whether nodal protein clusters are formed before myelin deposition.

      Instead, we performed double immunostaining for MAG and AnkG between E12 and E15 to clarify the temporal relationship between myelin maturation and node formation. The results showed that detectable AnkG clusters along NM axons began to appear very sparsely around E13, coinciding with the emergence of MAG signals, and became more prominent with development. This temporal pattern does not match the definition of pre-nodal clusters, which are formed prior to myelination.

      Although we cannot completely rule out the possibility of undetectable pre-nodal clusters or those composed of molecules other than AnkG, our results support the view that pre-nodal clusters are unlikely to play a major role in determining the regional difference in nodal spacing along NM axons. These new data have been added as Figure 2—figure supplement 1, and the relevant sections in the Results, Discussion, and Figure legend have been revised accordingly (page 5, line 4; page 10, line 7; page 29, line 1).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The authors attempted to clarify the impact of N protein mutations on ribonucleoprotein (RNP) assembly and stability using analytical ultracentrifugation (AUC) and mass photometry (MP). These complementary approaches provide a more comprehensive understanding of the underlying processes. Both SV-AUC and MP results consistently showed enhanced RNP assembly and stability due to N protein mutations.

      The overall research design appears well planned, and the experiments were carefully executed.

      Strengths:

      SV-AUC, performed at higher concentrations (3 µM), captured the hydrodynamic properties of bulk assembled complexes, while MP provided crucial information on dissociation rates and complex lifetimes at nanomolar concentrations. Together, the methods offered detailed insights into association states and dissociation kinetics across a broad concentration range. This represents a thorough application of solution physicochemistry.

      We thank the Reviewer for this positive assessment. 

      Weaknesses:

      Unlike AUC, MP observes only a part of the solution. In MP, bound molecules are accumulated on the glass surface (not dissociated), thus the concentration in solution should change as time develops. How does such concentration change impact the result shown here?

      We agree with the Reviewer that the concentration in solution above the surface will change with time; however, the impact of surface adsorption turns out to be negligible. To show this we have added a calculation as Supplementary Methods that is based on the number of imaged adsorption events, the fraction of imaged area to total surface area, and the initial sample volume and concentration. Under our experimental conditions the reduction is less than 1%, which is well within the range of experimental concentration errors.
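
      The kind of estimate described here can be sketched as simple arithmetic. This is an illustration and not the calculation in the Supplementary Methods: it assumes adsorption is uniform over the wetted surface, and every input value below is a hypothetical placeholder rather than one of the authors' numbers.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def fractional_depletion(n_imaged_events, imaged_area_fraction, volume_l, conc_molar):
    """Fraction of molecules removed from solution by surface adsorption."""
    # Extrapolate imaged events to the whole wetted surface (uniform adsorption assumed)
    total_adsorbed = n_imaged_events / imaged_area_fraction
    # Molecules initially present in the sample
    total_in_solution = conc_molar * volume_l * AVOGADRO
    return total_adsorbed / total_in_solution


# Placeholder inputs (hypothetical, not the authors' values)
depletion = fractional_depletion(
    n_imaged_events=5_000,
    imaged_area_fraction=1e-5,
    volume_l=20e-6,
    conc_molar=10e-9,
)
print(f"fractional depletion: {depletion:.2%}")
```

      The estimate scales inversely with both sample volume and concentration, which is consistent with the point that depletion matters far more at picomolar than at nanomolar concentrations.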

      This is in line with the observation that surface adsorption of proteins to glass is critical and needs to be prevented when working at picomolar concentrations (Zhao H, Mayer ML, Schuck P. 2014. Analysis of protein interactions with picomolar binding affinity by fluorescence-detected sedimentation velocity. Anal Chem 86:3181–3187. doi:10.1021/ac500093m), but is ordinarily negligible when working in the mid-nanomolar concentration range. The difference in the MP experiments is that surface adsorption to glass and plastic, which is usually invisible, is imaged and quantified in MP. The negligible impact of surface adsorption on solution concentration in typical MP experiments is also in line with the results of several studies that have successfully measured dissociation constants of binding equilibria by MP (Young G et al., Science 360 (2018) 432; Wu & Piszczek, Anal Biochem 592 (2020) 113575; Soltermann et al., Angewandte Chemie 59 (2020) 10774) with samples in the 5-50 nM range and a similar experimental setup. It should be noted that in the MP experiments no surface functionalization is employed, in contrast to optical biosensors that utilize surface-immobilized ligands and polymeric matrices and thereby enhance the surface binding capacity.

      Even though this depletion effect is negligible under ordinary MP conditions, the Reviewer raises a good point, and readers may have a similar question about this novel technique. For this reason, we have added in the MP section of the Methods the sentence “In either configuration, the impact of surface binding on the sample concentration is < 1% and negligible, as described in the Supplementary Methods S1.” and added the detailed calculations in the Supplement accordingly. The use of SV as a traditional, orthogonal technique and the observation of consistent results with those of MP should further dispel readers’ methodological concerns on this point.

      Reviewer #2 (Public Review):

      Summary:

      In this manuscript, the authors apply a variety of biophysical and computational techniques to characterize the effects of mutations in the SARS-CoV-2 N protein on the formation of ribonucleoprotein particles (RNPs). They find convergent evolution in multiple repeated independent mutations strengthening binding interfaces, compensating for other mutations that reduce RNP stability but which enhance viral replication.

      Strengths:

      The authors assay the effects of a variety of mutations found in SARS-CoV-2 variants of concern using a variety of approaches, including biophysical characterization of assembly properties of RNPs, combined with computational prediction of the effects of mutations on molecular structures and interactions. The findings of the paper contribute to our increasing understanding of the principles driving viral self-assembly, and increase the foundation for potential future design of therapeutics such as assembly inhibitors.

      Thank you for highlighting the strengths of our paper and the potential impact on future design of therapeutics.

      Weaknesses:

      For the most part, the paper is well-written, the data presented support the claims made, and the arguments are easy to follow. However, I believe that parts of the presentation could be substantially improved. I found portions of the text to be overly long and verbose and likely could be substantially edited; the use of acronyms and initialisms is pervasive, making parts of the exposition laborious to follow; and portions of the figures are too small and difficult to read/understand.

      We are glad the Reviewer concurs the data support our conclusions, and finds the arguments easy to follow.  We appreciate the comment that the work was not optimally presented. To address this point, we have identified multiple opportunities to streamline the text without jeopardizing the clarity. We have also rewritten the end of the Introduction.

      As recommended, we have reduced and harmonized the use of acronyms and abbreviations throughout the text to improve readability. Specifically, we have now spelled out nucleic acid (NA), intrinsically disordered regions (IDR), full-length (FL), AlphaFold (AF3), and variants of concern (VOC).

      Finally, we have improved the presentation of most figures, adding labels and new panels, and increased the label font sizes to facilitate more detailed inspections of the data.

      Reviewer #3 (Public Review):

      This manuscript investigates how mutations in the SARS-CoV-2 nucleocapsid protein (N) alter ribonucleoprotein (RNP) assembly, stability, and viral fitness. The authors focus on mutations such as P13L, G214C, and G215C, combining biophysical assays (SV-AUC, mass photometry, CD spectroscopy, EM), VLP formation, and reverse genetics. They propose that SARS-CoV-2 exploits "fuzzy complex" principles, where distributed weak interfaces in disordered regions allow both stability and plasticity, with measurable consequences for viral replication.

      Strengths:

      (1) The paper demonstrates a comprehensive integration of structural biophysics, peptide/protein assays, VLP systems, and reverse genetics.

      (2) Identification of both de novo (P13L) and stabilizing (G214C/G215C) interfaces provides a mechanistic insight into RNP formation.

      (3) Strong application of the "fuzzy complex" framework to viral assembly, showing how weak/disordered interactions support evolvability, is a significant conceptual advance in viral capsid assembly.

      (4) Overall, the study provides a mechanistic context for mutations that have arisen in major SARS-CoV-2 variants (Omicron, Delta, Lambda) and a mechanistic basis for how mutations influence phenotype via altered biomolecular interactions.

      We are grateful for these comments highlighting this work as a significant conceptual advance.

      Weaknesses:

      (1) The arrangement of N dimers around LRS helices is presented in Figure 1C, but the text concedes that "the arrangement sketched in Figure 1C is not unique" (lines 144-146) and that AF3 modeling attempts yielded "only inconsistent results" (line 149).

      The authors should therefore present the models more cautiously as hypotheses instead. Additional alternative arrangements should be included in the Supplementary Information, so the readers do not over-interpret a single schematic model.

      We agree that in the absence of high-resolution structures the RNP models are hypothetical, and have now emphasized this in the Results, following the Reviewer’s recommendation. To present alternative arrangements that satisfy the biophysical constraints upfront, we have promoted the previous Supplementary Figure 11 showing different models to the first Supplementary Figure, and expanded it with examples of different oligomers. In this way it is referenced early on in the Results and in the legend to Figure 1C. We agree this strengthens the manuscript, as one of the take-home messages is the inherent polydispersity of the RNPs.

      The fact that AF3 can only provide inconsistent results will not come as a surprise, given the substantial disordered regions of the complex, and is a drawback of AF3 rather than our structural model. We slightly emphasized this point so as to clarify that the presentation of the AF3-based RNP structure serves solely as supporting evidence that our hypothetical model is sterically reasonable.

      The new Results paragraph reads:

      “As suggested in the cartoon of Figure 1C, this supports the hypothesis of a three-dimensional arrangement with a central LRS oligomer with symmetry properties and dimensions similar to low resolution EM images of model RNPs (Carlson et al., 2022, 2020) and cryo-ET of RNPs in virions (Klein et al., 2020; Yao et al., 2020).  It should be noted, however, that the arrangement sketched in Figure 1C is not unique and other subunit orientations could be envisioned that satisfy all constraints from experimentally observed binding interfaces, including different oligomers and anti-parallel subunits as illustrated in Supplementary Figure S1. Extending previous ColabFold structural predictions that show multiple N-protein dimers self-assembled via the LRS coiled-coils (Zhao et al., 2023), we attempted the AlphaFold modeling of RNPs combining multiple N dimers with SL7 RNA ligands, mimicking our biophysical assembly model. Current AlphaFold restrictions limit the prediction to pentamers of N-protein dimers with 10 copies of SL7 RNA. While only inconsistent results were obtained – which is not surprising given the large intrinsically disordered regions exceed the predictive power of AlphaFold – some models did produce an overall RNP organization similar to Figure 1C, suggesting such an arrangement is at least sterically reasonable with regard to possible N-protein subunit orientations in an RNP (Supplementary Figure S2)”

      (2) Negative-stained EM fibrils (Figure 2A) and CD spectra (Figure 2B) are presented to argue that P13L promotes β-sheet self-association. However, the claim could benefit from more orthogonal validation of β-sheet self-association. Additional confirmation via FTIR spectra or ThT fluorescence could be used to further distinguish structured β-sheets from amorphous aggregation.

      We completely agree that the application of multiple orthogonal biophysical methods can strengthen the conclusions. In addition to EM fibrils and CD spectra (a classical gold standard technique for protein secondary structure in solution), we already have support from ColabFold modeling, as well as NMR results from the Zweckstetter lab showing the potential for β-sheet-like conformations.

      Furthermore, we believe the evidence for the absence of ‘amorphous aggregates’ is very strong, as this would be inconsistent with the long-range order required to create the visibly fibrillar morphology in EM, and amorphous aggregates would be inconsistent with the increased solution viscosity. In this context, it is also highly relevant that the β-sheet-like secondary structure recorded by CD is concentration-dependent and reversible upon dilution. The long-range spatial order of fibrils is consistent with the formation of secondary structure in solution.

      In addition, it must be kept in mind that what we see is specific to N-arm peptides carrying the P13L mutation (in EM, CD, and structural prediction) and does not occur in the other two N-arm peptides (ancestral N-arm and N-arm with deletion of 31-33), linker peptides, or C-arm peptides.

      Most importantly, as elaborated in more detail below, we do not claim that fibril formation is physiologically relevant. At the heart of this – in the context of the evolution of fuzzy complexes – is that the P13L mutation creates additional weak protein-protein interactions. Indeed, the assembly of fibrils geometrically requires at least two interfaces for each subunit. These weak interactions are at play physiologically in the context of the disordered RNP particles, and in macromolecular condensates, but not in the formation of fibrils. Therefore, while we appreciate the suggestion for FTIR spectra and ThT staining, we are afraid further emphasis on the fibril structure might confuse the reader, and therefore we would rather clarify upfront that these fibrillar assemblies are not thought to form in vivo from full-length protein, but merely demonstrate the presence of N-arm self-association interfaces in the model of truncated peptides.

      Accordingly, we have amended the Results paragraph reporting the fibrils:

      “Thus, the N-arm mutation P13L is responsible for the formation of fibrils in N-arm peptides after prolonged storage. Some of these N-arm fibrils exhibit a twisted morphology with width of ≈5 nm (Figure 2A), in some instances exhibiting patterns of strand breaks. Such fibrils are frequently encountered in proteins that can stack β-sheets, such as in amyloids (Paravastu et al., 2008). While we have not observed fibril formation in the context of full-length N, and have no evidence such fibrils are physiologically relevant, their occurrence in solutions of truncated N-arm peptide nonetheless demonstrates the introduction of ordered N-arm self-association interfaces in conformations of P13L mutants.”

      And more completely summarized experimental evidence prior to describing the ColabFold prediction results (which previously did not include mention of the NMR):

      “Finally, confirming the interpretation of the EM images and the CD data, as well as the β-structure propensity reported from NMR data (Zachrdla et al., 2022), the structural prediction of N[10-20]:P13L in ColabFold displayed oligomers with stacking β-sheets …”

      (3) In the main text, the authors alternate between emphasizing non-covalent effects ("a major effect of the cysteines already arises in reduced conditions without any covalent bonds," line 576) and highlighting "oxidized tetrameric N-proteins of N:G214C and N:G215C can be incorporated into RNPs". Therefore, the biological relevance of disulfide redox chemistry in viral assembly in vivo remains unclear. Discussing cellular redox plausibility and whether the authors' oxidizing conditions are meant as a mechanistic stress test rather than physiological mimicry could improve the interpretation of these results.

      The paper could benefit if the authors provide a summary figure or table contrasting reduced vs. oxidized conditions for G214C/G215C mutants (self-association, oligomerization state, RNP stability). Explicitly discuss whether disulfides are likely to form in infected cells.

      We thank the Reviewer for raising this most interesting point. The reason why the biological relevance of N disulfides remains unclear is simply that this is still unknown, unfortunately. Recently, Kubinski et al. have strongly argued for the formation of disulfides in infected cells, but in our view the evidence remains weak since the majority of disulfide bonds in that work presented as post-lysis artifacts, and it appears the non-covalent effects alone could explain the physiological observations. We aimed for a balanced presentation and wrote in the relevant Results section:

      “Covalent disulfide bonds in the LRS in non-reducing conditions were found to further promote LRS oligomerization. However, there is no conclusive data yet whether covalent bonds in the LRS occur in vivo, or any G215C effect is entirely non-covalent due to the significant strengthening of LRS helix oligomerization (see Discussion).”

      Despite the uncertainty regarding physiological disulfide bond formation, we believe it is useful to ask whether covalently crosslinked N dimers would aid or constrain RNP assembly in our biophysical model. We have now better explained this motivation in the Results section describing the RNP experiments:

      “Even though it is still unclear whether disulfide bonds of N cysteine mutants form in vivo, we were curious about the impact of disulfide-linked oligomers of the cysteine mutants on their RNP structure and stability in our biophysical assembly model.”

      The referenced paragraph from the Discussion reads:

      “Regarding the cysteine mutations that have been repeatedly introduced in the LRS prior to the rise of the Omicron VOCs, it is an open question whether they lead to covalent bonds in vivo or in the VLP assay. While examples of disulfide-linked viral nucleocapsid proteins have been reported (Kubinski et al., 2024; Prokudina et al., 2004; Wootton and Yoo, 2003), a methodological difficulty in their detection is artifactual disulfide bond formation post-lysis of infected cells (Kubinski et al., 2024; Wootton and Yoo, 2003).  However, our results clearly show that a major effect of the cysteines already arises in reduced conditions without any covalent bonds, through extension of the LRS helices, and concomitant redirection of the disordered N-terminal sequence. While oxidized tetrameric N-proteins of N:G214C and N:G215C can be incorporated into RNPs, the covalent bonds provided only marginally improved RNP stability.  Interestingly, the introduction of cysteines imposes preferences of RNP oligomeric states dependent on oxidation state, consistent with our MD simulations highlighting the impact of cysteine orientation of 214C versus 215C relative to the hydrophobic surface of the LRS helices. Overall, considering potentially detrimental structural constraints from covalent bonds on LRS clusters seeding RNPs, energetic penalties on RNP disassembly, as well as the required monomeric state of the LRS helix for interaction with the NSP3 Ubl domain (Bessa et al., 2022), at present it is unclear to what extent the formation of disulfide linkages between LRS helices would be beneficial or detrimental in the viral life cycle.”

      We feel that this text addresses the Reviewer’s comment, and that expanding the existing discussion further would conflict with other recommendations to shorten and focus the text.

      Finally, we have addressed the valuable suggestion of a new table summarizing the oligomeric state and self-association of the different cysteine mutants by inserting a new column in the existing Table 1 reporting all species’ oligomeric state at low micromolar concentrations. In this way they can be compared at a glance with the other mutants as well. A more detailed comparison of the concentration-dependent size-distribution is provided in Figure 4.

      (4) VLP assays (Figure 7) show little enhancement for P13L or G215C alone, whereas Figure 8 shows that P13L provides clear fitness advantages. This discrepancy is acknowledged but not reconciled with any mechanistic or systematic rationale. The authors should consider emphasizing the limitations of VLP assays and the sources of the discrepancy with respect to Figure 8.

      We thank the Reviewer for this comment, which highlights a very important point. 

      For clarification and to improve the cohesion of the manuscript we have inserted a reference to the Discussion after the presentation of the VLP results, which provides a natural transition to the following description of the reverse genetics experiments:

      “As expanded on in the Discussion, the failure to observe enhancement by P13L alone may be related to limitations of the VLP assay in sensitivity, including the restriction to a single round of infection, and protein expression levels.”

      This references a paragraph in the Discussion about the limitations of the VLP assay in general and the reasons we believe the enhancement by P13L alone was not picked up:

      “…While this assay has been widely used for rapid assessment of spike protein and N variants (Syed et al., 2021), it has limitations due to the addition of non-genomic RNA and the lack of double membrane vesicles from which gRNA emerges through the NSP3/NSP4 pore complex potentially poised for packaging (Bessa et al., 2022; Ke et al., 2024; Ni et al., 2023). It should also be recognized that the results do not directly reflect the relative efficiency of RNP assembly only, since protein expression levels, their localization, and their posttranslational modifications are not controlled for. Susceptibility for such factors might be exacerbated with mutations that modulate weak protein interactions. For example, as shown previously (Syed et al., 2024; Zhao et al., 2024), a GSK3 inhibitor inhibiting N-protein phosphorylation significantly enhances VLP formation and eliminates the advantage provided for by the N:G215C mutation relative to the ancestral N – presumably due to an increase in assembly-competent, non-phosphorylated N-protein erasing an affinity advantage. A similar process may be underlying the absent or marginal improvement in VLP readout from the cysteine LRS mutants and P13L at the achieved transfection level in the present work, and the enhanced signal from R203K/G204R and R203M (the latter being consistent with previous reports (Li et al., 2025; Syed et al., 2021)) modulating protein phosphorylation. Nonetheless, mirroring the results of the biophysical in vitro experiments, the addition of RNP-stabilizing P13L and G214C mutations on top of R203K/G204R led to a significantly larger VLP signal.

      The VLP assay may be limited in sensitivity to mutation effects due to its restriction to a single round of infection. To avoid this and other potential limitations of the VLP assay for the study of viral packaging, for the key mutation N:P13L we carried out reverse genetics experiments. These showed the sole N:P13L mutation significantly increases viral fitness (Figure 8).”

      (5) Figures 5 and 6 are dense, and the several overlays make it hard to read. The authors should consider picking the most extreme results to make a point in the main Figure 5 and move the other overlays to the Supplementary. Additionally, annotating MP peaks directly with "2×, 4×, 6× subunits" can help non-experts.

      We completely agree with the Reviewer: these figures were very dense. To mitigate this problem without requiring the reader to switch back and forth to the supplement, we subdivided the panels of Figure 5 and showed only a subset of curves in each. In this way the data are easier to read while remaining readily comparable. It is a large figure, but it contains the key data for the present work and is therefore worthwhile to have in one place. For the MP histogram data we have also inserted the suggested peak labels. Similarly, we have split Figure 6A into two panels for clarity.

      (6) The paper has several names and shorthand notations for the mutants, making it hard to keep up. The authors could include a table that contains mutation keys, with each shorthand (Ancestral, Nο/No, Nλ, etc.) mapped onto exact N mutations (P13L, Δ31-33, R203K/G204R, G214C/G215C, etc.). They could then use the same glyphs (Latin vs Greek) consistently in text and figure labels.

      Yes, we agree this is a problem and we apologize for the confusion. However, it is not possible to refer exclusively to either Latin or Greek terminology, which we feel would be even more detrimental to readability (the former being exhaustively lengthy and the latter being imprecise). Instead, we have used a rational system: if the complete set of mutations of a variant is present, its Greek letter is used as an abbreviation; otherwise we use Latin amino acid/position indicators for individual mutations or combinations thereof. Unfortunately, we previously failed to mention this explicitly, and we are most grateful to the Reviewer for pointing this out.

      We have now rectified this by including upfront the sentence:

      “We will adopt a nomenclature where the complete set of defining mutations of a variant will be referred to by its Greek letter, i.e., N:P13L/R203K/G204R/G214C is N<sub>λ</sub>, and analogously the set of Omicron mutations N:P13L/Δ31-33/R203K/G204R are referred to as N<sub>ο</sub>; see Table 1”

      This will define the two shorthands N<sub>λ</sub> and N<sub>ο</sub> used. Furthermore, as suggested and pointed to in the text, Table 1 does provide the keys to mutation and variants, including the information in which variant any of the other mutations studied here occur.

      (7) The EM fibrils (Figure 2A) and CD spectra (Figure 2B) were collected at mM peptide concentrations. These are far above physiological levels and may encourage non-specific aggregation. Similarly, the authors mention "ultra-weak binding energies that require mM concentrations to significantly populate oligomers". On the other hand, the experiments with full-length protein were performed at concentrations closer to biologically relevant concentrations in the micromolar range. While I appreciate the need to work at high concentrations to detect weak interactions, this raises questions about physiological relevance.

      This is indeed an important point to clarify. We agree that much lower nucleocapsid protein concentrations are present in the cytosol on average, and these were used in our RNP assembly experiments. However, there are at least two important physiologically relevant cases where high local N concentrations do occur:

      (1) Once assembled in RNPs, the disordered N-terminal extensions are locally at a very high concentration within the volume they can explore while tethered to the NTD. A back-of-the-envelope calculation assuming 12 N-protein subunits confining 12 N-terminal extensions to the volume of a single RNP (≈14x14x14 nm<sup>3</sup> by cryoEM; Klein et al., 2020) leads to an effective concentration of 7.4 mM. Obviously the N-arm peptides are not completely free and there will be constraints that would hinder or promote encounter complex probability, but interfaces with mM K<sub>D</sub> are clearly strong enough to populate N-arm/N-arm contacts extending from N-protein in the RNP.

      Additionally, any interaction where N-proteins are brought in close proximity could allow weak N-arm interactions to provide additional stability. Besides the RNP, we demonstrate this in our Results for nucleic-acid liganded N tetramers (Figure 4B), but this might similarly occur in complexes with NSP3 or host proteins. Generally, it is quite common that small additional binding energies play important roles in the modulation of multivalent protein complexes.

      (2) Within the macromolecular condensate the local concentration will be substantially higher than on average within the infected cell.  While we do not know its precise concentration, it is well-established that the sum of many ultra-weak interactions is driving the formation of this dense liquid phase. In our previous eLife paper (Nguyen et al., 2024) we have shown LLPS is suppressed with the R203K/G204R mutation, but it is ‘rescued’ with the additional P13L/del31-33 mutation of the Omicron variant showing strong LLPS. Similarly, LLPS is suppressed by the LRS mutant L222P, but rescued in conjunction with P13L. This is another biologically relevant scenario where weak interactions are critical.
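
The back-of-the-envelope estimate in case (1) can be reproduced in a few lines. The following is only a minimal sketch, assuming an idealized cubic RNP volume with a 14 nm edge and 12 N-arm chains distributed freely within it; the function and variable names are illustrative and do not come from the manuscript.

```python
AVOGADRO = 6.02214076e23  # molecules per mole (exact SI value)


def effective_concentration_mM(n_chains: int, edge_nm: float) -> float:
    """Molar concentration (in mM) of n_chains confined to a cube of edge_nm.

    Assumption: the chains are treated as free solutes in an idealized cube;
    tethering and excluded volume are ignored.
    """
    edge_dm = edge_nm * 1e-8       # 1 nm = 1e-9 m = 1e-8 dm
    volume_l = edge_dm ** 3        # 1 dm^3 = 1 L
    moles = n_chains / AVOGADRO
    return moles / volume_l * 1e3  # convert M to mM


if __name__ == "__main__":
    c = effective_concentration_mM(12, 14.0)
    print(f"effective N-arm concentration: {c:.1f} mM")
```

With these idealized inputs the estimate comes out slightly above 7 mM, in line with the ≈7.4 mM quoted above. The result scales with the inverse cube of the box edge, so the small difference lies well within the uncertainty of the ≈14 nm dimension.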

      We have emphasized these points in the revised manuscript as described below.

      Specifically:

      (a) Could some of the fibril/β-sheet features attributed to P13L (Figure 2A-C) reflect non-specific aggregation at high concentrations rather than bona fide self-association motifs that could play out in biologically relevant scenarios?

      We understand this concern from the experience with proteins that often have limited solubility and tendencies to aggregate, sometimes accompanied by unfolding and driven by hydrophobic interactions, or clustering on the path to LLPS. However, we are struggling to reconcile the picture of non-specific aggregation with the context of our P13L N-arm peptides. The term ‘non-specific aggregation’ implies the idea of amorphous aggregates, which we would contend is inconsistent with the observed geometry of fibrils, which exhibit long-range order. In addition, non-specific aggregation does not lead to increased solution viscosity, which we describe, but fibril formation does. Another connotation of ‘aggregates’ is irreversibility.  However, we find the beta-sheet-like conformation seen at 1 mM becomes significantly more disordered when the same sample is diluted to 0.4 mM peptide. This is consistent with a reversible self-association driven by a conformational change toward ordered secondary structure.

      To highlight the reversibility, we have clarified the description: “Interestingly, diluting the 1 mM sample (solid) to a concentration of 0.4 mM (dashed) reveals a large shift in the far-UV spectra … both indicative of a significant increase of disorder upon dilution. This is consistent with the stabilization of β-sheets in a reversible, strongly cooperative self-association process with an effective K<sub>D</sub> in the high µM to low mM range.”

      We have also inserted a concentration conversion to mg/ml units, which shows even 1 mM of peptides is only ~5 mg/ml, i.e. not excessively high. “While the ancestral N-arm at ≈1 mM (≈4.6 mg/ml) concentrations exhibits CD spectra with a minimum at ≈200 nm typical of disordered conformations (black)”
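
The unit conversion itself is simple arithmetic and can be sketched as follows. The molar mass of roughly 4,600 g/mol used here is an assumption inferred from the quoted pair of values (≈1 mM corresponding to ≈4.6 mg/ml); it is not stated explicitly in the text, and the function name is illustrative.

```python
def mM_to_mg_per_ml(conc_mM: float, molar_mass_g_per_mol: float) -> float:
    """Convert a molar concentration in mM to a mass concentration in mg/ml.

    mmol/L times g/mol gives mg/L; dividing by 1000 gives mg/ml.
    """
    return conc_mM * molar_mass_g_per_mol / 1000.0


if __name__ == "__main__":
    ASSUMED_MOLAR_MASS = 4600.0  # g/mol, inferred from the quoted values
    for conc in (1.0, 0.4):
        mg_ml = mM_to_mg_per_ml(conc, ASSUMED_MOLAR_MASS)
        print(f"{conc} mM -> {mg_ml:.2f} mg/ml")
```

Under the same assumed molar mass, the diluted 0.4 mM sample would correspond to a little under 2 mg/ml.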

      With regard to the question of specificity, we have studied similar N-arm peptides without P13L mutations and with the 31-33 deletion under equivalent conditions. But we observe the reversible self-association, conformational change, and fibril formation only for those containing the P13L mutation, consistent with ColabFold predictions. Neither did we observe fibrils with disordered C-arm peptides.

      How these weak self-association motifs in the N-arm can be physiologically relevant in the context of full-length protein modulating the stability of multi-molecular complexes and enhancing LLPS was outlined above, and further clarified in the manuscript as detailed below.

      (b) How do the authors justify extrapolating from the mM-range peptide behaviors to the crowded but far lower effective concentrations in cells?

      As pointed out above, the key to this question is the local preconcentration as the N-arm peptides are tethered to the rest of the protein in the context of flexible multi-molecular assemblies. Another mechanism to consider is the formation of condensates. The response to the next comment will expand on this.

      The authors should consider adding a dedicated section (either in Methods or Discussion) justifying the use of high concentrations, with estimation of local concentrations in RNPs and how they compare to the in vitro ranges used here. For concentration-dependent phenomena discussed here, it is vital to ensure that the findings are not artefacts of non-physiological peptide aggregation.

      The use of high concentration in biophysical experiments is quite common, for example, in NMR or crystallography, insofar as they elucidate molecular properties. We believe this is obvious; the Reviewer will certainly agree with us, and this does not require further elaboration. The property observed in this case is the existence of specific, weak protein self-association interfaces in the N-arm.

      Our response to the Reviewer’s point 7(a) addresses the distinction between artefactual aggregation and self-association of N-arm peptides. The relevance of these weak protein self-association interfaces in the context of the full-length protein is the second underlying question.

      As we have previously stated in a dedicated Results paragraph:

      “In contrast to the modulation of the coiled-coil LRS interfaces, the de novo creation of the N-arm self-association interface through beta-sheet interactions enabled by P13L cannot be readily observed in full-length N-protein at low µM concentrations. Similar to the ancestral LRS interface, it provides only ultra-weak binding energies that require mM concentrations to significantly populate oligomers. This is fully consistent with the previous observation by SV-AUC that neither N:P13L,Δ31-33 nor N<sub>ο</sub> with the full set of Omicron mutations show any significant higher-order self-association at low µM concentrations, whereas at high local concentrations – as observed in phase-separated droplets – they can modulate and cooperatively enhance self-association processes (Nguyen et al., 2024). (In fact, P13L can substitute for the LRS promoting LLPS, as observed in the rescue of LLPS by N:P13L,Δ31-33/L222P mutants whereas N:L222P LRS-abrogating mutants are deficient in LLPS.) Another process that increases the local concentration of N-arm chains is the tetramerization of full-length N-protein. As described earlier, occupancy of the NA-binding site in the NTD allosterically promotes self-assembly of the LRS into higher oligomers (Zhao et al., 2021). We hypothesized that these oligomers may be cooperatively stabilized by additional N-arm interactions in P13L mutants.”

      To state completely unambiguously why weak interfaces are important, we have followed the Reviewer’s suggestion and added an additional clarification already earlier, at the end of the P13L Results section:

      “While this self-association interface in the P13L N-arm is weak and its direct observation in biophysical experiments requires mM concentrations, which far exceed average intracellular concentration of N, such  weak interactions can become highly relevant physiologically when high local concentrations are prevailing, for example, when the disordered extension is preconcentrated while tethered within macromolecular assemblies as in the RNP, or in macromolecular condensates.”

      Furthermore, we have added early in the Discussion:

      “Even though the solution affinity of the N-arm P13L interface is ultra-weak, the average local concentration of N-arm chains across the RNP volume (in a back-of-the-envelope calculation assuming a ≈14 nm cube (Klein et al., 2020) with a dodecameric N cluster) is ≈7.4 mM, such that disordered N-arm peptides could well create populations of N-arm clusters stabilizing RNPs through this interface.  However, besides the RNP-stabilizing mutants we have also observed unexpected RNP destabilization by the ubiquitous R203K/G204R double mutation, which may be caused by the introduction of additional charges close to the self-association interface in the LRS. In our experiments, this destabilization is more than compensated for by the P13L mutation. (Another scenario where ultra-weak interactions can have a critical impact is in molecular condensates. We previously reported the suppression of LLPS by the R203K/G204R mutation, which is rescued by the additional P13L/Δ31-33 mutation (Nguyen et al., 2024). This is consistent with compensatory weak stabilizing and destabilizing impacts of weak interactions on the RNP observed here.)”

      Reviewer #1 (Recommendations for the Authors):

      In Figure 1B, it is unclear what the orange lines connecting polypeptides represent, as well as the zig-zag orange lines in the N-arm.

      We thank the Reviewer for this comment. We intended this to represent regions of self-association but recognize the patterned background is confusing. We have changed this now to solid-colored backgrounds, and indicated this in the figure legend:

      “Regions of self-association are indicated by shaded backgrounds.”

      Regarding presentation, in Figure 5 (MP), the relationship between mass and oligomer size should be shown more clearly.

      We agree. To this end we have labeled the peaks in the MP histograms in Figure 5 with the oligomeric state of the 2N/2SL7 subunits.

      Reviewer #2 (Recommendations for the Authors):

      I find the science of the paper to be convincing and compellingly supported.

      Thank you for this positive statement.

      My primary complaints are with presentation or minor technical questions that, honestly, primarily arise due to my own ignorance and unfamiliarity with some of the techniques employed.

      My primary issue is with the figures. I find, generally, the text in axes labels, ticks, and legends to be too small to comfortably read. This is particularly true in the CD spectra and other data presented in Figures 1D, 2B, 4, 5, 6, and 8.

      We agree and have increased the font size of all text and labels of the plots in Figures 1, 2, 4, 5, 6, and 8.

      I also found the use of initialisms to be a bit overbearing and inconsistent. For example, the authors repeatedly switch between spelling out "nucleic acid" and the initialism "NA" (which is also never explicitly spelled out in the text). With the already substantial length of the text, my own personal opinion would be to suggest spelling out all initialisms in the interest of making the reading easier.

      This is a valid criticism. To improve the readability, we have followed this advice and systematically spelled out “nucleic acid” instead of using “NA”.  Similarly, we have now written out full-length instead of the abbreviation FL, and omitted the abbreviation IDR for intrinsically disordered regions, as well as VOC for variant of concern, and AF3 for AlphaFold.

      Regarding the reference to mutants, we have now explained upfront the system of Latin and Greek nomenclature we consistently applied.

      “We will adopt a nomenclature where the complete set of defining mutations of a variant will be referred to by its Greek letter, i.e., N:P13L/R203K/G204R/G214C is N<sub>λ</sub>, and analogously the set of Omicron mutations N:P13L/Δ31-33/R203K/G204R are referred to as N<sub>ο</sub>; see Table 1”

      I found the text to be verbose, bordering on overly so; the Introduction is more than two pages long. The section "Enhanced oligomerization of the leucine-rich sequence through cysteine mutations" has two long paragraphs of introduction before the present results are discussed, et cetera. An (admittedly, very rough) estimation of the length of the paper places it at ~9,000-10,000 words long, and I think that the presentation might benefit from significant editing and shortening.

      We agree the manuscript is longer than would be desirable, and we generally prefer not to insert mini-introductions into Results sections. On the other hand, in order to make a solid contribution to understanding the big picture of fuzzy complexes in molecular evolution of RNA virus proteins it is indispensable to go into the details of RNP assembly and several of the interfaces. Therefore, we feel the length is in the range that it needs to be without losing clarity. In addition, other Reviewer suggestions to extend the discussion, for example, of limitations of VLP assays and the in vivo state of cysteines, conflict with significant shortening.

      In the particular case of the cysteine mutations, cited by the Reviewer, we believe it is important to add detailed background on G215C, because the Results proceed in a comparison of the self-association mode between G215C and G214C. This is of significant interest in the present context not only for the independent introduction of interface-enhancing mutations highlighting the evolution of fuzzy complexes, but also because it illustrates the pleomorphic ability of RNPs.

      Nonetheless, we have slightly shortened this text and merged the background into a single paragraph. More generally, we have critically reread the text to remove tangential sentences where possible and to make it more concise.

      I have a few more specific comments.

      In Figure 1A, I suggest explicitly labeling the location of the LRS, as it comes up repeatedly.

      Yes, we thank the Reviewer for this suggestion and have introduced this label in Figure 1A.

      In Figure 1B, the legend indicates that the red lines indicate "new inter-dimer interactions." However, these red lines are overlayed on a vertical stripe of red squiggles; it is unclear to me and not explicitly described in the legend what these squiggles are meant to illustrate.

      We agree this background was confusing. As mentioned in our Response to Reviewer #1 we have replaced the structured background with a solid background and explained in the figure legend that these areas depict regions of self-association.

      On lines 44-45, the authors state, "The IDRs amount to 45%, ..." 45% of what?

      Thank you, this was unclear.  We have now clarified “The IDRs amount to ≈45% of total residues”

      In lines 244-246, the authors compare the sizes of complexes in reducing versus non-reducing conditions as measured by dynamic light scattering, stating, "However, dynamic light scattering (DLS) revealed the presence of N210-246:G214C complexes with hydrodynamic radii ranging from 6 to 40 nm (in comparison to 1-2 nm for N210-246:G215C (Zhao et al., 2022)) in reducing conditions, and slightly larger in non-reducing conditions (Supplementary Figure S4)." Using this single statistic seems to me to be a less-than-ideal way of characterizing what seems to me to be happening here. In Supplementary Figure 4, it appears to me that what is happening is that in non-reduced conditions, the sample is monodisperse, whereas in reducing conditions, the distribution becomes polydisperse/bimodal, with two clearly separate populations. I feel that this could use a more thorough description rather than just stating the overall range of particle sizes.

      Yes, the Reviewer is correct – it is indeed a good idea to be more precise here. To this end we have carried out cumulant analyses on the autocorrelation functions, as a time-honored method to quantify the polydispersity.  Both samples are polydisperse, but more so in reducing conditions. We have now added “For N210-246:G214C a cumulant analysis results in radii of 8.8 nm and 10.6 nm and polydispersity indices of 0.40 and 0.35 for reducing and non-reducing conditions, respectively”
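
For readers unfamiliar with the method, a second-order cumulant analysis can be sketched as below. This is a generic illustration and not the software or exact procedure used for the manuscript: it assumes a normalized field autocorrelation function g1(τ) as input, fits ln g1(τ) ≈ a0 − Γτ + μ2τ²/2, and reports the polydispersity index μ2/Γ². The Stokes-Einstein conversion uses water at 20 °C as a default, and all function names, the temperature, and the viscosity are assumptions made for illustration.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K


def fit_quadratic(x, y):
    """Least-squares fit y = a0 + a1*x + a2*x**2 via the normal equations."""
    s = [sum(xi ** k for xi in x) for k in range(5)]
    t = [sum(yi * xi ** k for xi, yi in zip(x, y)) for k in range(3)]
    # Augmented 3x3 system, solved by Gauss-Jordan elimination with pivoting.
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def cumulant_analysis(tau, g1):
    """Return (mean decay rate Gamma in 1/s, polydispersity index mu2/Gamma^2)."""
    scale = max(tau)                 # rescale lag times for numerical conditioning
    x = [t / scale for t in tau]
    y = [math.log(g) for g in g1]
    _, a1, a2 = fit_quadratic(x, y)  # ln g1 = a0 - Gamma*tau + mu2*tau^2/2
    gamma = -a1 / scale
    mu2 = 2.0 * a2 / scale ** 2
    return gamma, mu2 / gamma ** 2


def hydrodynamic_radius(gamma, q, temp_k=293.15, viscosity=1.002e-3):
    """Stokes-Einstein radius (m) from decay rate gamma (1/s) and scattering vector q (1/m).

    Defaults assume water at 20 degrees C (viscosity in Pa*s).
    """
    diffusion = gamma / q ** 2       # translational diffusion coefficient, m^2/s
    return K_B * temp_k / (6.0 * math.pi * viscosity * diffusion)
```

In practice the fit is usually restricted to the early part of the decay, where the truncated expansion holds; a larger polydispersity index indicates a broader distribution of decay rates and hence of particle sizes.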

      Finally, I have one remaining comment that is a result of my own inexperience with circular dichroism and interpreting the spectra. For me personally, I would appreciate a more thorough description/illustration of the statistics involved in the CD spectra, but perhaps this is not necessary for people who are more familiar with interpreting these kinds of data. For example, in Figure 1D, it is not clear to me what the error bars/confidence intervals for the CD data look like. I see many squiggles, some of which the authors claim are significant (e.g., the differences between ~215 - 230 nm), and others are not worthy of comment. Let's say, for example, that I fit a smoothed spline through these data and then measure the magnitude of the fluctuations from that spline to define/quantify confidence intervals. What does that distribution look like? Or maybe the confidence intervals are so small that all squiggles are significant?

      Thank you, this is a good question. As mentioned in the methods section, the CD spectra shown are averages of triplicate scans. Therefore, it is straightforward to extract the standard deviation at each wavelength from the three measurements (although a spline would probably work just as well). The values are what one would expect for the squiggles to be random noise. In the region 215 – 220 nm characteristic for helical secondary structure the standard deviations are small relative to the separation between curves, which indicates that the differences are highly significant. Naturally, the curves do overlap in other spectral regions, which would make a plot including the wavelength-dependent error bars or confidence bands too crowded. Therefore, we have kept the plot of the averaged triplicate scans, but have now provided the average standard deviations for all species in the figure legend and mentioned their significant separation:

      “Triplicate scans yield average standard deviations of 0.13 (N), 0.17 (N+SL7), 0.16 (N<sub>λ</sub>), and 0.21 (N<sub>λ</sub>+SL7) × 10<sup>3</sup> deg cm<sup>2</sup>/dmol, respectively, with non-overlapping confidence bands for the different species, for example, between 215-220 nm.”
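
The statistic described here can be illustrated with a short sketch: the sample standard deviation across replicate scans is computed at each wavelength and then averaged, and two species are compared by checking whether their mean ± SD bands overlap. The data layout (one list of ellipticity values per scan, all on the same wavelength grid), the one-SD band width, and the function names are assumptions made for illustration, not a description of the analysis pipeline actually used.

```python
from statistics import mean, stdev


def per_wavelength_sd(scans):
    """Sample standard deviation across replicate scans at each wavelength."""
    return [stdev(values) for values in zip(*scans)]


def average_sd(scans):
    """Average of the per-wavelength standard deviations for one species."""
    return mean(per_wavelength_sd(scans))


def bands_separated(scans_a, scans_b, n_sd=1.0):
    """For each wavelength, True if the mean +/- n_sd*SD bands of two species do not overlap."""
    separated = []
    for a, b in zip(zip(*scans_a), zip(*scans_b)):
        gap = abs(mean(a) - mean(b))
        separated.append(gap > n_sd * (stdev(a) + stdev(b)))
    return separated
```

Restricting `bands_separated` to the grid points between 215 and 220 nm would give the kind of region-specific comparison referred to in the legend text.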

      Reviewer #3 (Recommendations for the Authors):

      (1) The Discussion reiterates much of the background (mutational tolerance, fuzziness, SLiMs) already covered in the Introduction, diluting focus on the key new findings. The authors should consider shortening and refocusing the discussion on the main contributions in light of existing knowledge of viral assembly.

      In the Introduction we have provided background on intrinsically disordered proteins in general and their mutational tolerance, as well as the concept of fuzzy complexes. The first several paragraphs of the Discussion have a different focus, which is protein binding interfaces between viral proteins (obviously key in fuzzy complexes), specifically their modulation and the remarkable de novo introduction of binding interfaces. We believe this deserves emphasis, since this highlights a novel aspect of fuzziness, namely the capacity of the mutant spectrum of RNA viruses to encode a range of assembly stabilities and architectures.

      To reduce redundancy between the end of the Introduction and the beginning of the Discussion, we have shortened the last paragraph of the Introduction and removed its preview of the conclusions, as described in the response to the next comment of the Reviewer (see below).

      Unfortunately, the length of the Discussion is dictated in part also by the need to discuss methodological aspects, among them the limitations of VLP assays and the redox state of the cysteine in the LRS mutants, which were important points raised in other Reviewer comments. Similarly, we believe the discussion of other potential functions of Omicron N-arm mutations is warranted, as well as the background of the R203K/G204R double mutation that has attracted significant attention in the field due to its effects on phosphorylation and expression of truncated N species that also form RNPs. Our goal was to integrate our results and those of other laboratories regarding specific mutation effects into a comprehensive picture of molecular evolution of N, which we believe the framework of fuzzy complexes can provide.

      (2) The Abstract and early Introduction set a broad stage (IDPs, fuzziness), but don't explicitly state the concrete hypotheses that the experiments test. Please add 2-3 sentences in the Introduction that enumerate testable hypotheses, e.g.:

      (a) P13L creates a new N-arm interface that increases RNP stability.

      (b) G214C/G215C strengthens LRS oligomerization to stabilize higher-order N assemblies.

      We agree the introduction can be improved.  However, it seems to us that it cannot be neatly framed in the hypothesis – answer dichotomy, without losing a lot of nuances and without requiring an even longer and more detailed introduction.

      One of the main questions is to test whether the framework of fuzzy complexes can be applied to understand molecular evolution of N, and we feel the introduction is already flowing well towards this:

      “ … In fuzzy complexes the total binding energy is distributed into multiple distinct ultra-weak interaction sites (Olsen et al., 2017). Similar to individual RNA virus proteins with loose or absent structure, maintaining disorder and a spatial distribution of low-energy interactions in the protein complexes may increase the tolerance for mutations and improve evolvability of protein complexes.

      The unprecedented worldwide sequencing effort of SARS-CoV-2 genomes during its rapid evolution in humans provides a unique opportunity to examine these concepts. ...”

      To bring this to a more concrete set of questions in the end, we have shortened and rewritten the last paragraph in the Introduction:

      “To examine how architecture and energetics of RNP assemblies can be impacted by N-protein mutations we study a panel of N-proteins derived from ancestral Wuhan-Hu-1 and different VOCs, including Alpha, Delta, Lambda, and Omicron (see Table 1), in biophysical experiments, VLP assays, and mutant virus. Specifically, we ask how the RNP size distribution and life-time is modulated by: (1) the novel binding interface created by the P13L mutation of Omicron; (2) enhancements of other weak self-association interfaces through G215C of Delta and G214C of Lambda; (3) the ubiquitous R203K/G204R double mutation of Alpha, Lambda, and Omicron.  We also test whether the P13L mutation improves viral fitness, similar to G215C and R203K/G204R. The results are discussed in the framework of fuzzy complexes and molecular evolution of N in the course of viral adaptation to the human host. Understanding the salient features of the binding interfaces in viral assembly and their evolution expands our foundation for the design of therapeutics such as assembly inhibitors.”

    1. Author response:

      The following is the authors’ response to the previous reviews.

      eLife Assessment:

      Glioblastoma is one of the most aggressive cancers without a cure. Glioblastoma cells are known to have high mitochondrial potential. This useful study demonstrates the critical role of the ribosome-associated quality control (RQC) pathway in regulating mitochondrial membrane potential and glioblastoma growth. Some assays are incomplete; further revision will improve the significance of this study.

      For clarity, we propose revising the second sentence to: "It is well-established that certain cancer cells, such as glioblastoma cells, exhibit elevated mitochondrial membrane potential."

      Reviewer #1 (Public Review):

      Summary:

      Cai et al have investigated the role of msiCAT-tailed mitochondrial proteins that frequently exist in glioblastoma stem cells. Overexpression of msiCAT-tailed mitochondrial ATP synthase F1 subunit alpha (ATP5) protein increases the mitochondrial membrane potential and blocks mitochondrial permeability transition pore formation/opening. These changes in mitochondrial properties provide resistance to staurosporine (STS)-induced apoptosis in GBM cells. Therefore, msiCAT-tailing can promote cell survival and migration, while genetic and pharmacological inhibition of msiCAT-tailing can prevent the overgrowth of GBM cells.

      Strengths:

      The CAT-tailing concept has not been explored in cancer settings. Therefore, the present study provides new insights for widening the therapeutic avenue.

      Your acknowledgment of our study's pioneering elements is greatly appreciated.

      Weaknesses:

      Although the paper does have strengths in principle, the weaknesses of the paper are that these strengths are not directly demonstrated. The conclusions of this paper are mostly well-supported by data, but some aspects of image acquisition and data analysis need to be clarified and extended.

      We are grateful for your acknowledgment of our study’s innovative approach and its possible influence on cancer therapy. We sincerely appreciate your valuable feedback. In response, this updated manuscript presents substantial new findings that reinforce our central argument. Moreover, we have broadened our data analysis and interpretation, as well as refined our methodological descriptions.

      Reviewer #2 (Public Review):

      This work explores the connection between glioblastoma, mito-RQC, and msiCAT-tailing. They build upon previous work concluding that ATP5alpha is CAT-tailed and explore how CAT-tailing may affect cell physiology and sensitivity to chemotherapy. The authors conclude that when ATP5alpha is CAT-tailed, it either incorporates into the proton pump or aggregates and that these events dysregulate MPTP opening and mitochondrial membrane potential and that this regulates drug sensitivity. This work includes several intriguing and novel observations connecting cell physiology, RQC, and drug sensitivity. This is also the first time this reviewer has seen an investigation of how a CAT tail may specifically affect the function of a protein. However, some of the conclusions in this work are not well supported. This significantly weakens the work but can be addressed through further experiments or by weakening the text.

      We appreciate the recognition of our study's novelty. To address your concerns about our conclusions, we have revised the manuscript. This revision includes new data and corrections of identified issues. Our detailed responses to your specific points are outlined below.

      Reviewer #1 (Recommendations For The Authors):

      (1) In Figure 1B, please replace the high-exposure blots of ATP5 and COX with representative results. The current results are difficult to interpret clearly. Additionally, it would be helpful if the author could explain the nature of the two different bands in NEMF and ANKZF1. Did the authors also examine other RQC factors and mitochondrial ETC proteins? I'm also curious to understand why CAT-tailing is specific to C-I30, ATP5, and COX-V, and why the authors did not show the significance of COX-V.

      We appreciate your inquiry regarding the data.  Additional attempts were made using new patient-derived samples; however, these results did not improve upon the existing ATP5⍺, (NDUS3)C-I30, and COX4 signals presented in the figure.  This is possibly due to the fact that CAT-tail modified mitochondrial proteins represent only a small fraction of the total proteins in these cells.  It is acknowledged that the small tails visible above the prominent main bands are not particularly distinct. To address this, the revised version includes updated images to better illustrate the differences. We believe the assertion that GBM/GSCs possess CAT-tailed proteins is substantiated by a combination of subsequent experimental findings. The figure (refer to new Fig. 1B) serves primarily as an introduction. It is important to note that the CAT-tailed ATP5⍺ plays a vital role in modulating mitochondrial potential and glioma phenotypes, a function which has been demonstrated through subsequent experiments.

      It is acknowledged that the CAT-tail modification is not exclusive to the ATP5⍺ protein. ATP5⍺ was selected as the primary focus of this study due to its prevalence in mitochondria and its specific involvement in cancer development, as noted by Chang YW et al. Future research will explore the possibility of CAT tails on other mitochondrial ETC proteins. Currently, NDUS3 (C-I30), ATP5⍺, and COX4 serve as examples confirming the existence of these modifications. It remains challenging to detect endogenous CAT-tailing, and bulk proteomics is not yet feasible for this purpose. COX4 is considered significant. We hypothesize that CAT-tailed COX4 may function similarly to the previously studied C-I30 (Wu Z, et al), potentially causing substantial mitochondrial proteostasis stress.

      Concerning RQC proteins, our blotting analysis of GBM cell lines now includes additional RQC-related factors. The primary, more prominent bands (indicated by arrowheads) are, in our assessment, the intended bands for NEMF and ANKZF1.  Subsequent blotting analyses showed only single bands for both ANKZF1 and NEMF, respectively. The additional, larger molecular weight band of NEMF, which was initially considered for property analysis (phosphorylation, ubiquitination, etc.), was not examined further as it did not appear in subsequent experiments (refer to new Fig. S1C).

      References:

      Chang YW, et al. Spatial and temporal dynamics of ATP synthase from mitochondria toward the cell surface. Communications biology. 2023;6(1).

      Wu Z, et al. MISTERMINATE Mechanistically Links Mitochondrial Dysfunction With Proteostasis Failure. Molecular cell. 2019;75(4).

      (2) In addition to Figure 1B, it would be interesting to explore CAT-tailed mETC proteins in cancer tissue samples.

      This is an excellent point, and we appreciate the question. We conducted staining for ATP5⍺ and key RQC proteins in both tumor and normal mouse tissues. Notably, ATP5⍺ in GBM exhibited a greater tendency to form clustered punctate patterns compared to normal brain tissue, and not all of it co-localized with the mitochondrial marker TOM20 (refer to new Fig. S3C-E). Crucially, we observed a significant increase in NEMF expression within mouse xenograft tumor tissues, alongside a decrease in ANKZF1 expression (refer to new Fig. S1A, B). These findings align with our observations in human samples.

      (3) Please knock down ATP5 in the patient's cells and check whether both the upper band and lower band of ATP5 have disappeared or not.

      This control was essential and has been executed now. To validate the antibody's specificity, siRNA knockdown was performed. The simultaneous elimination of both upper and lower bands upon siRNA treatment (refer to new Fig. S2A) confirms they represent genuine signals recognized by the antibody.

      (4) In Figure 1C and 1D, add long exposure to spot aggregation and oligomer. Figure 1D, please add the blots where control and ATP5 are also shown in NHA and SF (similar to SVG and GSC827).

      New data are included in the revised manuscript to address the queries. Specifically, the new Fig 1D now displays the full queue as requested, featuring blots for Control, ATP5α, AT3, and AT20. Our analysis reveals that AT20 aggregates exhibit higher expression and accumulation rates in GSC and SF cells.

      Fig. 1C has been updated to include experimental groups treated with cycloheximide and sgNEMF. Our results show that sgNEMF effectively inhibits CAT-tailing in GBM cell lines, whereas cycloheximide has no impact. After consulting with the Reporter's original creator and optimizing expression conditions, we observed no significant aggregates with β-globin-non-stop protein, potentially due to the length of endogenous CAT-tail formation (as noted by Inada, 2020, in Cell Reports). Our analysis focused on the ratio of CAT-tailed (red box blots) and non-CAT-tailed proteins (green box blots). Comparing these ratios revealed that both anisomycin treatment and sgNEMF effectively hinder the CAT-tailing process, while cycloheximide has no effect.

      (5) In Figure 1E, please double-check the results with the figure legend. ATP5A aggregated should be shown endogenously. The number of aggregates shown in the bar graph is not represented in micrographs. Please replace the images. For Figure 1E, to confirm the ATP5-specific aggregates, it would be better if the authors would show endogenous immunostaining of C-I30 and Cox-IV.

      Labels in Fig. 1E were corrected to reflect that the bar graph in Fig. 1F indicates the number of cells with aggregates, not the quantity of aggregates per cell. The presence

      (6) Figure 3A. Please add representative images in the anisomycin sections. It is difficult to address the difference.

      We appreciate your feedback. Upon re-examining the Calcein fluorescence intensity data in Fig. 3A, we believe the images accurately represent the statistical variations presented in Fig. 3B. To address your concerns more effectively, please specify which signals in Fig. 3A you find potentially misleading. We are prepared to revise or substitute those images accordingly.

      (7) Figure 3D. If NEMF is overexpressed, is the CAT-tailing of ATP 5 reversed?

      Thank you. Your prediction aligns with our findings. We've added data to the revised Fig. S6A, B, which demonstrates that both NEMF overexpression and ANKZF1 knockdown lead to elevated levels of CRC. This increase, however, was not statistically significant in GSC cells. A plausible explanation for this discrepancy is that the MPTP of GSC cells is already closed, thus any additional increase in CAT-tailing activity does not result in further amplification.

      (8) Figure 3G. Why on the BN page are AT20 aggregates not the same as shown in Figure 2E?

      We appreciate your inquiry regarding the ATP5⍺ blots, specifically those in the original Fig. 3G (left) and 2E (right). Careful observation of the ATP5⍺ band placement in these figures reveals a high degree of similarity. Notably, there are aggregates present at the top, and the diffuse signals extend downwards. Given that this is a gradient polyacrylamide native PAGE, the concentration diminishes towards the top. Consequently, the non-rigid nature of the Blue Native PAGE gel may lead to slight variations in the aggregate signals; however, the overall patterns are very much alike. To mitigate potential misinterpretations, we have rearranged the blot order in the new Fig. 3M.

      (9) Figure 4D. The amount of aggregation mediated by AT20 is more compared to AT3. Why are there no such drastic effects observed between AT3 and AT20 in the TUNEL assay?

      The previous Figure 4D presents the quantification of cell migration from the experiment depicted in Figure 4C. But this is a good point. TUNEL staining results are directly influenced by mitochondrial membrane potential and the state of mitochondrial permeability transition pores (MPTP), not by the degree of protein aggregation. Our previous experiments showed comparable effects of AT3 and AT20 on mitochondria (Fig. 2E, 3K), which aligns with the expected similar outcomes on TUNEL staining. As for its biological nature, this could be very complicated. We hope to explore it in future studies.

      (10) Figure 5C: The role of NEMF and ANKZF1 can be further clarified by conducting Annexin-PI assays using FACS. The inclusion of these additional data points will provide more robust evidence for CAT-tailing's role in cancer cells.

      In response to your suggestion, we have incorporated additional data into the revised version. Using the Annexin-PI kit, we labeled apoptotic cells and detected them using flow cytometry (FACS). Our findings indicate that anisomycin pretreatment, NEMF knockdown (sgNEMF), and ANKZF1 upregulation (oeANKZF1) significantly increase the rate of STS-induced apoptosis compared to the control group (refer to new Fig. S9D-G).

      (11) Figure 5F: STS is a known apoptosis inhibitor. Why it is not showing PARP cleavage? Also, cell death analysis would be more pronounced, if it could be shown at a later time point. What is the STS and Anisomycin at 24h or 48h time-point? Since PARP is cleaved, it would also be better if the authors could include caspase blots.

      I guess what you meant to say here is "Staurosporine is a protein kinase inhibitor that can induce apoptosis in multiple mammalian cell lines." Our study observed PARP cleavage even in GSCs, which are typically more resistant to staurosporine-induced apoptosis (C-PARP in Fig. S9B). The ratio of C-PARP to total PARP increased. We selected a 180-minute treatment duration because longer treatments with STS + anisomycin led to a late stage of apoptosis and non-specific protein degradation (e.g., at 24 or 48 hours), making PARP comparisons less meaningful. Following your suggestion, we also examined caspase 3/7 activity in GSC cells treated with DMSO, CHX, and anisomycin. We found that anisomycin treatment also activated caspases (Fig. S9A).

      (12) In Figure 5, the addition of an explanation, how CAT-tailing can induce cell death, would add more information such as BAX-BCL2 ratio, and cytochrome-c release from the mitochondria.

      Thank you for your suggestion. In this study, we state that specific CAT-tails inhibit GSC cell death/apoptosis rather than inducing it. Therefore, we do not expect that examining BAX-BCL2 and mitochondrial cytochrome c release would offer additional insights.

      (13) To confirm the STS resistance, it would be better if the author could do the experiments in the STS-resistant cell line and then perform the Anisomycin experiments.

      Thank you. We should emphasize that our data primarily originates from GSC cells. These cells already exhibit STS-resistance when compared to the control cells (Fig. S8A-C).

      (14) It would be more advantageous if the author could show ATP5 CAT-tailed status under standard chemotherapy conditions in either cell lines or in vivo conditions.

      This is an interesting question that is worth exploring; however, GSC cells exhibit strong resistance to standard chemotherapy treatments like temozolomide (TMZ). Additionally, we couldn't detect changes in CAT-tailed ATP5⍺ and thus did not include that data.

      (15) In vivo (cancer mouse model or cancer fly model) data will add more weight to the story.

      We appreciate your intriguing question. An effective approach would be to test the RQC pathway's function using the Drosophila Notch overexpression-induced brain tumor model. However, Khaket et al. have conducted similar studies, stating, "The RNAi of Clbn, VCP, and Listerin (Ltn), homologs of key components of the yeast RQC machinery, all attenuated NSC over-proliferation induced by Notch OE (Figs. 5A and S5A–D, G)." This data supports our theory, and we have incorporated it into the Discussion. While the mouse model more closely resembles the clinical setting, it is not covered by our current IACUC proposal. We intend to verify this hypothesis in a future study.

      Reference:

      Khaket TP, Rimal S, Wang X, Bhurtel S, Wu YC, Lu B. Ribosome stalling during c-myc translation presents actionable cancer cell vulnerability. PNAS Nexus. 2024 Aug 13;3(8):pgae321.

      Reviewer #2 (Recommendations For The Authors):

      Figure 1B, C: To demonstrate that Globin, ATP5alpha, and C-I30 are CAT-tailed, it is necessary to show that the high mobility band disappears after NEMF deletion or mutagenesis of the NFACT domain of NEMF. This can be done in a cell line. The anisomycin experiment is not convincing because the intensity of the bands drops and because no control is done to show that the effects are not due to translation inhibition (e.g. cycloheximide, which inhibits translation but not CAT tailing). Establishing ATP5alpha as a bona fide RQC substrate and CAT-tailed protein is critical to the relevance of the rest of the paper.

      Thank you for suggesting this crucial control experiment. To confirm the observed signal is indeed a bona fide CAT-tail, it's essential to demonstrate that NEMF is necessary for the CAT-tailing process. We have incorporated data from NEMF knockdown (sgNEMF) and cycloheximide treatment into the revised manuscript. Our findings show that both sgNEMF and anisomycin treatment effectively inhibit the formation of CAT-tailing signals on the reporter protein (Fig. 1C). Similarly, NEMF knockdown in a GSC cell line also effectively eliminated CAT-tails on overexpressed ATP5⍺ (Fig. S2B).

      In general, the text should be weakened to reflect that conclusions were largely gleaned from artificial CAT tails made of AT repeats rather than endogenously CAT-tailed ATP5alpha. CAT tails could have other sequences or be made of pure alanine, as has been suggested by some studies.

      Thank you for your reminder. We have reviewed the recent studies by Khan et al. and Chang et al., and we found their analysis of CAT tail components to be highly insightful. We concur with your suggestion regarding the design of the CAT tail sequence. We aimed to design a tail that maintained stability and resisted rapid degradation, regardless of its length. In the revised version, we clarify that our conclusions are based on artificial CAT tails, specifically those composed of AT repeat sequences (p. 9). We acknowledge that the presence of other sequence components may lead to different outcomes (p. 19).

      Reference:

      Khan D, Vinayak AA, Sitron CS, Brandman O. Mechanochemical forces regulate the composition and fate of stalled nascent chains. bioRxiv [Preprint]. 2024 Oct 14:2024.08.02.606406.

      Chang WD, Yoon MJ, Yeo KH, Choe YJ. Threonine-rich carboxyl-terminal extension drives aggregation of stalled polypeptides. Mol Cell. 2024 Nov 21;84(22):4334-4349.e7.

      Throughout the work (e.g. 3B, C), anisomycin effects should be compared to those with cycloheximide to observe if the effects are specific to a CAT tail inhibitor rather than a translation inhibitor.

      We agree that including cycloheximide control experiments is crucial. The revised version now incorporates new data, as depicted in Fig. S5A, B, illustrating alterations in the on/off state of MPTP following cycloheximide treatment. Furthermore, Fig. S6A, B present changes in Calcium Retention Capacity (CRC) under cycloheximide treatment. The consistency of results across these experiments, despite cycloheximide treatment, suggests that anisomycin's role is specifically as a CAT tail inhibitor, rather than a translation inhibitor.

      Line 110, it is unclear what "short-tailed ATP5" is. Do you mean ATP5alpha-AT3? If so this needs to be introduced properly. Line 132: should say "may indicate accumulation of CAT-tailed protein" rather than "imply".

      We acknowledge your points. We have clarified that the "short-tailed ATP5α" refers to ATP5α-AT3 and incorporated the requested changes into the revised manuscript.

      Figure 1C: how big are those potential CAT-tails (need to be verified as mentioned earlier)? They look gigantic. Include a ladder.

      In the revised Fig. 1D, molecular weight markers have been included to denote signal sizes. The aggregates in the previous Fig. 1C, also present in the control plasmid, are likely a result of signal overexposure. The CAT-tailed protein is observed just above the intended band in these blots. These aggregates have been re-presented in the updated figures, and their signal intensities quantified.

      Line 170: "indicating that GBM cells have more capability to deal with protein aggregation". This logic is unclear. Please explain.

      We appreciate your question and have thoroughly re-evaluated our conclusion. We offer several potential explanations for the data presented in Fig. 1D: (1) ATP5α-AT20 may demonstrate superior stability. (2) GSC (GBM) cells might lack adequate mechanisms to monitor protein accumulation. (3) GSC (GBM) cells could possess an increased adaptive capacity to the toxicity arising from protein accumulation. This discussion has been incorporated into the revised manuscript (lines 166-169).

      Line 177: how do you know the endogenous ATP5alpha forms aggregates due to CAT-tailing? Need to measure in a NEMF hypomorph.

      We understand your concern and have addressed it. Revised Fig. 3G, H demonstrates that a reduction in NEMF levels, achieved through sgNEMF in GSC cells, significantly diminishes ATP5α aggregation. This, in conjunction with the Anisomycin treatment data presented in revised Fig. 3E, F, confirms the substantial impact of the CAT-tailing process on this aggregation.

      Line 218: really need a cycloheximide or NEMF hypomorph control to show this specific to CAT-tailing.

      We have revised the manuscript to include data from sgNEMF and cycloheximide treatments, specifically Fig. 3G, H, and Fig. S5C, D, as detailed in our response above.

      Lines 249, 266, Figure 5A: The mentioned experiments would benefit from controls including an extension of ATP5alpha that was not alanine and threonine, perhaps a gly-ser linker, as well as an NEMF hypomorph.

      We sincerely appreciate your insightful comments. In response, the revised manuscript now incorporates control data for ATP5α featuring a poly-glycine-serine (GS) tail. This data is specifically presented in Figs. S2E-G, S4E, S7A, D, E, and S8F, G. Our experimental findings consistently demonstrate that the overexpression of ATP5α, when modified with GS tails, had no discernible impact on protein aggregation, mitochondrial membrane potential, GSC cell mobility, or any other indicators assessed in our study.

      Figure S5A should be part of the main figures and not in the supplement.

      This has been moved to the main figure (Fig. 5C).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):  

      From my reading, this study aimed to achieve two things:  

      (1) A neurally-informed account of how Pieron's and Fechner's laws can apply in concert at distinct processing levels.  

      (2) A comprehensive map in time and space of all neural events intervening between stimulus and response in an immediately-reported perceptual decision.  

      I believe that the authors achieved the first point, mainly owing to a clever contrast comparison paradigm, but with good help also from a new topographic parsing algorithm they created. With this, they found that the time intervening between an early initial sensory evoked potential and an "N2" type process associated with launching the decision process varies inversely with contrast according to Pieron's law. Meanwhile, the interval from that second event up to a neural event peaking just before response increases with contrast, fitting Fechner's law, and a very nice finding is that a diffusion model whose drift rates are scaled by Fechner's law, fit to RT, predicts the observed proportion of correct responses very well. These are all strengths of the study.   
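
      The two laws summarized above can be made concrete with a minimal sketch. It assumes the textbook forms of Piéron's law (time falls as a power function of intensity) and Fechner's law (perceived intensity grows with the logarithm of intensity), plus the closed-form accuracy and mean decision time of an unbiased Wiener diffusion. All function names and parameter values (`t0`, `k`, `beta`, `delta`, `gain`, `boundary`, `noise`) are hypothetical choices for illustration and are not the fitted values or the model code of the study.

```python
import math


def pieron_time(contrast, t0=0.05, k=0.02, beta=0.5):
    """Pieron's law: a time that decreases as a power function of contrast."""
    return t0 + k * contrast ** (-beta)


def fechner_drift(contrast, delta=0.05, gain=10.0):
    """Drift rate scaled by Fechner's law.

    Perceived contrast is taken as log(contrast), so the perceived difference
    between two gratings separated by a fixed physical delta shrinks as the
    overall contrast increases.
    """
    return gain * (math.log(contrast + delta) - math.log(contrast))


def p_correct(drift, boundary=1.0, noise=1.0):
    """Probability of hitting the correct bound, starting midway between bounds."""
    return 1.0 / (1.0 + math.exp(-drift * boundary / noise ** 2))


def mean_decision_time(drift, boundary=1.0, noise=1.0):
    """Mean first-passage time of the same unbiased diffusion (drift != 0)."""
    return (boundary / (2.0 * drift)) * math.tanh(drift * boundary / (2.0 * noise ** 2))


for c in (0.2, 0.4, 0.8):
    v = fechner_drift(c)
    print(f"contrast={c:.1f}  pieron={pieron_time(c):.3f}s  "
          f"drift={v:.2f}  decision={mean_decision_time(v):.3f}s  "
          f"accuracy={p_correct(v):.2f}")
```

      Under these assumptions the Piéron-type interval shortens with contrast, while the Fechner-scaled drift weakens, which lengthens the decision interval and lowers accuracy.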

      We thank the reviewer for their comments that added context to the events we detected in relation to previous findings. We also believe that the change in the HMP algorithm suggested by the reviewer improved the precision of our analyses and the manuscript. We respond to the reviewer’s specific comments below.

      (1) The second, generally stated aim above is, in the opinion of this reviewer, unconvincing and ill-defined. Presumably, the full sequence of neural events is massively task-dependent, and surely it is more in number than just three. Even the sensory evoked potential typically observed for average ERPs, even for passive viewing, would include a series of 3 or more components - C1, P1, N1, etc. So are some events being missed? Perhaps the authors are identifying key events that impressively demarcate Pieron- and Fechner-adherent sections of the RT, but they might want to temper the claim that they are finding ALL events. In addition, the propensity for topographic parsing algorithms to potentially lump together distinct processes that partially co-evolve should be acknowledged.  

      We agree with the reviewer that the topographical solutions found by HMP will be dependent on the task and the quality and type of data. We address this point in the last section of the discussion (see also response to R3.5). We would also like to add that the events detected by HMP are, by construction, those that contribute to the RT and not necessarily all ERPs elicited by a stimulus.

      In addition to the new last section of the discussion we also make these points clear in the revised manuscript at the discussion start: 

      “By modeling the recorded single-trial EEG signal between stimulus onset and response as a sequence of multivariate events with varying by-trial peak times, we  aimed to detect recurrent events that contribute to the duration of the reaction time in the present perceptual decision-making task”.

      Regarding the typical visual ERPs, in response to this comment but also comments R1.2, R1.3 and R2.1, we aimed for a more precise description of the topographies and thus reduced the width of the HMP expected events to 25ms. This ensures that we do not miss events shorter than the initial expectations of 50ms (see Appendix B of Weindel et al., 2024 and also response to  R1.3). This new estimation provides evidence for at least two of the visual ERPs that, based on their timings and topographies (in relation with the spatial frequency of the stimulus), we interpret as the N40 and the P100 (see response to R1.5 for the justification of this categorization). We provide a description and justification of the interpretations in the result section “Five trial-recurrent sequential events occur in the EEG during decisions” and the discussion section “Visual encoding time”.

      (2) To take a salient example, the last neural event seems to blend the centroparietal positivity with a more frontal midline negativity, some of which would capture the CNV and some motor-execution related components that are more tightly time-locked to, of course, the response. If the authors plotted the traditional single-electrode ERP at the frontal focus and centroparietal focus separately, they are likely to see very different dynamics and contrast- and SAT-dependency. What does this mean for the validity of the multivariate method? If two or more components are being lumped into one neural event, wouldn't it mean that properties of one (e.g., frontal burstiness at response) are being misattributed to the other (centroparietal signal that also peaks but less sharply at response)?

      Using the new HMP parameterization described above we show that the reviewer's intuition was correct. Using an expected pattern duration of 25ms the last event in the original manuscript splits into two events. The second-to-last event, now referred to as the lateralized readiness potential (LRP), presents a strong lateralization (Figure 3) with an increased negativity over the motor cortex contralateral to the right hand. The effect of contrast is mostly on the last event that we interpret as the CPP (Figure 5). Despite the improved precision of the topographies of the identified events, it should be noted that some components will overlap. If the LRP is generated when a certain amount of evidence is accumulated (e.g. that the CPP crosses a certain value) then a time-based topography will necessarily include that CPP activity in addition to the lateralized potential. We discuss this in the section “Motor execution” of the discussion:

      “Adding the abrupt onset of this potential, we believe that this event is the start of motor execution, engaged after a certain amount of evidence. The evidence for this interpretation is manifest in the fact that the event's topography shares some activity with the CPP event that follows, an expected result if the LRP is triggered at a certain amount of evidence, indexed by the CPP”.

      (3) Also related to the method, why must the neural events all be 50 ms wide, and what happens if that is changed? Is it realistic that these neural events would be the same duration on every trial, even if their duration was a free parameter? This might be reasonable for sensory and motor components, but unlikely for cognitive.  

      The HMP method is sensitive to the event's duration as shown in the manuscript about the method (Appendix B of Weindel et al., 2024). Nevertheless as long as the topography in the real data is longer than the expected one it shouldn't be missed (i.e. same goes for by-trial variations in the event width). For this reason we halved the expected event width of 50ms (introduced by the original HsMM-MVPA paper by Anderson and colleagues) in the revision. This new estimation with 25ms thus is much less likely to miss events as evidenced by the new visual and motor events. In the revised manuscript this is addressed at the start of the Results section:

      “Contrary to previous applications (Anderson et al.,2016; Berberyan et al., 2021; Zhang et al., 2018; Krause et al., 2024) we assumed that the multivariate pattern was represented by a 25ms half-sine as our previous research showed that a shorter expected pattern width increases the likelihood of detecting cognitive events (see Appendix B of Weindel et al., 2024)”.

      Regarding the event width as a free parameter this is both technically and statistically difficult to implement as the amount of computing capacity, flexibility and trade-offs among the HMP parameters would, given the current implementation, render the model unfit for most computers and statistically unidentifiable.

      (4) In general, I wonder about the analytic advantage of the parsing method - the paradigm itself is so well-designed that the story may be clear from standard average event-related potential analysis, and this might sidestep the doubts around whether the algorithm is correctly parsing all neural events.  

      Average ERP analysis cannot differentiate between an effect of an experimental factor on the amplitude vs. on the timing of the underlying components (Luck, 2005). Furthermore the overlap of components across trials blurs the distinction between them. For both reasons we would not be able to reach the same level of certainty and precision using ERP analyses. Furthermore the relatively low number of trials per experimental cell (contrast level X SAT X participant = 6 trials) makes the analyses hard to perform on ERPs, which typically require more trials per modality. From the reviewer’s comment we understand that this point was not clear. We therefore discuss this in the revision, Section “Functional interpretation of the events” of the results:

      “Nevertheless identifying neural dynamics on these ERPs centered on stimulus is complicated by the time variation of the underlying single-trial events (see probabilities displayed in Figure 3 for an illustration and Burle et al., 2008, for a discussion). The likely impact of contrast on both amplitude and time on the underlying single-trial event does not allow one to interpret the average ERP traces as showing an effect in one or the other dimension without strong assumptions (Luck, 2005)”.
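
The amplitude-versus-timing ambiguity described above can be illustrated with a minimal simulation. This is a sketch under stated assumptions, not the analysis used in the manuscript: the function names, the 25-sample bump, the 200 trials and the Gaussian latency jitter are all hypothetical choices made for illustration. Every simulated trial contains a bump of identical amplitude, yet the average across trials is much smaller when the bump's latency varies from trial to trial.

```python
import math
import random

def half_sine(n_samples):
    """Unit-amplitude half-sine bump spanning n_samples (hypothetical event shape)."""
    return [math.sin(math.pi * (i + 0.5) / n_samples) for i in range(n_samples)]

def average_erp(onsets, n_times, bump):
    """Average single-trial traces that each contain one bump at its own onset."""
    avg = [0.0] * n_times
    for onset in onsets:
        for i, value in enumerate(bump):
            if 0 <= onset + i < n_times:
                avg[onset + i] += value
    return [v / len(onsets) for v in avg]

rng = random.Random(0)          # fixed seed so the sketch is reproducible
n_times, bump = 400, half_sine(25)

# Same single-trial amplitude in both cases; only the latency variability differs.
fixed = average_erp([150] * 200, n_times, bump)
jittered = average_erp([round(rng.gauss(150, 30)) for _ in range(200)], n_times, bump)

print("peak of average, fixed latency:   ", max(fixed))
print("peak of average, jittered latency:", max(jittered))
```

A lower average peak in the jittered case would look like an amplitude effect in a conventional ERP plot even though only the timing changed, which is why a condition effect on the averaged trace cannot be assigned to one dimension without further assumptions.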

      (5) In particular, would the authors consider plotting CPP waveforms in the traditional way, across contrast levels? The elegant design is such that the C1 component (which has similar topography) will show up negative and early, giving way to the CPP, and these two components will show opposite amplitude variations (not just temporal intervals as is this paper's main focus), because the brighter the two gratings, the stronger the aggregate early sensory response but the weaker the decision evidence due to Fechner. I believe this would provide a simple, helpful corroborating analysis to back up the main functional interpretation in the paper.  

      We agree with the suggestion and have introduced the representation on top of Figure 5 for sets of three electrodes in the occipital, posterior and frontal regions. The new panels clearly show an inversion of the contrast effect dependent on the time and locus of the electrodes. We discuss this in Section “Functional interpretation of the events” of the results:

      “This representation shows that there is an inversion of the contrast effect with higher contrasts having a higher amplitude on the electrodes associated with visual potentials in the first couple of deciseconds (left panel of Figure 5A) while parietal and frontal electrodes show a higher amplitude for lower contrasts in later portions of the ERPs (middle and right panel of Figure 5A)”.

      To us, this crucially shows that we cannot achieve the same decomposition using traditional ERP analyses. In these plots it appears that while, as described by the reviewer, there is an inversion, the timing and amplitude of the changes due to contrast can hardly be interpreted.

      (6) The first component is picking up on the C1 component (which is negative for these stimulus locations), not a "P100". Please consult any visual evoked potential study (e.g., Luck, Hillyard, etc). It is unexpected that this does not vary in latency with contrast - see, for example, Gebodh et al (2017, Brain Topography) - and there is little discussion of this. Could it be that nonlinear trends were not correctly tested for?

      We disagree with the reviewer on the interpretation of the ERP. The timing of the detected component is later than the one usually associated with a C1. Furthermore, the central display does not create optimal conditions to detect a C1.

      We do agree that the topography invites this confusion, but we believe that this is due to the spatial frequency of the stimulus, which generates a high posterior positivity (see references in the following extract). The new HMP solution now also shows an effect of contrast on the P100 latencies; we believe this is due to the increased precision in the time location of the component. We discuss this in the “Visual encoding time” section of the discussion:

      “The following event, the P100, is expressed around 70ms after the N40, its topography is congruent with reports for stimuli with low spatial frequencies as used in the current study (Kenemans et al., 2002, 2000; Proverbio et al., 1996). The timing of this P100 component is changed by the contrast of the stimulus in the direction expected by the Piéron law (Figure 4A)”. 

      (7) There is very little analysis or discussion of the second stage linked to attention orientation - what would the role of attention orientation be in this task? Is it spatial attention directed to the higher contrast grating (and if so, should it lateralise accordingly?), or is it more of an alerting function the authors have in mind here?  

      We agree that we were not specific enough on the interpretation of this attention stage. We now discuss our hypothesis in the section “Attention orientation” of the discussion:  

      “We do however observe an asymmetry in the topographical map Figure 3. This asymmetry might point to an attentional bias with participants (or at least some participants) allocating attention to one side over the other in the same way as the N2pc component (Luck and Hillyard, 1994, Luck et al., 1997). Based on this collection of observations, we conclude that this third event represents an attention orientation process. In line with the finding of Philiastides et al. (2006), this attention orientation event might also relate to the allocation of resources. Other designs varying the expected cognitive load or spatial attention could help in further interpreting the functional role of this third event”.

      We would like to add that it is unlikely that the asymmetry we mention in the discussion stems from a redirection of attention towards the higher contrast, as the experimental design balanced the side of presentation. We therefore believe that this is a behavioral bias rather than a bias toward the highest contrast stimulus as suggested by the reviewer. We hope that, while more could be tested and discussed, this discussion is sufficient given the current manuscript's goal.

      Reviewer #2 (Public review):  

      Summary:  

      The authors decomposed response times into component processes and manipulated the duration of these processes in opposing directions by varying contrast, and overall by manipulating speed-accuracy tradeoffs. They identify different processes and their durations by identifying neural states in time and validate their functional significance by showing that their properties vary selectively as expected with the predicted effects of the contrast manipulation. They identify 3 processes: stimulus encoding, attention orienting, and decision. These map onto classical event-related potentials. The decision-making component matched the CPP, and its properties varied with contrast and predicted decision-accuracy, while also exhibiting a burst not characteristic of evidence accumulation.  

      Strengths:  

      The design of the experiment is remarkable and offers crucial insights. The analysis techniques are beyond state-of-the-art, and the analyses are well motivated and offer clear insights.  

      Weaknesses:  

      It is not clear to me that the results confirm that there are only 3 processes, since e.g., motor preparation and execution were not captured. While the authors discuss this, this is a clear weakness of the approach, as other components may also have been missed. It is also unclear to what extent topographies map onto processes, since, e.g., different combinations of sources can lead to the same scalp topography.  

      We thank the reviewer for their kind words and for drawing attention to the question of the missing motor preparation event. In light of this comment (and also R1.1, R3.3) the revised manuscript uses a finer-grained approach for the multivariate event detection. This more precise estimation comes from the use of a shorter expected pattern: the initial expectation of a 50ms half-sine was halved, ensuring that we do not miss events shorter than the initial expectation (see Appendix B of Weindel et al., 2024 and also the response to R1.3). In the new solution the motor component that the reviewer expected is found, as evidenced by the topography of the event, its lateralization and a time-to-response congruent with a response execution event. This is now described in the section “Motor execution” of the revised manuscript:

      “The before last event, identified as the LRP, shows a strong hemispheric asymmetry congruent with a right hand response. The peak of this event is approximately 100 ms before the response which is congruent with reports that the LRP peaks at the onset of electromyographical activity in the effector muscle (Burle et al., 2004), typically happening 100ms before the response in such decision-making tasks (Weindel et al., 2021). Furthermore, while its peak time is dependent on contrast, its expression in the EEG is less clearly related to the contrast manipulation than the following CPP event”.

      Reviewer #3 (Public review):  

      Summary:  

      In this manuscript, the authors examine the processing stages involved in perceptual decision-making using a new approach to analysing EEG data, combined with a critical stimulus manipulation. This new EEG analysis method enables single-trial estimates of the timing and amplitude of transient changes in EEG time-series, recurrent across trials in a behavioural task. The authors find evidence for three events between stimulus onset and the response in a two-spatial-interval visual discrimination task. By analysing the timing and amplitude of these events in relation to behaviour and the stimulus manipulation, the authors interpret these events as related to separable processing stages for stimulus encoding, attention orientation, and decision (deliberation). This is largely consistent with previous findings from both event-related potentials (across trials) and single-trial estimates using decoding techniques and neural network approaches.  

      Strengths:  

      This work is not only important for the conceptual advance, but also in promoting this new analysis technique, which will likely prove useful in future research. For the broader picture, this work is an excellent example of the utility of neural measures for mental chronometry.  

      We appreciate the very positive review and thank the reviewer for pointing out important weaknesses in our original manuscript and also providing resources to address them in the recommendations to authors. Below we comment on each identified weakness and how we addressed them.   

      Weaknesses:  

      (1) The manuscript would benefit from some conceptual clarifications, which are important for readers to understand this manuscript as a stand-alone work. This includes clearer definitions of Piéron's and Fechner's laws, and a fuller description of the EEG analysis technique.

      We agree that the descriptions of both laws were insufficient; we therefore added the following text in the last paragraph of the introduction:

      “Piéron’s law predicts that the time to perceive the two stimuli (and thus the choice situation) should follow a negative power law with the stimulus intensity (Figure 1, green curve). In contradistinction, Fechner’s law states that the perceived difference between the two patches follows the logarithm of the absolute contrast of the two patches (Figure 1, yellow curve). As the task of our participants is to judge the contrast difference, Piéron’s law should predict the time at which the comparison starts (i.e. the stimuli become perceptible), while Fechner’s law should implement the comparison, and thus decision, difficulty”.
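
The opposing predictions of the two laws can be sketched numerically. This is a minimal illustration rather than the parameterisation fitted in the manuscript: the function names, the parameter values (`alpha`, `beta`, `gamma`, `k`) and the assumption of a constant physical difference `delta` between the two patches are hypothetical. Piéron's law is written in its usual form, time = alpha · intensity^(−beta) + gamma, and the Fechner term as the difference between the logarithms of the two contrasts.

```python
import math

def pieron_time(intensity, alpha=1.0, beta=0.5, gamma=0.2):
    """Piéron's law: encoding time falls as a power function of stimulus intensity.

    alpha, beta and gamma are illustrative values, not fitted parameters.
    """
    return alpha * intensity ** (-beta) + gamma

def fechner_difference(contrast, delta=0.05, k=1.0):
    """Perceived difference between patches at `contrast` and `contrast + delta`,
    assuming each percept grows with the logarithm of its contrast (Fechner's law).

    The constant physical difference `delta` is an assumption of this sketch.
    """
    return k * (math.log(contrast + delta) - math.log(contrast))

for c in (0.1, 0.3, 0.6, 0.9):
    print(c, round(pieron_time(c), 3), round(fechner_difference(c), 3))
```

Under these assumptions both quantities shrink as contrast increases: the stimuli are perceived sooner (shorter Piéron time), but the perceived difference between them is smaller, so the comparison becomes harder. That is the dissociation the design relies on.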

      Regarding the EEG analysis technique, we added a few elements at the start of the Results section:

      “The hidden multivariate pattern model (HMP) implemented assumed that a task-related multivariate pattern event is represented by a half-sine whose timing varies from trial to trial based on a gamma distribution with a shape parameter of 2 and a scale, controlling the average latency of the event, free-to-vary per event (Weindel et al., 2024)”.
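
The two modelling assumptions in this passage can be made concrete with a short sketch. It is not the implementation used by the HMP software: the helper names, the 1000 Hz sampling rate and the 100 ms mean latency are hypothetical, and only the shape parameter of 2, the 25 ms and 50 ms widths and the half-sine shape come from the text. The sketch relies on the fact that a gamma distribution has mean = shape × scale, so with the shape fixed the scale alone sets the average latency of an event.

```python
import math
import random

SHAPE = 2.0  # shape parameter fixed to 2, as stated in the text

def scale_for_mean(mean_latency_ms):
    """Gamma mean = shape * scale, so the scale follows from the desired mean."""
    return mean_latency_ms / SHAPE

def sample_latencies(mean_latency_ms, n_trials, seed=0):
    """Draw by-trial event latencies (ms) from a gamma distribution."""
    rng = random.Random(seed)
    scale = scale_for_mean(mean_latency_ms)
    return [rng.gammavariate(SHAPE, scale) for _ in range(n_trials)]

def half_sine_template(width_ms=25.0, sfreq_hz=1000.0):
    """Expected event pattern: one half-period of a sine lasting width_ms.

    The sampling rate is an assumption of this sketch.
    """
    n = max(1, round(width_ms * sfreq_hz / 1000.0))
    return [math.sin(math.pi * (i + 0.5) / n) for i in range(n)]

latencies = sample_latencies(mean_latency_ms=100.0, n_trials=5000)
print("empirical mean latency (ms):", sum(latencies) / len(latencies))
print("template lengths (samples):", len(half_sine_template(25.0)), len(half_sine_template(50.0)))
```

Halving the expected width from 50 ms to 25 ms halves the template length, which is what allows shorter events to be matched instead of being absorbed into a longer expected pattern.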

      We also made the technique clearer at the start of the discussion:

      “By modeling the recorded single-trial EEG signal between stimulus onset and response as a sequence of multivariate events with varying by-trial peak times, we aimed to detect recurrent events that contribute to the duration of the reaction time in the present perceptual decision-making task. In addition to the number of events, using this hidden multivariate pattern approach (Weindel et al., 2024) we estimated the trial-by-trial probability of each event’s peak, therefore accessing at which time sample each event was the most likely to occur”.

      Additionally, we added a proper description in the method section (see the new first paragraph of the “Hidden multivariate pattern” subsection). 

      (2) The manuscript, broadly, but the introduction especially, may be improved by clearly delineating the multiple aims of this project: examining the processes for decision-making, obtaining single-trial estimates of meaningful EEG-events, and whether central parietal positivity reflects ramping activity or steps averaged across trials.

      For the sake of clarity we removed the question of the ramping activity vs steps in the introduction and focused on the processes in decision-making and their single-trial measurement as this is the main topic of the paper. Furthermore the references provided by the reviewer allowed us to write a more comprehensive review of previous studies and how the current study is in line with those. These changes are mainly manifested in these new sentences:

      “As an example Philiastides et al. (2006) used a classifier on the EEG activity of several conditions to show that the strength of an early EEG component was proportional to the strength of the stimulus while a later component was related to decision difficulty and behavioral performance (see also Salvador et al., 2022; Philiastides and Sajda, 2006). Furthermore the authors interpreted that a third EEG component was indicative of the resource allocated to the upcoming decision given the perceived decision difficulty. In their study, they showed that it is possible to use single-trial information to separate cognitive processes within decision-making. Nevertheless, their method requires a decoding approach, which requires separate classifiers for each component of interest and restrains the detection of the components to those with decodable discriminating features (e.g. stimuli with strong neural generators such as face stimuli, see Philiastides et al., 2006)”.

      (3) A fuller discussion of the limitations of the work, in particular, the absence of motor contributions to reaction time, would also be appreciated. 

      As laid out in the responses to comments R1.1 and R2, the new estimates now include evidence for a motor preparation component. We discuss this in the new “motor execution” paragraph in the discussion section. Additionally, we discuss the limitations of the study and the method in the last two paragraphs of the discussion (in the new Section “Generalization and limitation”).

      (4) At times, the novelty of the work is perhaps overstated. Rather, readers may appreciate a more comprehensive discussion of the distinctions between the current work and previous techniques to gauge single-trial estimates of decision-related activity, as well as previous findings concerning distinct processing stages in decision-making. Moreover, a discussion of how the events described in this study might generalise to different decision-making tasks in different contexts (for example, in auditory perception, or even value-based decision-making) would also be appreciated.  

      We agree that the original text could be read as overstating the novelty of the work. In addition to the changes linked to R3.2, we now also discuss the link with previous studies in the penultimate paragraph of the discussion, in the new “Generalization and limitations” section:

      “The present study showed what cognitive processes are contributing to the reaction time and estimated single-trial times of these processes for this specific perceptual decision-making task. The identified processes and topographies ought to be dependent on the task and even the stimuli (e.g. sensory events will change with the sensory modality). More complex designs might generate a higher number of cognitive processes (e.g. memory retrieval from a cue, Anderson et al., 2016) and so could more natural stimuli which might trigger other processes in the EEG (e.g. appraisal vs. choice as shown by Frömer et al., 2024). Nevertheless, the observation of early sensory vs. late decision EEG components is likely to generalize across many stimuli and tasks as it has been observed in other designs and methods (Philiastides et al., 2006; Salvador et al., 2022). To these studies we add that we can evaluate the trial-level contribution, as already done for specific processes (e.g. Si et al., 2020; Sturm et al., 2016), for the collection of events detected in the current study”.

      Reviewing Editor Comments:  

      As you will see, all three reviewers agree that the paper makes a valuable contribution and has many strengths. You will also see that they have provided a range of constructive comments highlighting potential issues with the interpretation of the outcomes of your signal decomposition method. In particular, all three reviewers point out that your results do not identify separate motor preparation signals, which we know must be operating on this type of task. The reviewers suggest further discussion of this issue and the potential limitations of your analysis approach, as well as suggesting some additional analyses that could be run to explore this further. While making these changes would undoubtedly enhance the paper and the final public reviews, I should note that my sense is that they are unlikely to change the reviewers' ratings of the significance of the findings and the strength of evidence in the final eLife assessment.

      Reviewer #1 (Recommendations for the authors):  

      (1) Abstract: "choice onset" is ill-defined and not the label most would give the start of the RT interval. Do you mean stimulus onset?  

      We replaced "choice onset" with "stimulus onset" in the abstract.

      (2) Similarly "choice elements" in the introduction seem to refer to sensory attributes/objects being decided about?  

      We replaced "choice-elements" with "choice-relevant features of the stimuli"

      (3) "how the RT emerges from these putative components" - it would be helpful to specify more what level of answer you're looking for, as one could simply answer "when they're done."  

      We replaced this with "how the variability in RTs emerges from these putative components".

      (4) Line 61-62: I'm not sure this is a fully correct characterisation of Frömer et al. It was not similar in invoking a step function - it did not invoke any particular mechanism or function, and in that respect does not compare well to Latimer et al. Also, I believe it was the overlap of stimulus-locked components, not response-locked, that they argued could falsely generate accumulator-like buildup in the response-locked ERP.  

      We indeed wrongly described Frömer et al. The sentence is now "In human EEG data, the classical observation of a slowly evolving centro-parietal positivity, scaling with evidence accumulation, was suggested to result from the overlap of time-varying stimulus-related activity in the response-locked event related potential"

      (5) Line 78: Should this be single-trial *latency*?  

      This referred to location in time but we agree that the term is confusing and thus replaced it with latencies.

      (6) The caption of Figure 1 should state what is meant by the y-axis "time"  

      We added the sentence "The y-axis refers to the time predicted by each law given a contrast value (x-axis) and the chosen set of parameters." in the caption of Figure 1.

      (7) Line 107: Is this the correct description of Fechner's law? If the perceived difference follows the log of the physical difference, then a constant physical difference should mean a constant perceived difference. Perhaps a typo here.  

      This was indeed a typo; we replaced the corresponding part of the sentence with "the perceived difference between the two patches follows the logarithm of the absolute contrast of the two patches".

      (8) Line 128: By scale, do you mean magnitude/amplitude?  

      No, this refers to the parameter of a gamma distribution. To clarify we edited the sentence:  "based on a gamma distribution with a shape parameter of 2 and a scale parameter, controlling the average latency of the event, free-to-vary per event"

      (9) The caption of Figure 3 is insufficient to make sense of the top panel. What does the inter-event interval mean, and why is it important to show? What is the "response" event?  

      We agree that the top panel was insufficiently described. To keep the length of the paper short and because of the relatively low amount of information provided by these panels, we replaced them with a figure only showing the average topographies as well as the asymmetry tests for each event.

      (10) Figure 4: caption should say what the top vs bottom row represents (presumably, accuracy vs speed emphasis?), and what the individual dots represent, given the caption says these are "trial and participant averaged". A legend should be provided for the rightmost panels.  

      We agree and therefore edited Figure 4. The beginning of the caption mentioned by the reviewer now reads: “A) The panels represent the average duration between events for each contrast level, averaged across participants and trials (stimulus and response respectively as first and last events) for accuracy (top) and speed instructions (bottom).”. Additionally we added legends for the SAT instructions and the model fits.

      (11) Line 189: argued for a decision-making role of what?  

      Stafford and Gurney (2004) proposed that Piéron’s law could reflect a non-linear transformation from sensory input to action outcomes, which they argued reflected a response mechanism. We (Van Maanen et al., 2012) specified this result by showing that a Bayesian Observer Model, in which evidence for two alternative options was accumulated following Bayes’ rule, indeed predicted a power relation between the difference in sensory input of the two alternatives and mean RT. However, the current data suggest that such an explanation cannot be the full story, as also noted by R3. To clarify this point we replaced the comment with the following sentence:

      “Note that this observation is not necessarily incongruent with theoretical work that argued that Piéron’s law could also be a result of a response selection mechanism (Stafford and Gurney, 2004; Van Maanen et al., 2012; Palmer et al., 2005). It could be that differences in stimulus intensity between the two options also contribute to a Piéron-like relationship in the later intervals, that is convoluted with Fechner’s law (see Donkin and Van Maanen, 2014 for a similar argument). Unfortunately, our data do not allow us to discriminate between a pure logarithmic growth function and one that is mediated by a decreasing power function”.

      (12) Table 2: There is an SAT effect even on the first interval, which is quite remarkable and could be discussed more - does this mean that the C1 component occurs earlier under speed pressure? This would be the first such finding.  

      The event we originally labelled as a P100 was sensitive to SAT, but the earliest event is now the N40, which is not statistically sensitive to speed pressure in these data. We do not find it surprising that the P100 is still sensitive to SAT and therefore do not highlight it.

      (13) Line 221: "decrease of activation when contrast (and thus difficulty) increases" - is this shown somewhere in the paper?  

      The whole section for this analysis was rewritten (see comment below)

      (14) I find the analysis of Figure 5 interesting, but the interpretation odd. What is found is that the peak of the decision signal aligns with the response, consistent with previous work, but the authors choose to interpret this as the decision signal "occurring as a short-lived burst." Where is the quantitative analysis of its duration across trials? It can at least be visually appraised in the surface plot, and this shows that the signal has a stimulus-locked onset and, apart from the slowest RTs, remains present and for the most part building, until response. What about this is burst-like? A peak is not a burst.  

      This was a residue of a previous version of the paper in which an analysis reported that no evidence accumulation trace was found. After proper simulations, this analysis turned out to be wrong because of a poor statistical test. We therefore removed this paragraph in the revised manuscript, and Figure 5 has been extended to include surface plots for all the events.

      Reviewer #2 (Recommendations for the authors):  

      Overall, I really enjoyed reading this paper. However, in some places the approach is a bit opaque or the results are difficult to follow. As I read the paper, I noted:  

      Did you do a simple DDM, or did you do a collapsing bound for speed?  

      The fitted DDM was an adaptation of the proportional rate diffusion model. We make this clearer at the end of the introduction: “Given that Fechner’s law is expected to capture decision difficulty we connected this law to the classical diffusion decision models by replacing the rate of accumulation with Fechner’s law in the proportional rate diffusion model of Palmer et al. (2005)”.
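
To make this substitution concrete, the following sketch plugs a Fechner-type drift rate into the standard closed-form predictions of a symmetric, unbiased diffusion process with unit diffusion coefficient, as used in the proportional rate diffusion model of Palmer et al. (2005). It is an illustration under stated assumptions and not the fitted model from the manuscript: the function names, the parameter values (`k`, `delta`, `bound`, `t_nondecision`) and the constant physical difference between the patches are hypothetical.

```python
import math

def fechner_drift(contrast, delta=0.05, k=10.0):
    """Drift rate proportional to the log-contrast difference between the patches.

    `delta` (constant physical difference) and `k` are illustrative assumptions.
    """
    return k * (math.log(contrast + delta) - math.log(contrast))

def diffusion_predictions(drift, bound=1.0, t_nondecision=0.3):
    """Accuracy and mean RT for bounds at +/- `bound`, unbiased start, unit noise."""
    accuracy = 1.0 / (1.0 + math.exp(-2.0 * bound * drift))
    decision_time = (bound / drift) * math.tanh(bound * drift)
    return accuracy, decision_time + t_nondecision

for c in (0.1, 0.3, 0.6, 0.9):
    acc, rt = diffusion_predictions(fechner_drift(c))
    print(c, round(acc, 3), round(rt, 3))
```

With these illustrative values, a higher overall contrast yields a lower drift rate and hence lower predicted accuracy and longer predicted mean RT, the direction expected if Fechner's law governs decision difficulty.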

      It is confusing that the order of intervals in the text doesn't match the order in the table. It might be better to say what events the interval is between rather than assuming that the reader reconstructs.  

      We agree and adapted the order in both the text and the table. The table is now also more explicit (e.g. RT instead of S-R)

      Otherwise, I do wonder to what extent the method is able to differentiate processes that yield similar scalp topographies and find it a bit concerning that no motor component was identified.  

      We believe that the new version with the LRP/CPP demonstrates that the method can handle similar topographies. The method can handle events with close topographies as long as they are separate in time; however, if they are not sequential to one another, the method cannot capture both events. We now discuss this, in relation to the C1/P100 overlap, in the discussion section “Visual encoding time”:

      “Nevertheless this event, seemingly overlapping with the P100 even at the trial level (Figure 5C), cannot be recovered by the method we applied. The fact that the P100 was recovered instead of the C1 could indicate that only the timing of the P100 contributes to the RT (see Section 3 of Weindel et al., 2024)”.

      And we more generally address the question of overlap in the new section “Generalization and limitation”.

      Reviewer #3 (Recommendations for the authors):  

      Major Comments:  

      (1) If we agree on one thing, it is that motor processes contribute to response time. Line 364: "In the case of decision-making, these discrete neural events are visual encoding, attention-orientation, and decision commitment, and their latency make up the reaction time." Does the third event, "decision commitment", capture both central parietal positivity (decision deliberation) and motor components? If so, how can the authors attribute the effects to decision deliberation as opposed to motor preparation?  

      Thanks to the suggestions, including those in the public review, this main problem is now addressed: we now capture both a motor component and a decision commitment.

      Line 351 suggests that the third event may contain two components.  

      This was indeed our initial, badly written, hypothesis. Nevertheless the new solution again addresses this problem.

      The time series in Figure 6 shows an additional peak that is not evident in the simulated ramp of Appendix 1.  

      This was probably due to the overlap of both the CPP and the LRP. It is now much clearer that the CPP looks mostly like a ramp while the LRP looks much more like a burst-like/peaked activity. We make this clear in the “Decision event” paragraph of the discussion section:

      “Regarding the build-up of this component, the CPP is seen as originating from single-trial ramping EEG activities but other work (Latimer et al., 2015; Zoltowski et al., 2019) have found support for a discrete event at the trial-level. The ERPs on the trial-by-trial centered event in Figure 5 show support for both accounts. As outlined above, the LRP is indeed a short burst-like activity but the build-up of the CPP between high vs low contrast diverges much earlier than its peak”.

      Previous analyses (Weindel et al., 2024) found motor-related activity from central parietal topographies close to the response by comparing the difference in single-trial events on left- vs right-hand response trials. The authors suggest at line 315 that the use of only the right hand for responding prevented them from identifying a motor event.  

      The use of only the right hand should have made the event more identifiable because the topography would be consistent across trials (rather than inverting on left vs right hand response trials).  

      The reviewer is correct: in the original manuscript we did not test for lateralization, but the reviewer’s comment gave us the idea to explicitly test for the asymmetry (Figure 3). This test now clearly shows what would be expected for a motor event, with a strong negativity over the left motor cortex.

      The authors state on line 422 that the EEG data were truncated at the time of the response.  

      Could this have prevented the authors from identifying a motor event that might overlap with the timing of the response?  

      We thank the reviewer for this suggestion. This would have been a possibility but the problem is that adding samples after the response also adds the post-response processes (error monitoring, button release, stimulus disappearance, etc.). While increasing the samples after the response is definitely something that we need to inspect, we think that the separation we achieved in this revision doesn’t call for this supplementary analysis.

      The largest effects of contrast on the third event amplitude appear around the peak as opposed to the ramp. If the peak is caused by the motor component, how does this affect the conclusions that this third event shows a decision-deliberation parietal processes as opposed to a motor process (a number of studies suggest a causal role for motor processes in decision-making e.g. Purcell et al., 2010 Psych Rev; Jun et al., 2021 Nat Neuro; Donner et al., 2009 Curr Bio).  

      This result has now changed: it is no longer the case that the peak captures most of the effect. We do, however, think that there might be some link to theories of motor-related accumulation. We therefore added this to the discussion in the Motor execution section:

      “Based on all these observations, it is therefore very likely that this LRP event signs the first passage of a two-step decision process as suggested by recent decision-making models (Servant et al., 2021; Verdonck et al., 2021; Balsdon et al., 2023)”.

      I would suggest further investigation into the motor component (perhaps by extending the time window of analysed EEG to a few hundred ms after the response) and at least some discussion of the potential contribution of motor processes, in relation to the previous literature.  

      We believe that the absence of a motor component is sufficiently addressed in the revised manuscript and in the responses to the other comments.    

      (2) What do we learn from this work? Readers would appreciate more attention to previous findings and a clearer outline of how this work differs. Two points stand out, outlined below. I believe the authors can address these potential complaints in the introduction and discussion, and perhaps provide some clarification in the presentation of the results.  

      In the introduction, the authors state that "... to date, no study has been able to provide single-trial evidence of multiple EEG components involved in decision-making..." (line 64). Many readers would disagree with this. For example, Philiastides, Ratcliff, & Sajda (2006) use a single-trial analysis to unravel early and late EEG components relating to decision difficulty and accuracy (across different perceptual decisions), which could be related to the components in the current work. Other, network-based single-trial EEG analyses (e.g., Si et al., 2020, NeuroImage, Sturn et al., 2016 J Neurosci Methods) could also be related to the current component approach. Yet other approaches have used inverse encoding models to examine EEG components related to separable decision processes within trials (e.g., Salvador et al., 2022, Nat Comms). The results of the current work are consistent with this previous work - the two components from Philiastides et al., 2006 can be mapped onto the components in the current work, and Salvador et al., 2022 also uncover stimulus- and decision-deliberation related components.

      We completely agree with the reviewer that the link to previous work was insufficient. We now include all references that the reviewer points out both in the introduction (see response R3.2) and in the discussion (see response R3.4). We wish to thank the reviewer for bringing these papers to our attention as they are important for the manuscript.

      The authors relate their components to ERPs. This prompts the question of whether we would get the same results with ERP analyses (and, on the whole, the results of the current work are consistent with conclusions based on ERP analyses, with the exception of the missing motor component). It's nice that this analysis is single-trial, but many of the follow-up analyses are based on grouping by condition anyway. Even the single-trial analysis presented in Figure 4 could be obtained by median splits (given the hypotheses propose opposite directions of effects, except for the linear model). 

      We do not fully agree with the reviewer, because classical ERP analyses would require many more data points. The strength of the method here is that it uses the information shared across all contrast levels to model the processing time of a single contrast level (6 trials per participant). Furthermore, as stated in the responses to R1.4 and R1.5, the aim of the paper is to estimate the timing of information-processing components, which cannot be achieved with classical ERPs without strong, and likely false, assumptions.

      Medium Comments:  

      (1) The presentation of Piéron's law for the behavioural analysis is confusing. First, both laws should be clearly defined for readers who may be unfamiliar with this work. I found the proposal that Piéron's law predicts decreasing RT for increasing pedestal contrast in a contrast discrimination paradigm task surprising, especially given the last author's previous work. For example, Donkin and van Maanen (2014) write "However, the commonality of Piéron's Law across so many paradigms has lead researchers (e.g., Stafford & Gurney, 2004; Van Maanen et al., 2012) to propose that Piéron's Law is unrelated to stimulus scaling, but is a result of the architecture of the response selection (or decision making) process." The pedestal contrast is unrelated to the difficulty of the contrast discrimination task (except for the consideration of Fechner's law). Instead, Piéron's law would apply to the subjective difference in contrast in this task, as opposed to the pedestal contrast. The EEG results are consistent with these intuitions about Piéron's law (or more generally, that contrast is accumulated over time, so a later EEG component for lower pedestal contrast makes sense): pedestal contrast should lead to faster detection, but not necessarily faster discrimination. Perhaps, given the complexity of the manuscript as a whole, the predictions for the behavioural results could be simplified?

      We agree that the initial version was confusing. We have now clarified the presentation of Piéron's law at the end of the introduction (see also response to R2).
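      For readers unfamiliar with the law, a minimal sketch of Piéron's law in its standard textbook form, RT = r0 + k * I^(-beta), may help. The parameter names and values (`r0`, `k`, `beta`) and the contrast levels below are illustrative assumptions, not estimates from the manuscript.

```python
def pieron_rt(intensity, r0=0.3, k=0.2, beta=0.5):
    """Piéron's law in its standard form: RT = r0 + k * intensity ** (-beta).

    r0 (asymptotic RT, in seconds), k (scale) and beta (exponent) are
    placeholder values chosen for illustration only.
    """
    if intensity <= 0:
        raise ValueError("intensity must be positive")
    return r0 + k * intensity ** (-beta)


# Hypothetical contrast levels (proportion of maximum contrast).
contrasts = [0.05, 0.1, 0.2, 0.4, 0.8]
predicted = [pieron_rt(c) for c in contrasts]

for c, rt in zip(contrasts, predicted):
    print(f"contrast={c:.2f}  predicted RT={rt:.3f} s")
```

      Under these assumptions the predicted RT falls monotonically with stimulus intensity and flattens towards `r0`, which is the qualitative signature discussed above.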

      Once Fechner's law is applied, decision difficulty increases with increasing contrast, so Piéron's law on the decision-relevant intensity (perceived difference in contrast) would also predict increasing RT with increasing pedestal contrast. It is unlikely that the data are of sufficient resolution to distinguish a log function from a power of a log function, but perhaps the claim on line 189 could be weakened (the EEG results demonstrate Piéron's law for detection, but do not provide evidence against Piéron's law in discrimination decisions).  

      This is an excellent observation, thank you for bringing it to our attention. Indeed, the data support the notion that Piéron’s law is related to detection, but do not rule out that it is also related to decision or discrimination. In earlier work, we (Donkin & Van Maanen, 2014) addressed this question as well, and reached a similar conclusion. After fitting evidence accumulation models to data, we found no linear relationship between drift rates and stimulus difficulty, as would have been the case if Piéron's law could be fully explained by the decision process (as -indirectly- argued by Stafford & Gurney, 2004; Van Maanen et al., 2012). The fact that we observed evidence for a non-linear relationship between drift rates and stimulus difficulty led us to the same conclusion, that Piéron’s law could be reflected in both discrimination and decision processes. We added the following comment to the discussion about the functional locus of Piéron's law to clarify this point:

      “Note that this observation is not necessarily incongruent with theoretical work that argued that Piéron’s law could also be a result of a response selection mechanism (Stafford and Gurney, 2004; Van Maanen et al., 2012; Palmer et al., 2005). It could be that differences in stimulus intensity between the two options also contribute to a Piéron like relationship in the later intervals, that is convoluted with Fechner’s law (see Donkin and Van Maanen, 2014, for a similar argument). Unfortunately, our data do not allow us to discriminate between a pure logarithmic growth function and one that is mediated by a decreasing power function”.

      (2) Appendix 1 shows that the event detection of the HMP method will also pick up on ramping activity. The description of the problem in the introduction is that event-like activity could look like ramping when averaged across trials. To address this problem, the authors should simulate events (with some reasonable dispersion in timing such that they look like ramping when averaged) and show that the HMP method would not pull out something that looked like ramping. In other words, the evidence for ramping in this work is not affected by the previously identified confounds.  

      We agree that this demonstration was necessary and thus added the suggested simulation to Appendix 1. As can be seen in Figure 1 of the appendix, when we simulate a half-sine, the average ERP based on the timing of the event looks like a half-sine.
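      The logic of this check can be sketched in a few lines. The sketch below is not the HMP analysis from Appendix 1: it is noise-free, uses the true (simulated) event onsets instead of estimated ones, and all sizes (`N_TRIALS`, `N_SAMPLES`, `WIDTH`, the onset range) are hypothetical.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

N_TRIALS = 500
N_SAMPLES = 200   # samples per trial (hypothetical)
WIDTH = 20        # half-sine duration in samples (hypothetical)

# One transient event: a half-sine of unit amplitude.
half_sine = [math.sin(math.pi * k / WIDTH) for k in range(WIDTH + 1)]

# Each trial contains the same event at a different, random onset.
onsets = [random.randint(20, 150) for _ in range(N_TRIALS)]
trials = []
for onset in onsets:
    trial = [0.0] * N_SAMPLES
    for k, value in enumerate(half_sine):
        trial[onset + k] = value
    trials.append(trial)

# Stimulus-locked average: onset jitter smears the event into a low, broad bump.
stim_locked = [sum(t[i] for t in trials) / N_TRIALS for i in range(N_SAMPLES)]

# Event-locked average: re-align every trial on its own onset first.
event_locked = [
    sum(t[o + k] for t, o in zip(trials, onsets)) / N_TRIALS
    for k in range(WIDTH + 1)
]

print(f"peak of stimulus-locked average: {max(stim_locked):.3f}")
print(f"peak of event-locked average:    {max(event_locked):.3f}")
```

      In this toy setting, averaging on the event timing returns the half-sine itself, whereas averaging on trial onset yields a much lower and broader deflection.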

      (3) Some readers may be interested in a fuller discussion of the failure of the Fechner diffusion model in the speed condition.  

      We are unsure which failure the reviewer refers to but assumed it was in relation to the behavioral results and thus added: 

      It is unlikely that neither Piéron's nor Fechner's law impacts the RT in the speed condition. Instead, this result is likely due to the composite nature of the RT, where both laws co-exist but cancel each other out due to their opposite predictions.

      Minor Comments:  

      (1) "By-trial" is used throughout. Normally, it is "trial-by-trial" or "single-trial" or "trial-wise".

      We replaced all occurrences of “by-trial” with the three suggested terms where appropriate.

      (2) Line 22: "The sum of the times required for the completion of each of these precessing steps is the reaction time (RT)." The total time required. Processing.  

      Corrected for both.

      (3) Line 26/27: "Despite being an almost two century old problem (von Helmholtz, 2021)." Perhaps the citation with the original year would make this point clearer.  

      We agree and replaced the citation.

      (4) Line 73: "accounted by estimating". Accounted for by estimating.  

      Corrected.

      (5) Line 77 "provides an estimation on the." Of the.  

      Corrected.

      (6) Line 86: "The task of the participants was to answer which of two sinusoidal gratings." The picture looks like Gabor's? Is there a 2d Gaussian filter on top of the grating? Clarify in the methods, too.  

      We incorrectly described the stimuli, as these were indeed Gabors. This is now corrected both in the main text and in the method section.

      (7) Figure 1 legend: "The Fechner diffusion law" Fechner's law or your Fechner diffusion model?  

      “Law” was incorrect, so we changed it to “model” as suggested.

      (8) Line 115: "further allows to connects the..." Allows connecting the.  

      Corrected.

      (9) Line 123: "lower than 100 ms or higher than..." Faster/slower.  

      Corrected.

      (10) Line 131: "To test what law." Which law.?  

      Corrected to model.

      (11) Figure 2 legend: "Left: Mean RT (dot) and average fit (line) over trials and participants for each contrast level used." The fit is over trials and participants? Each dot is? Average trials for each contrast level in each participant?  

      This sentence was corrected to “Mean RT (dot) for each contrast level and averaged predictions of the individual fits (line) with Accuracy (Top) and Speed (Bottom) instructions.”.

      (12) Line 231: "A comprehensive analysis of contrast effect on". The effect of contrast on.  

      This title was changed to “functional interpretation of the events”.

      (13) Line 23: "the three HMP event with". Three HMP events.

      The sentence no longer exists in the revised manuscript.

      (14) Line 270: "Secondly, we computed the Pearson correlation coefficient between the contrast averaged proportion of correct." Pearson is for continuous variables. Proportion correct is not continuous. Use Spearman, Kendall, or compute d'.  

      The reviewer rightly pointed out our error; we corrected this by computing the Spearman correlation instead.
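      To illustrate why the rank-based coefficient is the safer choice here, the sketch below computes Spearman's rho as the Pearson correlation of rank-transformed data. The two variables (`duration_ms`, `prop_correct`) and their values are hypothetical and are not taken from the manuscript.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: proportion correct saturates as an interval gets shorter.
duration_ms = [520, 480, 450, 430, 415, 405]
prop_correct = [0.55, 0.70, 0.85, 0.93, 0.97, 0.99]

print("Pearson :", round(pearson(duration_ms, prop_correct), 3))
print("Spearman:", round(spearman(duration_ms, prop_correct), 3))
```

      Because Spearman's rho depends only on ranks, it is unaffected by the ceiling of a bounded measure such as proportion correct, as long as the relationship stays monotonic.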

      (15)  Line 377: "trial 𝑛 + 1 was randomly sampled from a uniform distribution between 0.5 and 1.25 seconds." It's just confusing why post-response activity in Figure 5 does look so consistent. Throughout methods: "model was fitted" should be "was fit", and line 448, "were split".  

      We do not have a specific hypothesis of why the post-response activity in the previous Figure 5 was so consistent. Maybe the Gaussian window (same as in other manuscripts with a similar figure, e.g. O’Connell et al. 2012) generated this consistency. We also corrected the errors mentioned in the methods.

      (16) The linear mixed models paragraph is a bit confusing. Can it clearly state which data/ table is being referred to and then explain the model? "The general linear mixed model on proportion of correct responses was performed using a logit link. The linear mixed models were performed on the raw milliseconds scale for the interval durations and on the standardized values for the electrode match." We go directly from proportion correct to raw milliseconds...  

      The confusion was indeed due to the initial inclusion of a general linear mixed model on proportion correct, which was removed as it was not very informative. The new revision should be clearer on the linear mixed models (see the first sentence of the subsection ‘linear mixed models’ in the method section).

      (17) A fuller description of the HMP model would be appreciated.  

      We agree that this was necessary and added the description of the HMP model in the corresponding method section “Hidden multivariate pattern” in addition to a more comprehensive presentation of HMP in the first paragraph of the Result and Discussion sections.

      (18) Line 458: "Fechner's law (Fechner, 1860) states that the perceived difference (𝑝) between the two patches follows the logarithm of the difference in physical intensity between..." ratio of physical intensity.  

      Corrected.

      (19) P is defined in equations 2 and 4. I would include the beta in equation 4, like in equation 2, then remove the beta from equations 3 and 5 (makes it more readable). I would also just include the delta in equation 2, state that in this case, c1 = c+delta/2 or whatever.  

      This indeed makes the equation more readable so we applied the suggestions for equations 2, 3, 4 and 5. The delta was not added in equation 2 but instead in the text that follows:

      “Where 𝐶1 = 𝐶0 + 𝛿, again with a modality and individual specific adjustment slope (𝛽).” 
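      As a worked illustration of this definition, the sketch below assumes the form p = β · log(C1 / C0) with C1 = C0 + δ, following the "ratio of physical intensity" wording above. The exact equations of the manuscript are not reproduced in this letter, so the function name, the value of `beta`, and the contrast values are all assumptions.

```python
import math


def fechner_perceived_difference(c0, delta, beta=1.0):
    """Perceived difference p = beta * log(C1 / C0), with C1 = C0 + delta.

    beta stands for the modality- and individual-specific slope; 1.0 is
    only a placeholder value.
    """
    if c0 <= 0 or delta < 0:
        raise ValueError("c0 must be positive and delta non-negative")
    c1 = c0 + delta
    return beta * math.log(c1 / c0)


delta = 0.05  # hypothetical fixed physical difference between the two patches
pedestals = [0.1, 0.2, 0.4, 0.8]  # hypothetical pedestal contrasts C0
perceived = [fechner_perceived_difference(c0, delta) for c0 in pedestals]

for c0, p in zip(pedestals, perceived):
    print(f"C0={c0:.1f}  C1={c0 + delta:.2f}  p={p:.3f}")
```

      With a fixed δ, the perceived difference shrinks as the pedestal contrast grows, which is why this form predicts harder discrimination at higher contrast.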

      (20) The appendix suggests comparing the amplitudes with those in Figure 3, but the colour bar legend is missing, so the reader can only assume the same scale is used?  

      We added the color bar, as it was indeed missing. Note, though, that the previous version displayed the estimation for the simulated data, whereas this plot in the revised manuscript shows the solution on real data obtained after downsampling (and therefore looking for a larger pattern than in the main text). We believe that this representation is more useful, given that the solution for the downsampled data is no longer the same as the one in the main text (due to the difference in pattern width).

    1. Author response:

      Reviewer #1:

      (1) Thank you for pointing out the risk of sensationalizing the ramifications of procrastination for psychopathology. We will rewrite the Introduction section, adding balanced evidence and toning down such inappropriate claims throughout.

      (2) Thank you for raising this crucial question. We apologize for this fundamental problem with the preregistration, which stems from a serious technical hurdle. OSF has suspended my account, claiming to have detected “suspicious user activities” in it. As a result, none of the materials already deposited in this account, including the preregistration, can be accessed. We have contacted the OSF team but have not received a workable technical solution. We suspect that the suspension was triggered in error by my change of affiliation to the Third Military Medical University of the People’s Liberation Army (PLA). To resolve this issue, we will upload the preregistration to a new repository soon.

      (3) This is a back-to-back study that conceptually probes whether strengthening the left DLPFC can mitigate procrastination by reducing task aversiveness or by increasing the weight placed on outcome value. We therefore selected a medium effect size a priori, following the previous study (Xu et al., 2023). This effect size was calculated with the “Power Contours” tool (Baker et al., 2021), which accounts for the gain in statistical power from additional within-subject repeated measures. As you kindly pointed out, we will clarify the effect size calculation in the revised manuscript.

      (4) Yes, both groups came into the lab the same number of times for tDCS stimulation; the only difference was the stimulation type (active vs. sham).

      (5) We will add full details to clarify the TDM and hyperbolic discounting modeling.

      (6) Thank you for raising this crucial statistical question. We will double-check whether the multiple sessions were modeled as random slopes and will rerun the analysis if those random slopes were omitted.

      (7) Thank you. We did not intend to cause confusion by adding those complicated statistics; our aim was to enrich the interpretation of the findings.

      (8) Yes, as mentioned above, we will add balanced evidence to the Introduction section to clarify that both the left and the right DLPFC may contribute to self-control.

      (9) Yes, this is a conceptual hypothesis: actively stimulating the left DLPFC could improve self-control functions. Thank you for this nuanced but crucial insight; we will explicitly clarify the nature of our conclusions.

      (10) Yes, we confirm that all participants successfully completed their tasks before the deadline at sessions 6 and 7, so that the procrastination rates all decreased to 0. This was somewhat surprising to us as well, but we have verified that it is the case. We have also received written letters of thanks from some of the participants in the active group. This is thus a surprising but exciting finding. Furthermore, thank you for the helpful suggestion: we will run this robustness check by iteratively removing each session, to rule out statistical bias from an extreme pattern.

      (11) Yes, we fully agree that the full details belong in the main text rather than in the Supplemental Materials, and we will move them there in the first round of revision.

      Reviewer #2:

      (1) Thank you for this crucial suggestion. We are sorry that many details were omitted to comply with the editorial requirements of Nature Human Behaviour (our previous submission). We apologize for the confusion caused by those ambiguous descriptions and will clearly describe how we measured participants’ procrastination in real-world tasks. In brief, we asked each participant to report a real task that was due to take place the next day, with a deadline no later than that day. On that day, we used ESM to ask the participant to report the real task completion rate (0-100%) at five time points before the deadline. The five time points were determined by a hyperbolic discounting model (how and why we set them will be explained in the full author response letter). Each time a participant reported the task completion rate, she/he was required to provide a photo to prove its authenticity. The dependent variable, the real-world procrastination rate, is thus calculated as 100% minus the task completion rate (0-100%) at the deadline. That is, if a participant reports that the task was fully completed at or before the deadline, his/her real-world procrastination rate is 100% - 100% = 0%; if the task was 60% complete at the deadline, the rate is 100% - 60% = 40%. To guard against spurious reporting, we asked all participants to provide photos verifying the real task completion rate. This is only a brief illustration; we will give the full details in the formal author response letter.
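      The scoring rule described in this response can be written out as a short sketch. The function name and the example reports are hypothetical, and the sketch assumes that the last of the five ESM reports is the one taken when the deadline is reached.

```python
def procrastination_rate(completion_at_deadline):
    """Real-world procrastination rate (%) = 100 - completion rate (%) at the deadline."""
    if not 0 <= completion_at_deadline <= 100:
        raise ValueError("completion rate must lie between 0 and 100")
    return 100 - completion_at_deadline


# Hypothetical ESM reports (% complete) at the five time points; only the
# final report, assumed to coincide with the deadline, enters the score.
reports = [0, 10, 25, 40, 60]
print(procrastination_rate(reports[-1]))  # 100 - 60 = 40
```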

      (2) This is a very meaningful point. We agree that participants might have learned to complete the experimental task swiftly rather than benefiting from neuromodulation. This is a plausible concern, but it is mitigated by our experimental controls and empirical observations. First, we never told participants “You must complete this task” or “Task completion is associated with the bonus/rewards you may get”, so they had no incentive to do so. Second, the task completion rate was not based solely on self-report: participants were required to provide photos for verification, which limits the risk of spurious reporting. Lastly, all participants, in both the active and the sham group, received identical treatment except for the “real stimulation” versus “sham stimulation” protocol. The results showed a significant improvement in the active group but not in the sham group, indicating no significant “placebo” or “task learning” effect.

      (3) Thank you. As you kindly suggested, we will add extensive details on those measures in the revised manuscript. Although collecting scale-based procrastination scores after neuromodulation is a great idea, we did not collect them, and we will acknowledge this in the Limitations section.

      (4) Yes, this is a conceptual hypothesis: actively stimulating the left DLPFC could improve self-control functions. We cannot rule out that this neuromodulation protocol also enhanced working memory, attention, or other cognitive components. We fully agree with this helpful recommendation: we will tone down the claims regarding the role of the DLPFC in self-control and explicitly note that this mechanism may be specific to procrastination.

      Reviewer #3:

      (1) Thank you for taking valuable time to review our manuscript. Yes, the limited sample size warrants caution in drawing firm conclusions, and we will state this in the Limitations section. We have also streamlined and tightened the statistics section by removing complicated and redundant statistical models.

      (2) As mentioned above, we apologize for this fundamental problem with the preregistration, which stems from a serious technical hurdle. OSF has suspended my account, claiming to have detected “suspicious user activities” in it. As a result, none of the materials already deposited in this account, including the preregistration, can be accessed. We have contacted the OSF team but have not received a workable technical solution. We suspect that the suspension was triggered in error by my change of affiliation to the Third Military Medical University of the People’s Liberation Army (PLA). To resolve this issue, we will upload the preregistration to a new repository soon.

      (3) Yes, thank you for this very helpful suggestion. As you kindly indicated, we will clarify the measures, analyses, methods, and protocols, and tighten the manuscript as a whole.

      References

      Baker, D. H., Vilidaite, G., Lygo, F. A., Smith, A. K., Flack, T. R., Gouws, A. D., & Andrews, T. J. (2021). Power contours: Optimising sample size and precision in experimental psychology and human neuroscience. Psychological Methods, 26(3), 295–314. https://doi.org/10.1037/met0000337

      Xu, T., Zhang, S., Zhou, F., & Feng, T. (2023). Stimulation of left dorsolateral prefrontal cortex enhances willingness for task completion by amplifying task outcome value. Journal of Experimental Psychology: General, 152(4), 1122–1133. https://doi.org/10.1037/xge0001312

      Again, we wholeheartedly appreciate all of these helpful and insightful comments, each of which contributes substantially to the quality of this manuscript. Please note that the responses presented above are provisional. We will revise our manuscript following these suggestions one by one and provide a full-length response letter.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This report demonstrates that the gene expression output of the Wnt pathway, when controlled precisely by a synthetic light-based input, depends substantially on the frequency of stimulation. The particular frequency-dependent trend that is observed - anti-resonance, a suppression of target gene expression at intermediate frequencies given a constant duty cycle - is a novel aspect that has not been clearly shown before for this or other signaling pathways. The paper provides both clear experimental evidence of the phenomenon with engineered cellular systems and a model-based analysis of how the pairing of rate constants in pathway activation/deactivation could result in such a trend.

      Strengths:

      This report couples in vitro experimental data with an abstracted mathematical model. Both of these approaches appear to be technically sound and to provide consistent and strong support for the main conclusion. The experimental data are particularly clear, and the demonstration that Brachyury expression is subject to anti-resonance in ESCs is particularly compelling. The modeling approach is reasonably scaled for the system at the level of detail that is needed in this case, and the hidden variable analysis provides some insight into how the anti-resonance works.

      Weaknesses:

      (1) The anti-resonance phenomenon has not been demonstrated using physiological Wnt ligands; however, I view this as only a minor weakness for an initial report of the phenomenon. The potential significance of the phenomenon for Wnt outweighs the amount of effort it would take to carry the demonstration further - testing different frequencies/duty cycles at the level of ligand stimulus using microfluidics could get quite involved, and would likely take quite some time. Adding some more discussion about how the time scales of ligand-receptor binding could play into the reduced model would further ameliorate this issue.

      We thank the reviewer for this comment and the interesting suggestion to test the anti-resonance phenomenon with microfluidics. We agree that combining physiological Wnt ligands with microfluidic stimulation would go beyond the scope of this current study, though it is an interesting extension. One advantage of the optogenetic setup, as mentioned in the discussion, is that the Wnt stimulus can be turned off sharply. This allows us to test the output from perfectly square wave input profiles; in microfluidics, washing the sticky ligand off the cells might “smear” the effective input profile cells respond to.

      We show in Supplement Fig. 6 that our reduced model matches the experimental data and that we would expect the antiresonance phenomenon as long as (see Fig. 4). Practically, a smeared input profile implies an effective reduction of k_off, which means that the phenomenon would be visible with microfluidics (provided the minimum is deep enough, see Fig. 4). However, this should still be considered with caution, as the antiresonance would then appear because the cells essentially receive a smeared out or continuous pulse in the high frequency limit, rather than cells responding to a square wave in a specific way.
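      The "smearing" argument can be illustrated with a toy first-order filter. This is not the reduced model of the manuscript: it is a single variable x driven by a square-wave input u(t), dx/dt = k_on · u · (1 − x) − k_off · x, integrated with forward Euler. All rate constants, the period, and the duty cycle are arbitrary illustrative values.

```python
def steady_state_ripple(k_on, k_off, period, duty=0.5, dt=0.001, n_periods=50):
    """Drive dx/dt = k_on * u * (1 - x) - k_off * x with a square wave u(t)
    and return max(x) - min(x) over the final simulated period."""
    steps_per_period = round(period / dt)
    on_steps = round(duty * steps_per_period)
    x = 0.0
    final_period = []
    for step in range(n_periods * steps_per_period):
        # Square-wave input: on for the first `duty` fraction of each period.
        u = 1.0 if step % steps_per_period < on_steps else 0.0
        x += dt * (k_on * u * (1.0 - x) - k_off * x)  # forward Euler step
        if step >= (n_periods - 1) * steps_per_period:
            final_period.append(x)
    return max(final_period) - min(final_period)


sharp = steady_state_ripple(k_on=1.0, k_off=1.0, period=1.0)
smeared = steady_state_ripple(k_on=1.0, k_off=0.1, period=1.0)
print(f"ripple with fast switch-off (k_off=1.0): {sharp:.3f}")
print(f"ripple with slow switch-off (k_off=0.1): {smeared:.3f}")
```

      In this toy setting, a slower switch-off leaves less of the square-wave modulation in x, so the downstream system sees something closer to a continuous input, which is the caveat raised above for ligand wash-out.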

      (2) While the model is fully consistent with the data, it has not been validated using experimental manipulations to establish that the mechanisms of the cell system and the model are the same. There may be some ways to make such modifications, for example, using a proteasome inhibitor. An alternative would be to more explicitly mention the need to validate the model's mechanism with experiments.

      We thank the reviewer for this valuable and constructive comment. We agree that future experimental perturbations that directly modulate pathway activation and reset kinetics—such as proteasome inhibition, targeted degradation of pathway components, or engineered changes in receptor turnover—would provide an important validation of the model’s mechanistic interpretation. In the present study, our primary goal was to establish the existence and quantitative features of anti-resonance in the Wnt pathway and to identify the minimal set of timescale relationships that can explain it. We view the proposed experimental validations as exciting next steps that extend beyond the scope of the current work, and we are grateful to the reviewer for emphasizing their importance. We now mention this explicitly in the discussion of our manuscript.

      (3) I think the manuscript misses an opportunity to discuss the potential of the phenomenon in other pathways. The hedgehog pathway, for example, involves GSK3-mediated partial proteolysis of a transcription factor, which could conceivably be subject to similar behaviors, and there are certainly other examples as well.

      We thank the reviewer for pointing out an opportunity to emphasize the possibility of this phenomenon in other pathways. The minimal model indicates that anti-resonance emerges whenever a rapid activating process is paired with a slower deactivating/reset process. Beyond Hedgehog/Gli processing, candidate circuits include: NF-κB (rapid IκBα phosphorylation/degradation vs slower IκBα resynthesis), ERK (fast phosphorylation bursts vs slower transcriptional negative feedback such as DUSPs), Notch (fast γ-secretase NICD release vs slower NICD turnover and feedback), BMP/TGF-β–SMAD (fast R-SMAD phosphorylation vs slower receptor trafficking/SMAD7 feedback), and Hippo/YAP (rapid cytoplasmic sequestration vs slower transcriptional feedback). Each contains the same timescale separation that should create a frequency ‘stop-band,’ predicting suppressed gene expression or fate transitions at intermediate stimulation frequencies. We have updated the manuscript’s discussion to mention the Hedgehog connection with the following added sentence in the discussion: Analogous band-stop filtering should arise in other developmental circuits that couple a fast ‘ON’ step to slower deactivation or negative feedback. In Hedgehog, for example, PKA/CK1/GSK3-mediated partial proteolysis of Gli with slower recovery of full-length Gli creates the same fast-activation/slow-reset motif our hidden-variable model predicts will yield anti-resonance, and Wnt–Hedgehog crosstalk through the shared kinase GSK3 suggests such frequency selectivity could occur in other developmental signaling pathways.

      We also added an additional sentence regarding different activation and deactivation timescales in other pathways.

      (4) Some aspects of the modeling and hidden variable analysis are not optimally presented in the main text, although when considered together with the Supplemental Data, there are no significant deficiencies.

      We now address the model choices and analysis more clearly in the main manuscript and refer to the Supplemental Data more directly.

      Reviewer #2 (Public review):

      Summary:

      By combining optogenetics with theoretical modelling, the authors identify an anti-resonance behavior in the Wnt signaling pathway. This behavior is manifested as a minimal response at a certain stimulation frequency. Using an abstracted hidden variable model, the authors explain their findings by a competition of timescales. Furthermore, they experimentally show that this anti-resonance influences the cell fate decision involved in human gastrulation.

      Strengths:

      (1) This interdisciplinary study combines precise optogenetic manipulation with advanced modelling.

      (2) The results are directly tested in two different systems: HEK293T cells and H9 human embryonic stem cells.

      (3) The model is implemented based on previous literature and has two levels of detail: i) a detailed biochemical model and ii) an abstract model with a hidden parameter.

      Weaknesses:

      (1) While the experiments provide both single-cell data and population data, the model only considers population data.

      We thank the reviewer for correctly pointing out that the single-cell measurements would in principle allow us to incorporate the cell-to-cell heterogeneity into the model. In this study, we sought to identify a minimal quantitative model of the Wnt pathway that could explain anti-resonance through competing time scales. We believe that, for our purposes, focusing on population data allowed us to keep the complexity of the model to a minimum to increase its explanatory value. We agree with the reviewer that considering single-cell trajectories is an interesting direction for further work.

      (2) Although the model captures the experimental data for TopFlash very well, the beta-Cat curves (Figure 2B) are only described qualitatively. This discrepancy is not discussed.

      Indeed, our model fits to mean β-catenin expression are more qualitative than those for TopFlash. The fit for β-catenin was tricky, as expression of β-catenin is typically low and closer to the detection limit than TopFlash. These experimental constraints mean that the variation between individual signal trajectories, relative to the light-off condition, is higher for β-catenin than for TopFlash. Therefore, we strove to obtain a qualitative rather than a quantitative fit to the mean expression profile of β-catenin. The current model fit is well within one standard deviation of the data. Given the observed heterogeneity and the fact that we take the parameters from the literature (which ensures that the order of magnitude of the parameters is in a sensible range), we believe that the model fits are reasonable. We now mention this explicitly in the text.

      Overall Assessment:

      The authors convincingly identified an anti-resonance behavior in a signaling pathway that is involved in cell fate decisions. The focus on a dynamic signal and the identification of such a behavior is important. I believe that the model approach of abstracting a complicated pathway with a hidden variable is an important tool to obtain an intuitive understanding of complicated dependencies in biology. Such a combination of precise optogenetic manipulation with effective models will provide a new perspective on causal dependencies in signaling pathways and should not be limited only to the system that the authors study.

      We thank both reviewers for the positive assessment of our manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      There are several points that deserve more discussion, as noted above in the review.

      (1) It would be worthwhile to consider whether a relatively simple experiment with a proteasome inhibitor or similar pharmacological manipulation could provide useful validation data for the model.

      We address this point above in the weaknesses section from reviewer 1.

      (2) The figure legend for S5C should clarify whether the values plotted are at a particular fixed time point, or (more likely) at a certain time following the second pulse, which would be variable.

      We have modified the figure caption to clarify that the values plotted are at a fixed time point in the simulation (t = 48 hrs). We chose this time point sufficiently long after the second pulse to ensure that there are no residual dynamical effects. We thank the reviewer for noting this.

      (3) As noted in the Sci Score document, various aspects of the resource reporter should be improved, such as including RRIDs, etc.

      We are sending our plasmids to Addgene; the versions of Python and MATLAB used are listed in our Methods section.

      Reviewer #2 (Recommendations for the authors):

      I mostly have suggestions to improve the clarity of the presentation.

      (1) Not all symbols in the equations given in the main text are explained. This is rather annoying, because either you present them and explain what they are or you don't show them and refer to the supplements. For example, d_0 or c_o or \bar{b} or n or K are not explained.

      We have now more clearly presented the parameters in the main text and added signposts to the Methods section.

      (2) Overall, it is often not clear what data in the figures are redundant, although the authors referred to them in the text. For example, in Figure 2c, a curve for 24 hours is shown and referred back to Figure 1D. However, in Figure 1D there is no curve for 24 hours. Is the data from Supplementary Figure 1 H and K also in the main text?

      We thank the referee for pointing out these redundancies. We have now included the 24hr line in Figure 1D and are now only showing the unsmoothed data, also in the main text of the manuscript. To clarify supplemental figures, we have now removed S1H and S1K since all they showed was the unsmoothed version of the data. The remaining plots in Supplementary Figure 1 are normalized differently from what we show in Figure 1 to demonstrate our choice of normalization is not the reason for the observed optogenetic response.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      (1) The authors state that more is known about glial reactivation than cell-cycle re-entry. They are confusing many points here. More gene networks that require cell-cycle re-entry are known. Some of the genes listed for "reactivation" are, in fact, required for cell cycle re-entry/proliferation. And the authors confuse gliosis vs glial reactivation.

      We thank the reviewer for this important and constructive comment. We fully agree that clearly distinguishing between the concepts of glial reactivation, glial proliferation, gliosis, and neurogenesis is essential to avoid conceptual confusion in our study.

      Injury-induced retinal regeneration in zebrafish:

      Glial reactivation refers to the initial response of quiescent Müller glia (MG) to injury, characterized by morphological changes and upregulation of reactive markers (e.g., gfap, ascl1a, lin28a) and activation of signaling pathways such as Notch, Jak/Stat, and Wnt (Lahne et al., 2020; Pollak et al., 2013; Sifuentes et al., 2016; Yao et al., 2016).

      Glial proliferation refers to the clonal expansion of these MG-derived progenitor cells, which undergo rapid cell-cycle re-entry and amplify to generate sufficient progenitors for regeneration (Iribarne and Hyde, 2022; Lee et al., 2024; Wan and Goldman, 2016).

      Gliosis vs neurogenesis represents a divergent fate decision following proliferation. In zebrafish, MG-derived progenitor cells differentiate into retinal neurons that can replace those damaged or lost due to retinal injury. In contrast, mammalian MG tend to undergo an initial gliotic surge and rapidly revert to a quiescent state, exhibiting gliosis and glial scarring (Thomas et al., 2016; Yin et al., 2024). Thus, we fully agree that gliosis should not be confused with glial reactivation: glial reactivation is the very first step of the glial injury response, whereas gliosis is the very last glial response to the injury.

      We agree with the reviewer that many genes typically described as “reactivation markers” (e.g., ascl1a, lin28a, sox2, mycb, mych) are also essential regulators of cell-cycle re-entry (Gorsuch et al., 2017; Hamon et al., 2019; Lee et al., 2024; Lourenço et al., 2021; Pollak et al., 2013; Thomas et al., 2016). Because glial reactivation precedes and leads to glial proliferation, the regulators of glial reactivation are expected to be required for glial proliferation as well.

      In our study, we focused on the states preceding glial proliferation to understand the mechanism underlying injury-induced glial cell-cycle re-entry. We defined these transitional states and the subsequent proliferative MG states based on single-cell RNA-seq trajectory analysis. (revised lines: 41-58)

      (2) A major weakness of the approach is testing cone ablation and regeneration in early larval animals. For example, cones are ablated starting the day that they are born. MG that are responding are also very young, less than 48 hrs old. It is also unclear whether the immune response of microglia is a mature response. All of these assays would be of higher significance if they were performed in the context of a mature, fully differentiated, adult retina. All analysis in the paper is negatively affected by this biological variable.

      We thank the reviewer for raising this important point regarding the developmental stage of the retina in our model system. We have carefully considered this concern and now provide additional clarification and justification, as follows:

      (1) The glial responses in larval and adult retina:

      Previous studies have demonstrated that injury-induced glial responses, including reactive gliosis marked by gfap upregulation and proliferation, are largely conserved between the larval and adult zebrafish retina (Meyers et al., 2012; Sarich et al., 2025). In our study, G/R cones were ablated beginning at 5 dpf using metronidazole (MTZ), and we observed robust induction of PCNA⁺ MG in the inner nuclear layer, consistent with injury-induced proliferation (Figure 1E). These findings align with previous studies showing that key features of MG regenerative responses are conserved across larval and adult stages.

      (2) The microglial responses in larval and adult retina:

      Retinal microglia are functionally mature by 5 dpf in the zebrafish retina (Mazzolini et al., 2020; Svahn et al., 2013), and prior studies have demonstrated that microglia in larval and adult zebrafish exhibit similar responses to injury, including migration, morphological activation, and phagocytosis (Nagashima and Hitchcock, 2021; White et al., 2017). In our experiments using Tg(mpeg1: GFP) larvae, we observed clear microglial recruitment to the outer nuclear layer (ONL) following cone ablation (Figure 1E and Figure 1-figure supplement 1A), supporting the functional competence of larval microglia in injury-induced immune responses.

      (3) The contribution using larval animals to study the regeneration program:

      We agree that regeneration studies in the adult retina can provide important biological insights, particularly in a fully differentiated tissue environment. Accordingly, we have acknowledged this limitation in our revised manuscript “limitations of this study” section (revised lines 534-540: “1. Our study focuses on larval zebrafish, in which the core features of MG and immune responses are conserved compared to the adult. However, we acknowledge that the adult retina—with its fully matured differentiated retina and immune response—provides irreplaceable biological insight. Nevertheless, larval models offer a powerful platform to uncover conserved regenerative mechanisms and serve as a valuable complement for identifying age-dependent differences in MG-mediated regeneration.”) and have stated our intention to extend future analyses to adult zebrafish, especially to explore age-dependent differences in redox signaling and MG proliferation. At the same time, we believe that the larval model offers unique advantages for uncovering fundamental, conserved mechanisms of regeneration and enables characterization of age-dependent regulatory differences. Thus, our study in larval animals serves as a complementary and informative platform for understanding both the conserved and developmental stage-specific features of injury-induced regeneration.

      (4) Related to the above point, the clonal analysis of cxcl18b+ MG is complicated by the fact that new MG are still being born in the CMZ (as are new cones for that matter).

      We thank the reviewer for raising this important point regarding potential contributions from CMZ-derived progenitors to the lineage-traced cxcl18b⁺ MG clones. To address this concern, we provide the following evidence arguing against a CMZ origin for the clones analyzed:

      Spatial restriction of clones: All clones included in our analysis were located exclusively within the central and dorsal retina, as shown in Figure 2H. From the spatial distribution of reactive MG populations across the retina, we observed a patterned organization in which the vast majority of proliferating MG arose from local mature MG–derived progenitors, rather than from peripheral CMZ-derived progenitors. However, we acknowledge that we cannot entirely exclude the possibility that CMZ-derived progenitors contribute to injury-induced MG proliferation, particularly in the peripheral retina.

      We have clarified this point in the revised Methods section (revised lines 756–762: “Clone analysis of cxcl18b<sup>+</sup> lineage-traced MG was restricted to cells located in the central and dorsal region of the zebrafish retina after G/R cone ablation in Figure 2, Figure 6, and their figure supplement. This spatial restriction strongly suggests that the proliferative MG originate from local mature MG, although we cannot completely rule out the possibility that CMZ-derived progenitors contribute to the generation of proliferative MG in the peripheral retina.”) and updated the corresponding figure legends.

      (4) A near identical study was already done by Hoang et al., 2020, in adult zebrafish, a more relevant biological timepoint. Did the authors check this published RNA-seq database for their gene(s) of interest?

      We thank the reviewer for pointing out the relevance of the study by Hoang et al., 2020, which characterized the transcriptional dynamics of MG reactivation in the adult zebrafish retina. We agree that comparisons with their single-cell RNA-seq dataset are important to confirm the conservation of our findings in larval vs adult zebrafish.

      To this end, we examined the adult zebrafish MG dataset reported by Hoang et al., and confirmed that cxcl18b is also present and enriched in their analysis, particularly in activated MG populations under various injury paradigms:

      (1) cxcl18b is listed as a differentially expressed gene (DEG) in Supplementary Table ST2, enriched in GFP⁺ MG following injury. It is also significantly upregulated in both NMDA-induced and light damage conditions, as shown in Supplementary Table ST3.

      (2) In Supplementary Table ST5, cxcl18b is identified as a classifier of activated MG, with classification power scores of 0.552 (NMDA), 0.632 (light damage), and 0.574 (TNFα + γ-secretase inhibitor treatment), indicating its consistent expression across multiple injury models.

      (3) In their pseudotime analysis (Figure 4C and Supplementary Table ST8), cxcl18b is specifically expressed in Module 5, which is expressed earlier along the trajectory than ascl1a. This temporal pattern of cxcl18b preceding ascl1a expression is consistent with our trajectory analysis in larval MG (Figure 1H), further supporting its role as an early marker of the transitional state before proliferation.

      These findings underscore the robustness and biological relevance of cxcl18b as a conserved marker of injury-responsive MG in both larval and adult zebrafish. Our data expand upon the prior work by specifically characterizing a cxcl18b-defined transitional MG state preceding cell-cycle re-entry, thereby offering additional insights into the temporal staging of MG activation during regeneration.

      (5) KD of cxcl18b did not affect MG proliferation or any other defined outcome. And yet the authors continually state such phrases as "microglia-mediated inflammation is critical for activating the cxcl18b-defined transitional states that drive MG proliferation." This is false. Cxcl18b does not drive MG proliferation at all.

      We thank the reviewer for raising this concern. We agree with the reviewer and have revised this statement as "These findings suggest that microglia-mediated inflammation may contribute to the activation of cxcl18b-defined transitional states that precede MG proliferation, although a causal relationship remains to be established." (revised lines 251-253).

      (6) A technical concern is that intravitreal injections are not routinely performed in larval fish.

      We appreciate the reviewer’s technical concern regarding the use of intravitreal injections in larval zebrafish. In our study, we performed intraocular injections according to previously established methods (Alvarez et al., 2009; Giannaccini et al., 2018; Rosa et al., 2023). This approach involves carefully delivering a small volume of viral suspension into the intraocular space using a glass micropipette. To address this concern, we will revise the Materials and Methods section to clearly describe the injection procedure and will cite the relevant references accordingly.

      Reviewer #2:

      (1) The authors note a peak of PCNA+ Muller glia at 72 hours post injury. This is somewhat surprising as the MG would be expected to generate progenitor cells that would continue proliferating and stain with PCNA. Indeed, only a handful of PCNA+ cells are seen in the INL/ONL layer in Figure 1E2 with few clusters of progenitors present. It would be helpful to stain with a Muller glia marker to confirm these PCNA+ cells are Muller glia. It's also curious that almost all the PCNA+ cells are in the dorsal retina, even though G/R cone loss extends across both dorsal and ventral retina. Is this typical for cone ablation models in larval zebrafish?

      We thank the reviewer for their insightful comment regarding the spatial distribution and identity of PCNA⁺ cells following injury.

      In our study, we observed that the injury-induced proliferating cells (PCNA⁺) were predominantly located in the central and dorsal regions of the retina at 72 hours post-injury (hpi) (Figure 1E). To verify the identity of these proliferating cells, we performed additional immunostaining for BLBP and confirmed that the majority of PCNA⁺ cells also express BLBP (Figure 1–figure supplement 1B in our revised data), supporting their MG origin.

      The regional bias of MG proliferation towards the central and dorsal retina is consistent with previous findings. Notably, Krylov et al. (2023) demonstrated that MG exhibit region-specific heterogeneity in their regenerative responses to photoreceptor ablation. Their study identified proliferative MG subpopulations predominantly in the central (fgf24-expressing) and dorsal (efnb2a-expressing) domains, whereas ventral MG showed limited proliferative capacity. These observations provide a plausible explanation for the spatially restricted PCNA⁺ MG population observed in our model following cone ablation.

      (2) In Line 148: What is meant by "most original MG states" in this context? Original meaning novel? Or original meaning the earliest state MG adopted following injury? The language here is confusing.

      We thank the reviewer for pointing out the ambiguous phrasing in our original manuscript. The term “most original MG states” was imprecise and misleading, as it could be interpreted as referring to the quiescent state of MG. We intended to describe the earliest transitional states that MG adopt in response to injury, as they begin to exit quiescence and acquire reactive characteristics. These early transitional MG populations co-express quiescence markers such as cx43 and early reactive markers such as gfap, as shown in Figure 1H.

      To avoid confusion and improve conceptual clarity, we have revised the manuscript by replacing “most original MG states” with “early transitional MG state” (revised line 154) and have added a clearer explanation in the corresponding Results section to define this population more accurately.

      (3) Perhaps provide a better image in Figure 2A of the cxcl18b at 48 hpi and 72 hpi. The current images appear virtually identical, with very little cxcl18b expression observed, especially compared to the 24 hpi. This is in contrast to the Tg(cxcl18b:GFP) transgenic line shown in Figure 2D, which indicates either much higher expression in proliferating cells at 48 hpi or the stability of GFP protein. Can the authors provide guidance on the accurate temporal expression of cxcl18b? Does expression peak rapidly at 24 hpi and then rapidly decline or is there persistence of expression to 48-72 hpi?

      We appreciate the reviewer’s careful observation regarding the apparent similarity of cxcl18b expression at 48 hpi and 72 hpi in the in situ hybridization (ISH) images (Figure 2A), and the differences compared to the Tg(cxcl18b: GFP) reporter line shown in Figure 2D.

      (1) The similarity of ISH images at the 48 hpi and 72 hpi (Figure 2A):

      The cxcl18b mRNA signal peaked at 24 hpi, suggesting a rapid transcriptional response after retinal injury. By 48 hpi, cxcl18b expression had already declined substantially, and by 72 hpi, the signal was further reduced to near-background levels. This temporal expression pattern explains why the ISH images at 48 hpi and 72 hpi appear nearly identical and much weaker than at 24 hpi.

      (2) The discrepancy between ISH and GFP reporter signal (Figure 2D):

      The Tg(cxcl18b: GFP) reporter line shows persistent GFP expression beyond the transcriptional window of cxcl18b mRNA. This may be due to the prolonged stability of the GFP protein, which remains detectable even after endogenous transcription of cxcl18b has diminished. This explanation is also noted in the manuscript (revised lines 198–200). As a result, GFP⁺ MG cells are still visible at 48–72 hpi, and some of them co-label with PCNA.

      These findings are consistent with our pseudotime analysis based on scRNA-seq data (Figure 1H), which shows that cxcl18b expression precedes the induction of proliferative markers such as pcna and ascl1a.

      (4) Line 198: The establishment of the Tg(cxcl18b:Cre-vhmc:mcherry::ef1a:loxP-dsRed-loxP-EGFP::lws2:nfsb-mCherry) is considerable but the nomenclature doesn't properly fit. Is the mCherry fused with Cre and driven by the cxcl18b promoter? What is the vhmc component? Finally, while this may provide the ability to clonally track cxcl18b-expressing MG, it does not address the prior question of what is the actual temporal expression of cxcl18b? If anything, this only addresses whether proliferating MG expressed cxcl18b at some point in their history, but does not indicate that cxcl18b expression co-exists in proliferating cells. The most convincing evidence is in Supplemental Figure 2B.

      The "vmhc" component refers to the ventricular myosin heavy chain promoter, commonly used to label ventricular cardiomyocytes (Jin et al., 2009). We cloned the vmhc upstream region containing its promoter and fused it with mCherry to serve as a selection marker during construction of the transgenic fish line.

      Clone analysis using the Tg(cxcl18b: Cre-vmhc: mCherry::ef1a: loxP-DsRed-loxP-EGFP::lws2: nfsb-mCherry) line further indicates that the cxcl18b-defined transitional state is an essential route to MG proliferation. We have clarified in the revised text that this lineage tracing indicates a “history of injury-induced cxcl18b expression” rather than ongoing expression during proliferation (revised line 205).

      (5) Line 203: The data shown in Figure 2F do not indicate that these MG are cxcl18b+. Rather, the data are consistent with the interpretation that these MG expressed Cre at some prior stage and now express GFP from the ef1a promoter rather than DsRed. Whether these MG continue to express cxcl18b at the time these fish were collected is not addressed by these data. It is not accurate to conclude that these cells are cxcl18b+.

      We thank the reviewer for pointing out this important issue. We agree that the GFP<sup>+</sup> MG shown in Figure 2F represent cells that have previously expressed cxcl18b and thus belong to the cxcl18b-expressing cell lineage, but this does not indicate that they continue to express cxcl18b at the time of sample collection. In clonal analysis using the Cre-loxP system, the GFP signal reflects historical cxcl18b promoter activity rather than ongoing transcription. We have revised the relevant sentence in our manuscript to clarify this point and now refer to these GFP<sup>+</sup> cells as "cxcl18b lineage-traced MG" rather than "cxcl18b<sup>+</sup> MG" to avoid any misinterpretation (revised line 207).

      (6) Line 213: The statement that proliferative MG mostly originated from cxcl18b+ MG transitional states is a conclusion that does not appear fully supported by the data. Whether those MG continue to express cxcl18b remains unanswered by the data in Figure 2 and would likely be inconsistent with the single-cell data in Figure 1.

      We thank the reviewer for this valuable comment. We agree that the original statement on Line 213 regarding the lineage relationship between cxcl18b⁺ transitional MG and proliferative MG required clarification.

      (1) The cxcl18b expression dynamics:

      Our single-cell RNA-seq and ISH analyses consistently show that cxcl18b expression peaks as early as 24 hpi and declines rapidly, with significantly reduced expression by 48 and 72 hpi. These findings suggest that cxcl18b marks an early transitional MG state, rather than being maintained in proliferative MG. Indeed, in our scRNA-seq pseudotime trajectory analysis (Figure 1H), cxcl18b expression is highest in the early transitional MG cluster (Cluster 1) and downregulated as cells progress toward proliferative states (Clusters 3/6), supporting a model in which cxcl18b is downregulated before cell-cycle re-entry.

      (2) Prolonged stability of GFP protein:

      The GFP signal observed in Tg(cxcl18b: GFP) retinas at 72 hpi may reflect the prolonged stability of the GFP protein rather than sustained cxcl18b transcription. The actual expression dynamics of cxcl18b are more directly reflected by our in situ hybridization and single-cell RNA-seq data, both of which show a rapid decline after the early peak at 24 hpi. This explanation is also noted in the manuscript (revised lines 196–197).

      (7) Line 246: The use of Dexamethasone to block inflammation is a widely used approach. However, dexamethasone is a broad-spectrum anti-inflammatory molecule that works through glucocorticoid signaling that may involve more than microglia. The observation that microglia recruitment and cxcl18a expression are both reduced is correlative but does not prove causation. Thus, the data are not sufficient to conclude that microglia-mediated inflammation is critical for activating cxcl18b expression. Indeed, data in Figure 1 indicate that cxcl18b expression occurs prior to microglia migration to the ONL.

      We thank the reviewer for this thoughtful and important comment. We fully acknowledge that dexamethasone is a broad-spectrum anti-inflammatory agent that acts via glucocorticoid receptor signaling and may influence multiple immune and non-immune pathways beyond microglia.

      In our study, dexamethasone treatment led to a reduction in both microglial recruitment and the number of cxcl18b<sup>+</sup> MG at 72 hpi, suggesting a potential association between inflammation and cxcl18b activation. However, we agree that this observation remains correlative and is not sufficient to establish a direct link between microglia activity and cxcl18b induction. Our time-course analysis indicates that cxcl18b expression peaks at 24 hpi, preceding robust microglial accumulation in the ONL, further highlighting the need to clarify the temporal dynamics and cellular sources of inflammatory cues.

      To address this question more conclusively, selective ablation of microglia during cone injury would be necessary. However, implementing such an approach would require a complex intersection of three transgenic lines—Tg(mpeg1: nfsB-mCherry) for microglia ablation, Tg(lws2: nfsB-mCherry) for cone ablation, and Tg(cxcl18b: GFP) for reporting—posing substantial genetic and experimental challenges.

      We have revised the Results section accordingly to state: “These findings suggest that microglia-mediated inflammation may contribute to the activation of cxcl18b-defined transitional states that precede MG proliferation, although a causal relationship remains to be established.” (revised lines 251–253). We also added a new paragraph in the “Result: Clonal analysis reveals injury-induced MG proliferation via cxcl18b-defined transitional states associated with inflammation” as “While dexamethasone suppressed both microglial recruitment and cxcl18b<sup>+</sup> MG generation, its broad anti-inflammatory action precludes definitive conclusions about microglial causality. Dissecting this relationship would require concurrent ablation of microglia and cone photoreceptors using a triple-transgenic strategy, which is beyond the scope of the current study. Targeted approaches will be necessary to resolve the specific role of microglia in initiating cxcl18b expression.” (revised lines 251–258) to explicitly acknowledge this limitation and the need for future studies using microglia-specific ablation models to resolve the mechanism.

      (8) Could the authors clarify the basis of investigating NO signaling, given the relative expression of the genes by either cxcl18b+ MG or uninjured MG? Based on the expression illustrated in Supplemental Figure 4A, there is almost no expression of nos1 or nos2b in any MG. The authors are encouraged to revisit the earlier single-cell data sets to identify those cells that express components of NO signaling to determine the source(s) of NO that could be impacting the Muller glia.

      We thank the reviewer for raising these important points.

      Nitric oxide (NO) signaling has been implicated in the regeneration of multiple zebrafish tissues, including the heart (Rochon et al., 2020; Yu et al., 2024), spinal cord (Bradley et al., 2010), and fin (Matrone et al., 2021). Based on these findings, we hypothesized that NO signaling might also contribute to retinal regeneration.

      As described in the manuscript, we compiled a redox-related gene list and systematically screened their roles in injury-induced MG proliferation using CRISPR-Cas9-mediated gene disruption. Among the candidates, disruption of nos genes significantly reduced the number of PCNA<sup>+</sup> MG cells following G/R cone ablation (Figure 4), prompting us to further investigate the role of NO signaling.

      (9) Line 319-320: this sentence appears to be missing text as "while no influenced across the nos mutants and gsnor mutants" does not make sense.

      We appreciate the reviewer’s observation and agree that the original sentence was unclear. We have revised the sentence in the manuscript as follows:

      “In contrast, no significant change in MG proliferation was observed in nos1, nos2a, or gsnor mutants compared to wild type (Figures 4F–4I)” (revised lines 326-328).

      (10) Line 326-328: The text should be rewritten as the current meaning would suggest there was no significant loss of photoreceptors in the nos2b mutants. That is incorrect. Rather, there was no significant difference between WT and the nos2b mutants in the number of photoreceptors lost at 72 hpi following MTZ treatment. Both groups lost photoreceptors, but the number lost in nos2b hets and homozygotes was the same as WT.

      We agree with the suggestion and have revised our manuscript. We have revised the sentence in the manuscript as follows:

      “We observed no significant difference in the loss of cone photoreceptor at 72 hpi between nos2b mutants and WT, indicating that the reduced MG proliferation observed in nos2b mutants is independent of the injury (WT: 45 ± 8 remaining cones, n = 24; nos2b⁺/⁻: 49 ± 12, n = 20; nos2b⁻/⁻: 46 ± 9, n = 20; mean ± SEM) (Figure 4K).” (revised lines 331-335).

      (11) There is concern over the inconsistencies with some of the data. In Figure 4, Supplement 1A, the single-cell data found virtually no expression of nos2b in either uninjured MG or cxcl18b+ MG. In contrast, the authors find nos2b expression by RT-PCR in the cxcl18b:GFP+ MG. The in situ expression of nos2b in Figure 5 - Supplement 1 is not persuasive. The red puncta are seen in a single cxcl18b:GFP+ cell but also in the plexiform layer and in other non-cxcl18b:GFP+ cells.

      We appreciate the concern regarding the apparent inconsistencies in nos2b expression across different datasets. We provide the following explanations:

      (1) Low expression of nos2b in scRNA-seq data:

      We propose a potential explanation: nitric oxide (NO) signaling is known to exert its biological functions in a dose-dependent manner, and inducible nitric oxide synthase (iNOS) in particular is tightly regulated post-transcriptionally (Bogdan, 2001; Nathan and Xie, 1994; Thomas et al., 2008). Thus, even modest changes in nos2b expression may exert meaningful biological effects while transcript levels remain below the detection threshold of scRNA-seq methods. Supporting this idea, our functional assay (Figure 4J) reveals a clear concentration-dependent effect of NO on MG proliferation, consistent with the biological relevance of Nos2b activity despite its low transcript abundance.

      (2) Regarding the in situ hybridization data:

      We used commercially available in situ hybridization probes from both HCR<sup>TM</sup> and RNAscope<sup>TM</sup> (data not shown) to detect nos2b transcripts. While the nos2b signal was also observed in other retinal cell types, including cells in the plexiform layer, our analysis focused on its expression within the cxcl18b<sup>+</sup> MG lineage.

      (3) Regarding RT-PCR detection of nos2b in cxcl18b: GFP<sup>+</sup> MG:

      To enhance detection sensitivity, we enriched cxcl18b: GFP<sup>+</sup> MG by FACS at 72 hpi and performed cDNA amplification before RT-PCR. This approach allowed the detection of low-abundance transcripts such as nos2b. It is also important to note that RT-PCR reflects fold changes in expression in MG relative to other retinal cell types. The subtle but biologically relevant upregulation of nos2b expression may not be readily captured by in situ hybridization or scRNA-seq.

      (12) Line 356 - there is a disagreement over the interpretation of the current data. The statement that nos2b was specifically expressed in cxcl18b+ transitional MG states is not entirely accurate. This conclusion is based on expression of GFP from a cxcl18b promoter, which may reflect persistence of the GFP protein and not evidence of cxcl18b expression. Even assuming that the nos2b in situ hybridization and RT-PCR data are correct, the data would indicate that nos2b is expressed in proliferating MG that are derived from the cxcl18b+ transitional states. The single-cell trajectory analysis in Figure 2 indicates that cxcl18b is not co-expressed with PCNA. Furthermore, the single-cell data in Figure 4, Supplement 1, indicates no expression of nos2b in cxcl18b+ MG. The authors need to reconcile these seemingly contradictory pieces of data.

      We thank the reviewer for this thoughtful and important comment. We agree that clarification is needed to accurately interpret the relationship between cxcl18b, nos2b, and MG proliferation, particularly considering the different temporal and technical contexts of our datasets.

      (1) Lineage labeling and interpretation of GFP expression:

      We acknowledge that in the Tg(cxcl18b: Cre-vhmc: mcherry::ef1a: loxP-dsRed-loxP-EGFP::lws2: nfsb-mCherry) line, GFP expression reflects historical activity of the cxcl18b promoter, rather than ongoing transcription. Because the GFP protein is long-lived, this signal may persist beyond the time window of endogenous cxcl18b expression. Accordingly, we have revised the manuscript to replace “cxcl18b⁺ MG” with “cxcl18b⁺ lineage-traced MG” throughout the relevant sections to prevent potential misinterpretation.

      (2) Functional experiments support a lineage relationship between cxcl18b⁺ states and nos2b activity:

      To further investigate the regulatory relationship between cxcl18b and nos2b, we conducted NO scavenging experiments using C-PTIO in the Tg(cxcl18b: GFP) background. We observed that the generation of cxcl18b: GFP⁺ MG after injury was not affected by NO depletion, indicating that cxcl18b activation precedes NO signaling (data not shown). However, PCNA⁺ MG was significantly reduced under the same treatment, suggesting that NO signaling is not required for cxcl18b⁺ transitional state formation, but is necessary for proliferation. Together with our MG-specific nos2b knockout data, these results support a model in which nos2b-derived NO acts downstream of the cxcl18b⁺ transitional state to promote MG cell-cycle re-entry.

      (3) The scRNA-seq data with nos2b expression:

      We agree with the reviewer that our scRNA-seq dataset shows minimal overlap between cxcl18b and pcna expression, which is consistent with our interpretation that cxcl18b expression marks a transitional phase before cell-cycle entry. Furthermore, nos2b transcripts were not robustly detected in cxcl18b⁺ MG clusters in our scRNA dataset. This discrepancy may be caused by technical limitations of scRNA-seq in capturing low-abundance or transient transcripts such as nos2b, as discussed in response to comment #11.

      (13) The data in Figure 7 are interesting and suggest a link between NO signaling and notch activity. The use of the C-PTIO NO scavenger is not specific to MG, which limits the conclusions related to autocrine NO signaling in cxcl18b+ MG.

      We acknowledge that the use of C-PTIO cannot distinguish between NO signaling within MG and paracrine effects from other retinal cells. Currently, technical limitations prevent MG-specific NO depletion. We have discussed this limitation accordingly in our revised “Limitations of this study” section (revised lines 540-545: “2. While our data suggest that injury-induced NO suppresses Notch signaling activation and promotes MG proliferation, the use of a general NO scavenger (C-PTIO) does not allow us to determine whether this regulation occurs in an autocrine or paracrine manner. The specific role of NO signaling within cxcl18b⁺ MG requires further validation using MG-specific NO depletion.”)

      (14) Line 446-448. As mentioned before, the data do not support a causative link between microglia recruitment and cxcl18b induction. More specifically, dexamethasone is a broad-spectrum anti-inflammatory drug that blocks microglia activation and recruitment. Critically, the authors demonstrate that expression of cxcl18b occurs prior to microglia recruitment (see Figure 1, Supplement 1). Thus, the statement that cxcl18b induction depends on microglia recruitment is not accurate.

      We thank the reviewer for reiterating this important point. We fully agree that the current data do not support a direct causal relationship between microglia recruitment and cxcl18b induction. As also addressed in our response to Comment 7, dexamethasone, as a broad-spectrum anti-inflammatory agent, cannot distinguish microglia-specific effects from those of other immune components. We have revised the text in revised lines 251–258 to clarify that microglia-mediated inflammation is associated with—but not required for—activation of cxcl18b-defined transitional MG states.

      References:

      Bogdan, C. (2001). Nitric oxide and the immune response. Nature immunology 2, 907-916.

      Bradley, S., Tossell, K., Lockley, R., and McDearmid, J.R. (2010). Nitric oxide synthase regulates morphogenesis of zebrafish spinal cord motoneurons. The Journal of neuroscience : the official journal of the Society for Neuroscience 30, 16818-16831.

      Gorsuch, R.A., Lahne, M., Yarka, C.E., Petravick, M.E., Li, J., and Hyde, D.R. (2017). Sox2 regulates Müller glia reprogramming and proliferation in the regenerating zebrafish retina via Lin28 and Ascl1a. Experimental eye research 161, 174-192.

      Hamon, A., García-García, D., Ail, D., Bitard, J., Chesneau, A., Dalkara, D., Locker, M., Roger, J.E., and Perron, M. (2019). Linking YAP to Müller Glia Quiescence Exit in the Degenerative Retina. Cell reports 27, 1712-1725.e1716.

      Iribarne, M., and Hyde, D.R. (2022). Different inflammation responses modulate Müller glia proliferation in the acute or chronically damaged zebrafish retina. Frontiers in cell and developmental biology 10, 892271.

      Jin, D., Ni, T.T., Hou, J., Rellinger, E., and Zhong, T.P. (2009). Promoter analysis of ventricular myosin heavy chain (vmhc) in zebrafish embryos. Developmental dynamics : an official publication of the American Association of Anatomists 238, 1760-1767.

      Krylov, A., Yu, S., Veen, K., Newton, A., Ye, A., Qin, H., He, J., and Jusuf, P.R. (2023). Heterogeneity in quiescent Müller glia in the uninjured zebrafish retina drive differential responses following photoreceptor ablation. Frontiers in molecular neuroscience 16, 1087136.

      Lahne, M., Nagashima, M., Hyde, D.R., and Hitchcock, P.F. (2020). Reprogramming Müller Glia to Regenerate Retinal Neurons. Annual review of vision science 6, 171-193.

      Lee, M.S., Jui, J., Sahu, A., and Goldman, D. (2024). Mycb and Mych stimulate Müller glial cell reprogramming and proliferation in the uninjured and injured zebrafish retina. Development (Cambridge, England) 151.

      Lourenço, R., Brandão, A.S., Borbinha, J., Gorgulho, R., and Jacinto, A. (2021). Yap Regulates Müller Glia Reprogramming in Damaged Zebrafish Retinas. Frontiers in cell and developmental biology 9, 667796.

      Matrone, G., Jung, S.Y., Choi, J.M., Jain, A., Leung, H.E., Rajapakshe, K., Coarfa, C., Rodor, J., Denvir, M.A., Baker, A.H., et al. (2021). Nuclear S-nitrosylation impacts tissue regeneration in zebrafish. Nat Commun 12, 6282.

      Mazzolini, J., Le Clerc, S., Morisse, G., Coulonges, C., Kuil, L.E., van Ham, T.J., Zagury, J.F., and Sieger, D. (2020). Gene expression profiling reveals a conserved microglia signature in larval zebrafish. Glia 68, 298-315.

      Meyers, J.R., Hu, L., Moses, A., Kaboli, K., Papandrea, A., and Raymond, P.A. (2012). β-catenin/Wnt signaling controls progenitor fate in the developing and regenerating zebrafish retina. Neural development 7, 30.

      Nagashima, M., and Hitchcock, P.F. (2021). Inflammation Regulates the Multi-Step Process of Retinal Regeneration in Zebrafish. Cells 10.

      Nathan, C., and Xie, Q.W. (1994). Nitric oxide synthases: roles, tolls, and controls. Cell 78, 915-918.

      Pollak, J., Wilken, M.S., Ueki, Y., Cox, K.E., Sullivan, J.M., Taylor, R.J., Levine, E.M., and Reh, T.A. (2013). ASCL1 reprograms mouse Muller glia into neurogenic retinal progenitors. Development (Cambridge, England) 140, 2619-2631.

      Rochon, E.R., Missinato, M.A., Xue, J., Tejero, J., Tsang, M., Gladwin, M.T., and Corti, P. (2020). Nitrite Improves Heart Regeneration in Zebrafish. Antioxidants & redox signaling 32, 363-377.

      Sarich, S.C., Sreevidya, V.S., Udvadia, A.J., Svoboda, K.R., and Gutzman, J.H. (2025). The transcription factor Jun is necessary for optic nerve regeneration in larval zebrafish. PloS one 20, e0313534.

      Sifuentes, C.J., Kim, J.W., Swaroop, A., and Raymond, P.A. (2016). Rapid, Dynamic Activation of Müller Glial Stem Cell Responses in Zebrafish. Investigative ophthalmology & visual science 57, 5148-5160.

      Svahn, A.J., Graeber, M.B., Ellett, F., Lieschke, G.J., Rinkwitz, S., Bennett, M.R., and Becker, T.S. (2013). Development of ramified microglia from early macrophages in the zebrafish optic tectum. Developmental neurobiology 73, 60-71.

      Thomas, D.D., Ridnour, L.A., Isenberg, J.S., Flores-Santana, W., Switzer, C.H., Donzelli, S., Hussain, P., Vecoli, C., Paolocci, N., Ambs, S., et al. (2008). The chemical biology of nitric oxide: implications in cellular signaling. Free radical biology & medicine 45, 18-31.

      Thomas, J.L., Ranski, A.H., Morgan, G.W., and Thummel, R. (2016). Reactive gliosis in the adult zebrafish retina. Experimental eye research 143, 98-109.

      Wan, J., and Goldman, D. (2016). Retina regeneration in zebrafish. Current opinion in genetics & development 40, 41-47.

      White, D.T., Sengupta, S., Saxena, M.T., Xu, Q., Hanes, J., Ding, D., Ji, H., and Mumm, J.S. (2017). Immunomodulation-accelerated neuronal regeneration following selective rod photoreceptor cell ablation in the zebrafish retina. Proceedings of the National Academy of Sciences of the United States of America 114, E3719-e3728.

      Yao, K., Qiu, S., Tian, L., Snider, W.D., Flannery, J.G., Schaffer, D.V., and Chen, B. (2016). Wnt Regulates Proliferation and Neurogenic Potential of Müller Glial Cells via a Lin28/let-7 miRNA-Dependent Pathway in Adult Mammalian Retinas. Cell reports 17, 165-178.

      Yin, Z., Kang, J., Xu, H., Huo, S., and Xu, H. (2024). Recent progress of principal techniques used in the study of Müller glia reprogramming in mice. Cell regeneration (London, England) 13, 30.

      Yu, C., Li, X., Ma, J., Liang, S., Zhao, Y., Li, Q., and Zhang, R. (2024). Spatiotemporal modulation of nitric oxide and Notch signaling by hemodynamic-responsive Trpv4 is essential for ventricle regeneration. Cellular and molecular life sciences : CMLS 81, 60.

    1. Author response:

      Reviewer #1 (Public Review):

      Lai and Doe address the integration of spatial information with temporal patterning and genes that specify cell fate. They identify the Forkhead transcription factor Fd4 as a lineage-restricted cell fate regulator that bridges transient spatial transcription factors to terminal selector genes in the developing Drosophila ventral nerve cord. The experimental evidence convincingly demonstrates that Fd4 is both necessary for late-born NB7-1 neurons and sufficient to transform other neural stem cell lineages toward the NB7-1 identity. This work addresses an important question that will be of interest to developmental neurobiologists: How can cell identities defined by initial transient developmental cues be maintained in the progeny cells, even if the molecular mechanism remains to be investigated? In addition, the study proposes a broader concept of lineage identity genes that could be utilized in other lineages and regions in the Drosophila nervous system and in other species.

      Thanks for the accurate summary and positive comments!

      While the spatial factors patterning the neuroepithelium to define the neuroblast lineages in the Drosophila ventral nerve cord are known, these factors are sometimes absent or not required during neurogenesis. In the current work, Lai and Doe identified Fd4 in the NB7-1 lineage that bridges this gap and explains how NB7-1 neurons are specified after Engrailed (En) and Vnd cease their expression. They show that Fd4 is transiently co-expressed with En and Vnd and is present in all nascent NB7-1 progenies. They further demonstrate that Fd4 is required for later-born NB7-1 progenies and sufficient for the induction of NB7-1 markers (Eve and Dbx) while repressing markers of other lineages when force-expressed in neural progenitors, e.g., in the NB5-6 lineage and in the NB7-3 lineage. They also demonstrate that, when Fd4 is ectopically expressed in NB7-3 and NB5-6 lineages, this leads to the ectopic generation of dorsal muscle-innervating neurons. The inclusion of functional validation using axon projections demonstrates that the transformed neurons acquire appropriate NB7-1 characteristics beyond just molecular markers. Quantitative analyses are thorough and well-presented for all experiments.

      Thanks for the positive comments!

      (1) While Fd4 is required and sufficient for several later-born NB7-1 progeny features, a comparison between early-born (Hb/Eve) and later-born (Run/Eve) appears missing for pan-progenitor gain of Fd4 (with sca-Gal4; Figure 4) and for the NB7-3 lineage (Figure 6). Having a quantification for both could make it clearer whether Fd4 preferentially induces later-born neurons or is sufficient for NB7-1 features without temporal restriction.

      We quantified the percentage of Hb+ and Runt+ cells among Eve+ cells with sca-gal4, and the results are shown in Figure 4-figure supplement 1. We found that the proportion of early-born cells is slightly reduced but the proportion of later-born cells remains similar. Interestingly, we also found a subset of Eve+ cells with a mixed fate (Hb+Runt+), but the reason remains unclear.

      (2) Fd4 and Fd5 are shown to be partially redundant, as Fd4 loss of function alone does not alter the number of Eve+ and Dbx+ neurons. This information is critical and should be included in Figure 3.

      Because every hemisegment in an fd4 single mutant is normal, we just added it as the following text: “In fd4 mutants, we observe no change in the number of Eve+ neurons or Dbx+ neurons (n=40 hemisegments).”

      (3) Several observations suggest that lineage identity maintenance involves both Fd4-dependent and Fd4-independent mechanisms. In particular, the fact that fd4-Gal4 reporter remains active in fd4/fd5 mutants even after Vnd and En disappear indicates that Fd4's own expression, a key feature of NB7-1 identity, is maintained independently of Fd4 protein. This raises questions about what proportion of lineage identity features require Fd4 versus other maintenance mechanisms, which deserves discussion.

      We agree, thanks for raising this point. We have added the following text to the Discussion: “Interestingly, the fd4 fd5 mutant maintains expression of fd4:gal4, suggesting that the fd4/fd5 locus may have established a chromatin state that allows “permanent” expression in the absence of Vnd, En, and Fd4/Fd5 proteins.”

      (4) Similarly, while gain of Fd4 induces NB7-1 lineage markers and dorsal muscle innervation in NB5-6 and NB7-3 lineages, drivers for the two lineages remain active despite the loss of molecular markers, indicating some regulatory elements retain activity consistent with their original lineage identity. It is therefore important to understand the degree of functional conversion in the gain-of-function experiments. Sparse labeling of Fd4 overexpressing NB5-6 and NB7-3 progenies, as was done in Seroka and Doe (2019), would be an option.

      We agree it is interesting that the NB7-3 and NB5-6 drivers remain on following Fd4 misexpression. To explore this, we used sca-gal4 to overexpress Fd4 and observed that Lbe expression persisted while Eg was largely repressed (see Author response image 1 below). The results show that Lbe and Eg respond differently to Fd4. A non-mutually exclusive possibility is that the continued expression of lbe-Gal4 UAS-GFP or eg-Gal4 UAS-GFP may be due to the lengthy perdurance of both Gal4 and GFP.

      Author response image 1.

      (5) The less-penetrant induction of Dbx+ neurons in NB5-6 with Fd4-overexpression is interesting. It might be worth the authors discussing whether it is an Fd4 feature or an NB5-6 feature by examining Dbx+ neuron number in NB7-3 with Fd4-overexpression.

      In the NB7-3 lineages misexpressing Fd4, only 5 lineages generated Dbx+ cells (0.1±0.4, n=64 hemisegments), suggesting that the low penetrance of Dbx+ induction is an intrinsic feature of Fd4 rather than lineage context. We have added this information in the results section. 

      (6) It is logical to hypothesize that spatial factors specify early-born neurons directly, so only late-born neurons require Fd4, but it was not tested. The model would be strengthened by examining whether Fd4-Gal4-driven Vnd rescues the generation of later-born neurons in fd4/fd5 mutants.

      When we used the en-gal4 driver to express UAS-vnd in the fd4/fd5 mutant background, we found an average of 7.4±2.2 Eve+ cells per hemisegment (n=36), significantly higher than in the fd4/fd5 mutant alone (3.9±0.8 cells, n=52, p=2.6x10<sup>-11</sup>) (Figure 3J). In addition, 0.2±0.5 Eve+ cells were ectopic Hb+ (excluding U1/U2), indicating that Vnd-En integration is sufficient to generate both early-born and late-born Eve+ cells in the fd4/fd5 mutants. We have added the results to the text.

      (7) It is mentioned that Fd5 is not sufficient for the NB7-1 lineage identity. The observation is intriguing in how similar regulators serve distinct roles, but the data are not shown. The analysis in Figure 4 should be performed for Fd5 as supplemental information.

      Thanks for the suggestion. Because the results are exactly the same as the wild type, we don’t think it is necessary to provide additional images or analysis as supplemental information.

      Reviewer #2 (Public review):

      Via a detailed expression analysis, they find that Fd4 is selectively expressed in embryonic NB7-1 and newly born neurons within this lineage. They also undertake a comprehensive genetic analysis to provide evidence that fd4 is necessary and sufficient for the identity of NB7-1 progeny. 

      Thanks for the accurate summary!

      The analysis is both careful and rigorous, and the findings are of interest to developmental neurobiologists interested in molecular mechanisms underlying the generation of neuronal diversity. Great care was taken to make the figures clear and accessible. This work takes great advantage of years of painstaking descriptive work that has mapped embryonic neuroblast lineages in Drosophila. 

      Thanks for the positive comments!

      The argument that Fd4 is necessary for NB7-1 lineage identity is based on a Fd4/Fd5 double mutant. Loss of fd4 alone did not alter the number of NB7-1-derived Eve+ or Dbx+ neurons. The authors clearly demonstrate redundancy between fd4 and fd5, and the fact that the LOF analysis is based on a double mutant should be better woven through the text.

      The authors generated an Fd5 mutant. I assume that Fd5 single mutants do not display NB7-1 lineage defects, but this is not stated. The focus on Fd4 over Fd5 is based on its highly specific expression profile and the dramatic misexpression phenotypes. But the LOF analysis demonstrates redundancy, and the conclusions in the abstract and through the results should reflect the existence of Fd5 in the conclusions of this manuscript.

      We agree, and have added new text to clarify the single mutant phenotypes (there are none) and the double mutant phenotype (loss of NB7-1 molecular and morphological features). The following text is added to the manuscript: “Not surprisingly, we found that fd4 single mutants or fd5 single mutants had no phenotype (Eve+ neurons were all normal). Thus, to assess their roles, we generated a fd4 and fd5 double mutant. Because many Eve+ and Dbx+ cells are generated outside of NB7-1 lineage, it was also essential to identify the Eve+ or Dbx+ cells within NB7-1 lineage in wild type and fd4 mutant embryos. To achieve this, we replaced the open reading frame of fd4 with gal4 (called fd4-gal4) (see Methods); this stock simultaneously knocked out both fd4 and fd5 (called fd4/fd5 mutant hereafter) while specifically labeling the NB7-1 lineage. For the remainder of this paper we use the fd4/fd5 double mutant to assay for loss of function phenotypes.”

      It is notable that Fd4 overexpression can rewire motor circuits. This analysis adds another dimension to the changes in transcription factor expression and, importantly, demonstrates functional consequences. Could the authors test whether U4 and U5 motor axon targeting changes in the fd4/fd5 double mutant? To strengthen claims regarding the importance of fd4/fd5 for lineage identity, it would help to address terminal features of U motorneuron identity in the LOF condition.

      Thanks for raising this important point. We examined the axon targeting on body wall muscles in both wild type and in fd4/fd5 mutant background and added the results in Figure 3-figure supplement 2. We found that the axon targeting in the late-born neuron region (LL1) is significantly reduced, suggesting that the loss of late-born neurons in fd4/fd5 mutant leads to the absence of innervation of corresponding muscle targets.

      Reviewer #3 (Public review):

      The goal of the work is to establish the linkage between the spatial transcription factors (STFs) that function transiently to establish the identities of the individual NBs and the terminal selector genes (typically homeodomain genes) that appear in the newborn postmitotic neurons. How is the identity of the NB maintained and carried forward after the spatial genes have faded away? Focusing on a single neuroblast (NB 7-1), the authors present evidence that the fork-head transcription factor, fd4, provides a bridge linking the transient spatial cues that initially specified neuroblast identity with the terminal selector genes that establish and maintain the identity of the stem cell's progeny. 

      Thanks for the positive comments!

      The study is systematic, concise, and takes full advantage of 40+ years of work on the molecular players that establish neuronal identities in the Drosophila CNS. In the embryonic VNC, fd4 is expressed only in the NB 7-1 and its lineage. They show that Fd4 appears in the NB while the latter is still expressing the Spatial Transcription Factors and continues after the expression of the latter fades out. Fd4 is maintained through the early life of the neuronal progeny but then declines as the neurons turn on their terminal selector genes. Hence, fd4 expression is compatible with it being a bridging factor between the two sets of genes. 

      Thanks for the accurate summary!

      Experimental support for the "bridging" role of Fd4 comes from a set of loss-of-function and gain-of-function manipulations. The loss of function of Fd4, and the partially redundant gene Fd5, from lineage 7-1 does not affect the size of the lineage, but terminal markers of late-born neuronal phenotypes, like Eve and Dbx, are reduced or missing. By contrast, ectopic expression of fd4, but not fd5, results in ectopic expression of the terminal markers eve and Dbx throughout diverse VNC lineages.

      Thanks for the accurate summary!

      A detailed test of fd4's expression was then carried out using lineages 7-3 and 5-6, two well-characterized lineages in Drosophila. Lineage 7-3 is much smaller than 7-1 and continues to be so when subjected to fd4 misexpression. However, under the influence of ectopic Fd4 expression, the lineage 7-3 neurons lost their expected serotonin and corazonin expression and showed Eve expression as well as motoneuron phenotypes that partially mimic the U motoneurons of lineage 7-1.

      Thanks for the positive comments!

      Ectopic expression of Fd4 also produced changes in the 5-6 lineage. Expression of apterous, a feature of lineage 5-6, was suppressed, and expression of the 7-1 marker, Eve, was evident. Dbx expression was also evident in the transformed 5-6 lineages, but extremely restricted as compared to a normal 7-1 lineage. Considering the partial redundancy of fd4 and fd5, it would have been interesting to express both genes in the 5-6 lineage. The anatomical changes that are exhibited by motoneurons in response to Fd4 expression confirm that these cells do, indeed, show a shift in their cellular identity.

      We appreciate the positive comments. We agree double misexpression of Fd4 and Fd5 might give a stronger phenotype (as the reviewer says) but the lack of this experiment does not change the conclusions that Fd4 can promote NB7-1 molecular and morphological aspects at the expense of NB5-6 molecular markers.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      The study introduces an open-source, cost-effective method for automating the quantification of male social behaviors in Drosophila melanogaster. It combines machine-learning-based behavioral classifiers developed using JAABA (Janelia Automatic Animal Behavior Annotator) with inexpensive hardware constructed from off-the-shelf components. This approach addresses the limitations of existing methods, which often require expensive hardware and specialized setups. The authors demonstrate that their new "DANCE" classifiers accurately identify aggression (lunges) and courtship behaviors (wing extension, following, circling, attempted copulation, and copulation), closely matching manually annotated ground-truth data. Furthermore, DANCE classifiers outperform existing rule-based methods in accuracy. Finally, the study shows that DANCE classifiers perform as well when used with low-cost experimental hardware as with standard experimental setups across multiple paradigms, including RNAi knockdown of the neuropeptide Dsk and optogenetic silencing of dopaminergic neurons.

      The authors make creative use of existing resources and technology to develop an inexpensive, flexible, and robust experimental tool for the quantitative analysis of Drosophila behavior. A key strength of this work is the thorough benchmarking of both the behavioral classifiers and the experimental hardware against existing methods. In particular, the direct comparison of their low-cost experimental system with established systems across different experimental paradigms is compelling.

      While JAABA-based classifiers have been previously used to analyze aggression and courtship (Tao et al., J. Neurosci., 2024; Sten et al., Cell, 2023; Chiu et al., Cell, 2021; Ishii et al., eLife, 2020; Duistermars et al., Neuron, 2018), the demonstration that they work as well without expensive experimental hardware opens the door to more low-cost systems for quantitative behavior analysis.

      We thank the reviewer for their positive assessment and constructive suggestions. We have cited these additional JAABA studies in the Introduction. We clarified that several prior JAABA-based classifiers were developed using specialized machine-vision cameras or custom setups, and that in some cases the original code and classifiers were not made publicly available, which limits reproducibility and wider adoption. To address this, we explicitly note in the revised manuscript that DANCE was developed with accessibility in mind.

      Although the study provides a detailed evaluation of DANCE classifier performance, its conclusions would be strengthened by a more comprehensive analysis. The authors assess classifier accuracy using a bout-level comparison rather than a frame-level analysis, as employed in previous studies (Kabra et al., Nat Methods, 2013). They define a true positive as any instance where a DANCE-detected bout overlaps with a manually annotated ground-truth bout by at least one frame. This criterion may inflate true positive rates and underestimate false positives, particularly for longer-duration courtship behaviors. For example, a 15-second DANCE-classified wing extension bout that overlaps with ground truth for only one frame would still be considered a true positive. A frame-level performance analysis would help address this possibility.

      We thank the reviewer for raising this important point. Our original use of bout-level analysis followed existing literature (Duistermars et al., 2018; Ishii et al., 2020; Chiu et al., 2021; Tao et al., 2024; Hindmarsh Sten et al., 2025). While our lunge classifier already operates at the frame level, we have now performed additional frame-level evaluations for the duration-based courtship classifiers. These analyses revealed only minor differences in precision, recall, and F1 scores compared with the original bout-level approach (see new Figure 5—Figure Supplement 3). Details of this analysis are now included in the Materials and Methods.
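
      To make the distinction between the two scoring schemes concrete, the following is a minimal sketch, not the code used in the manuscript. It assumes per-frame binary labels (1 = behavior present) for both the classifier output and the manual annotation, and it uses the "overlap by at least one frame" criterion described by the reviewer for bout-level matching. All function names are hypothetical.

```python
from typing import List, Tuple

Bout = Tuple[int, int]  # (start_frame, end_frame), end is exclusive


def frames_to_bouts(labels: List[int]) -> List[Bout]:
    """Collapse a per-frame 0/1 sequence into contiguous (start, end) bouts."""
    bouts, start = [], None
    for i, v in enumerate(labels):
        if v and start is None:
            start = i
        elif not v and start is not None:
            bouts.append((start, i))
            start = None
    if start is not None:
        bouts.append((start, len(labels)))
    return bouts


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0 when both are 0."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def frame_level(pred: List[int], truth: List[int]) -> Tuple[float, float, float]:
    """Score every frame independently."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1(precision, recall)


def overlaps(a: Bout, b: Bout) -> bool:
    """True if two half-open intervals share at least one frame."""
    return a[0] < b[1] and b[0] < a[1]


def bout_level(pred: List[int], truth: List[int]) -> Tuple[float, float, float]:
    """Score bouts; a bout counts as matched if it overlaps by >= 1 frame."""
    pb, tb = frames_to_bouts(pred), frames_to_bouts(truth)
    matched_pred = sum(any(overlaps(p, t) for t in tb) for p in pb)
    matched_truth = sum(any(overlaps(t, p) for p in pb) for t in tb)
    precision = matched_pred / len(pb) if pb else 0.0
    recall = matched_truth / len(tb) if tb else 0.0
    return precision, recall, f1(precision, recall)


# Toy example: a 10-frame predicted bout that touches a 3-frame true bout
# for a single frame (frame 9).
truth = [0] * 9 + [1] * 3 + [0] * 3
pred = [1] * 10 + [0] * 5
```

      In this toy case the bout-level score is perfect because the single predicted bout touches the single annotated bout, whereas the frame-level precision is 1/10 and the recall is 1/3. This is the kind of inflation the reviewer describes; it would be most pronounced for long bouts with poor temporal alignment.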

      In summary, this work provides a practical and accessible approach to quantifying Drosophila behavior, reducing the economic barriers to the study of the neural and molecular mechanisms underlying social behavior.

      We thank the reviewer for their encouraging comments and for recognizing the accessibility and practical value of our approach. We appreciate the constructive suggestions, which have helped strengthen the manuscript.

      Reviewer #2 (Public review):

      Summary:

      This manuscript addresses the development of a low-cost behavioural setup and standardised open-source high-performing classifiers for aggression and courtship behaviour. It does so by using readily available laboratory equipment and previously developed software packages. By comparing the performance of the setup and the classifiers to previously developed ones, this study shows the classifiers' superior performance and the reliability of the low-cost setup in recapitulating previously described effects of different manipulations on aggression and courtship.

      Strengths:

      The newly developed classifiers for lunges, wing extension, attempted copulation, copulation, following, and circling perform better than previously developed ones. The behavioural setup developed is low cost and reliably allows analysis of both aggression and courtship behaviour, validated through social experience manipulation (social isolation), gene knockdown (Dsk in Dilp2 neurons) and neuronal inactivation (dopaminergic neurons) known to affect courtship and aggression.

      We thank the reviewer for the clear summary of our work and for highlighting its strengths. We appreciate these positive comments and suggestions, which have helped improve the clarity of the manuscript.

      Weaknesses:

      Aggression encompasses multiple defined behaviours, yet only lunges were analysed. Moreover, the CADABRA software to which DANCE was compared analyses further aggression behaviours, making their comparisons incomplete. In addition, though DANCE performs better than CADABRA and Divider in classifying lunges in the behavioural setup tested, it did not yield very high recall and F1 scores.

      We thank the reviewer for raising this important point. We focused on lunges because they are widely used as a standard proxy for male aggression across multiple laboratories (Agrawal et al., 2020; Asahina et al., 2014; Chiu et al., 2021; Chowdhury et al., 2021; Dierick et al., 2007; Hoyer et al., 2008; Jung et al., 2020; Nilsen et al., 2004; Watanabe et al., 2017). As noted in the Discussion, our study also provides a template for the future development of additional aggression classifiers (fencing, wing flick, tussle, chase, female headbutt) and courtship classifiers (tapping, licking, rejection), which can be trained and shared through the same DANCE framework. Developing and validating these was beyond the scope of the present work.

      To address the concern regarding precision, recall, and F1 scores, we performed additional analyses across all training videos and compiled these results in the new Figure 2—Figure Supplement 2. Our earlier lunge classifier was evaluated after training on a total of 11 videos. The new analysis reports performance metrics for classifiers trained on four independent datasets (Videos 8–11). We found that the classifier trained on nine videos provided the best balance of precision, recall, and F1 (78.73%, 73.07%, and 75.79%, respectively), which was slightly better than the earlier classifier. We therefore updated the main figure, text, and Materials and Methods to use this version and uploaded the corresponding classifier and training details to the GitHub repository.
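
      As a quick consistency check on the reported values, F1 is the harmonic mean of precision and recall. The short Python sketch below recomputes it from the two percentages quoted above; the function name is ours and is not part of the DANCE code base.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (output is in the same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) quoted for the classifier trained on nine videos.
print(round(f1_score(78.73, 73.07), 2))  # 75.79
```

      The recomputed value agrees with the F1 score stated in the response.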

      DANCE is of limited use for neuronal circuit-level enquiries, since mechanisms for intensity and temporally controlled optogenetic manipulations, which are nowadays possible with open-source software and low-cost hardware, were not embedded in its development.

      We thank the reviewer for this valuable point. The primary aim of DANCE is to provide an accessible, modular, and low-cost behavioural recording and analysis platform. It was designed so that users can readily integrate additional components such as optogenetic control when needed. As a proof of concept, we implemented optogenetic silencing of dopaminergic neurons using the DANCE hardware and confirmed that this manipulation increased aggression (Figure 7R). 

      To facilitate adoption, we now provide schematic diagrams, LED control code, and instructions on our GitHub page and setup photographs in the manuscript (see new Figure 7—Figure Supplement 1). The released code allows programmable timing and intensity control, enabling users to reproduce temporally precise optogenetic protocols or extend the system for other stimulation paradigms.
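
      The released LED control code is not reproduced here. As a rough illustration of what programmable timing and intensity control involves, the Python sketch below builds a pulse schedule for one stimulation train. The function name, its parameters, and the 8-bit (0 to 255) intensity scale are assumptions made for this example only.

```python
def pulse_schedule(start_s, train_s, freq_hz, pulse_ms, intensity):
    """Return a list of (on_time_s, off_time_s, pwm_level) tuples for one pulse train.

    Hypothetical helper: intensity is a fraction in [0, 1] that is mapped
    onto an assumed 8-bit PWM range (0-255).
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie between 0 and 1")
    period_s = 1.0 / freq_hz
    width_s = pulse_ms / 1000.0
    if width_s > period_s:
        raise ValueError("pulse width exceeds the pulse period")
    pwm_level = round(intensity * 255)
    n_pulses = int(train_s * freq_hz)
    return [
        (start_s + i * period_s, start_s + i * period_s + width_s, pwm_level)
        for i in range(n_pulses)
    ]


# Example: a 2 s train of 100 ms pulses at 5 Hz, starting 60 s into the recording.
schedule = pulse_schedule(start_s=60, train_s=2, freq_hz=5, pulse_ms=100, intensity=0.5)
print(len(schedule))  # 10
```

      A schedule of this kind could be sent to a microcontroller or used to annotate stimulation epochs in the recorded video.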

      Reviewer #3 (Public review):

      The preprint by Yadav et al. describes a new setup to quantify a number of aggression and mating behaviors in Drosophila melanogaster. The investigation of these behaviors requires the analysis of a large number of videos to identify each kind of behavior displayed by a fly. Several approaches to automatize this process have been published before, but each of them has its limitations. The authors set out to develop a new setup that includes very low-cost, easy-to-acquire hardware and open-source machine-learning classifiers to identify and quantify the behavior.

      Strengths:

      (1) The study demonstrates that their cheap, simple, and easy-to-obtain hardware works just as well as custom-made, specialized hardware for analyzing aggression and mating behavior. This enables the setup to be used in a wide range of settings, from research with limited resources to classroom teaching.

      (2) The authors used previously published software to train new classifiers for detecting a range of behaviors related to aggression and mating and to make them freely available. The classifiers are very positively benchmarked against a manually acquired ground truth as well as existing algorithms.

      (3) The study demonstrates the applicability of the setup (hardware and classifiers) to common methods in the field by confirming a number of expected phenotypes with their setup.

      We thank the reviewer for the positive assessment of our work and for highlighting its strengths. We appreciate these encouraging comments and suggestions, which have helped improve the clarity and presentation of the manuscript.

      Weaknesses:

      (1) When measuring the performance of the duration-based classifiers, the authors count any bout of behavior as a true positive if it overlaps with a ground-truth positive for only 1 frame, even though the minimal duration of a bout is 10 frames and most bouts are much longer. That way, true positives could contain cases that are almost totally wrong as long as there was an overlap of a single frame. For the mating behaviors that are classified in ongoing bouts, I think performance should be evaluated based on the % of correctly classified frames, not bouts.

      We thank the reviewer for raising this concern. In response to this point, and to Reviewer #1’s similar comment, we performed a frame-level evaluation of all duration-based courtship classifiers. The analysis revealed only minor differences compared with the original bout-level metrics (see new Figure 5—Figure Supplement 3), confirming the robustness of our classifiers. We have also added a description of this analysis in the Materials and Methods section.
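
      To make the distinction between the two scoring schemes concrete, the toy Python sketch below contrasts bout-level precision (a predicted bout counts as correct if it overlaps the ground truth by at least one frame) with frame-level precision. It is an illustration with made-up labels, not the evaluation code used for the manuscript, and all function names are hypothetical.

```python
from typing import List, Tuple


def bouts(labels: List[bool]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for each run of True frames."""
    runs, start = [], None
    for i, on in enumerate(labels):
        if on and start is None:
            start = i
        elif not on and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(labels)))
    return runs


def bout_precision(pred: List[bool], truth: List[bool]) -> float:
    """Fraction of predicted bouts that overlap a ground-truth positive by >= 1 frame."""
    pred_bouts = bouts(pred)
    if not pred_bouts:
        return 0.0
    hits = sum(any(truth[s:e]) for s, e in pred_bouts)
    return hits / len(pred_bouts)


def frame_precision(pred: List[bool], truth: List[bool]) -> float:
    """Fraction of predicted-positive frames that are also positive in the ground truth."""
    true_pos = sum(p and t for p, t in zip(pred, truth))
    n_pred = sum(pred)
    return true_pos / n_pred if n_pred else 0.0


# Made-up example: one 10-frame predicted bout overlapping the truth by a single frame.
pred = [True] * 10 + [False] * 10
truth = [False] * 9 + [True] * 11
print(bout_precision(pred, truth))   # 1.0
print(frame_precision(pred, truth))  # 0.1
```

      The toy case shows the worst-case gap the reviewer describes; the frame-level analysis reported above indicates that the gap was minor in the actual data.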

      (2) In the methods part, only one of the pre-existing algorithms (MateBook) is described. Given that the comparison with those algorithms is such a central part of the manuscript, each of them should be briefly explained and the settings used in this study should be described.

      We thank the reviewer for this helpful suggestion. In the revised manuscript, we expanded the Materials and Methods to include concise descriptions and parameter settings for all pre-existing algorithms used for comparison. This includes dedicated subsections for CADABRA and the Divider assay, with explicit reference to their rule-based or geometric features. For MateBook, we specified the persistence filters used and the adjustments made for fair benchmarking. These changes ensure transparency and reproducibility.

      Taken together, this work can greatly facilitate research on aggression and mating in Drosophila. The combination of low-cost, off-the-shelf hardware and open-source, robust software enables researchers with very little funding or technical expertise to contribute to the scientific process and also allows large-scale experiments, for example in classroom teaching with many students, or for systematic screenings.

      We thank the reviewer for the encouraging comments and for recognizing the accessibility and broad applicability of DANCE. We believe these revisions have further strengthened the manuscript.

      Reviewer #1 (Recommendations for the authors):

      The following comments highlight areas where additional context, clarification, or further analysis could strengthen the manuscript. I hope these suggestions will be useful in refining your work.

      (1) Lines 71-73: The authors state that Ctrax "leads to frequent identity switches among tracked flies, which is not the case while using FlyTracker." However, Ctrax was specifically designed to minimize identity errors, and Kabra et al. (2013) reported a low frequency of such errors-approximately one per five fly-hours in 10-fly videos. In contrast, Caltech FlyTracker does not correct identity errors automatically, requiring manual corrections, as noted in the Methods section of this study. If this is not an oversight, please provide further context to clarify this distinction.

      We thank the reviewer for raising this clarification. As reported by Bentzur et al. (2021), when groups of flies were tracked simultaneously, Ctrax often generated multiple identities for the same individual, sometimes producing more trajectories than the actual number of flies. To prevent ambiguity, we revised the text to read: “While both Ctrax and FlyTracker (Eyjolfsdottir et al., 2014) may produce identity switches, when groups of flies were tracked simultaneously, Ctrax led to inaccuracies that required manual correction using specialized algorithms such as FixTrax (Bentzur et al., 2021).” We also quantified FlyTracker identity-switch rates in our datasets and report them in new Supplementary File 5, confirming that such events were rare (< 2% of tracked intervals). We believe this updated version provides the necessary context and ensures accuracy in describing each tracker’s limitations.

      (2) Line 85: Providing additional context on how this study builds on previous work using JAABA-based classifiers for fly social behavior and comparing these classifiers to rule-based methods would more accurately situate it within the field. The authors state that "recently, a few JAABA-based classifiers have been developed for measuring aggression and courtship" and cite four related studies. However, this statement seems to underrepresent the use of JAABA-based classifiers for quantifying fly social behavior, which has become common in the field. Several additional studies (as noted in the public review) have developed JAABA-based classifiers for scoring aggression or courtship. Furthermore, other studies have compared the performance of JAABA-based classifiers with rule-based classifiers like CADABRA (e.g., Chowdhury et al., Comm Biology 2021; Leng et al., PlosOne 2020; Kabra et al., Nat Methods 2013). Mentioning the similar findings in those studies and your own helps strengthen the conclusion that machine-learning-based classifiers outperform rule-based classifiers in several experimental contexts.

      We thank the reviewer for this helpful suggestion. We have revised the Introduction to include additional references to studies that applied JAABA-based classifiers for aggression and courtship and made textual edits to reflect this. We further noted that, unlike several previous studies, all DANCE classifiers and analysis code are publicly available.

      Reviewer #2 (Recommendations for the authors):

      (1) Suggestions for improved or additional experiments, data or analyses: As mentioned in the description of the effect of optogenetic inactivation of dopaminergic neurons, in the conclusion and also reported in the literature, there are other important identified aggression behaviours, such as fencing, wing flick, tussle, and chase. Similarly, for courtship, tapping and licking have also been defined. This study, as opposed to proposed future studies, would benefit from creating open-source classifiers for these established behaviours, which are important for the analysis of aggression and courtship.

      We thank the reviewer for this valuable suggestion. As clarified in the Discussion, this manuscript intentionally focuses on six core, well-validated aggression and courtship behaviors to demonstrate DANCE’s modularity and reproducibility. Developing additional classifiers such as fencing, wing flick, tussle, chase, tapping, and licking would require extensive annotation and validation beyond the present scope. To address this point, we explicitly note in the revised text that the DANCE pipeline is readily extendable, allowing the community to build new classifiers within the same framework.

      In terms of observer bias assessment for ground-truthing in courtship, this was only presented for circling and it would be beneficial to have encompassed all behaviours analysed.

      We thank the reviewer for this suggestion. Observer-bias comparisons for all six classifiers are presented in Figure 2—Figure Supplement 1 (panels A–F). We clarified in the Results that annotations from two independent evaluators were compared for all classifiers, with no significant differences observed, confirming their robustness.

      Finally, intensity and temporal optogenetic control are important for neuronal circuit analysis of underlying behaviour. The authors could embed this aspect in DANCE by integrating control of the green light LED strip used in this study using, for example, the open-source visual reactive programming software Bonsai (Lopes et al., 2015) and open-source electronics platform Arduino. This is an important and valuable addition in line with maintaining low cost.

      We thank the reviewer for this valuable suggestion. DANCE was designed to be modular, allowing integration of temporal optogenetic control. To support immediate adoption, we now provide Arduino LED control code, setup schematics, and photographs (new Figure 7—Figure Supplement 1) along with step-by-step instructions on our GitHub page. We also note that Bonsai and Arduino frameworks are compatible with DANCE, enabling future extensions for closed-loop or behavior-triggered stimulation.

      (2) Minor corrections to the text and figures:

      Figure Supplement 1 refers only to Figure 2, yet panels D-F refer to the behaviour circling in courtship and therefore should be assigned to the respective figure.

      Thanks, we have corrected this.

      In lines 315-316, the cumbersome task of fluon coating for aggression assays seems to be ubiquitous across assays which is not the case, and therefore the sentence should include the word 'some'.

      Thanks, we have edited this.

      The cost of the phone and/or tablet should be included in the DANCE setup costs, as presumably these devices will be dedicated to the behavioural studies, for consistency purposes.

      We thank the reviewer for this comment. We intentionally did not include smartphones or tablets in the setup cost because, in our experiments, these devices were not dedicated exclusively to DANCE but were repurposed from routine personal use. Our aim was to leverage readily available consumer electronics so that their cost does not become a barrier to adoption. We confirmed that commonly available Android phones capable of 30 fps at 1080p in H.264 format, as well as tablets or phones running a simple white-screen light app, are sufficient for reliable behavior classification and illumination. Since these devices can be returned to regular use after recordings, including their cost in the setup would not accurately reflect the intended accessibility of DANCE. For consistency, we now clarify in the Materials and Methods that such devices should be placed in airplane mode during recordings.

      Reviewer #3 (Recommendations for the authors):

      (1) For my taste, the authors put too much emphasis on the point that their method outperforms existing methods. I understand the value in comparing to published methods and it is of course fully justified to state the advantages of the new method. But the whole preprint is set up as a competition with the old algorithms, and the conclusion that the new classifier is better is repeated in each figure caption and after each paragraph of the results. This competitive mindset also extends to the selection of which results are presented as main figures and which as supplements - all cases in which the previous methods actually perform well are only presented in the supplement. I think this is simply unnecessary as the authors' results speak for themselves, and do not need the continuous competitive comparison.

      We thank the reviewer for this thoughtful suggestion. Our intention was to benchmark DANCE rigorously against existing methods, not to frame the study competitively. We agree that repeated emphasis on relative performance was unnecessary. In the revised version, we streamlined figure captions and text throughout the manuscript to balance comparisons and removed redundant phrasing. Instances where other methods performed well are now presented with equal clarity to maintain a neutral and informative tone.

      (2) When describing the DANCE hardware, as a reader I would find it interesting to also read about potential issues that the authors encountered. For example, how difficult is it to handle the materials without breaking or deforming them, which could affect the behavioral assays? How critical is it to use specific blister packs - the availability of which will likely vary strongly between countries? Did the authors try different sizes, and products? Such information, even as a supplement, could be very helpful for the widespread use of the hardware.

      We thank the reviewer for this important point. To address this, we conducted additional tests comparing DANCE arenas of different diameters (new Figure 7—Figure Supplement 3A–C and new Figure 7—Figure Supplement 4A–L). We also consulted colleagues in multiple countries and verified that the blister packs used in our assays are readily available. The Materials and Methods now include practical handling notes: blister foils can be reused ~30–40 times for aggression assays and ~10–15 times for courtship assays before deformation. We also describe how to prevent agar surface damage during assembly and how to wash and dry the arenas for optimal reusability.

      (3) I find the arrows pointing to several videos in a number of figures rather distracting and redundant, and suggest omitting them.

      Thanks, we have omitted these arrows from all relevant figures and clarified the figure legends to enhance readability.

      (4) P8, line 169 ff: this is a very long sentence that should be separated into several sentences.

      We have rewritten this as follows: “DANCE scores remained comparable to ground-truth scores across all categories, whereas CADABRA and Divider underestimated the lunge counts (Figure 2B–E). Correlation analysis revealed a strong relationship between DANCE and ground-truth scores (Figure 2F, Supplementary File 2). In comparison, CADABRA and the Divider assay classifier showed a weaker correlation (Figure 2G–H, Supplementary File 2).”

      (5) P10, line 216: please explain, here and in the methods, how these behavioral indices are calculated. I did not find this information anywhere in the paper.

      We thank the reviewer for pointing this out. We now define the behavioral index explicitly in Materials and Methods: “For each assay, a behavioral index was calculated as the proportion of frames in which the male engaged in the specified behavior. This was obtained by dividing the total number of frames annotated for that behavior by the total number of frames in the recording.”
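
      The definition quoted above amounts to a single ratio. A minimal Python sketch, assuming per-frame boolean annotations (the function name and the example numbers are hypothetical):

```python
def behavioral_index(frame_labels):
    """Proportion of frames in a recording that are annotated as the behavior."""
    total_frames = len(frame_labels)
    if total_frames == 0:
        raise ValueError("recording contains no frames")
    behavior_frames = sum(1 for label in frame_labels if label)
    return behavior_frames / total_frames


# Made-up example: 225 annotated frames in a 900-frame recording (30 s at 30 fps).
labels = [True] * 225 + [False] * 675
print(behavioral_index(labels))  # 0.25
```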

      (6) P11, line 253: I don't understand the modifications to MateBook regarding attempted copulations, neither in the results nor the methods section. I would ask the authors to explain more explicitly what was done.

      We thank the reviewer for this helpful suggestion. We have rewritten several parts of the Materials and Methods to clarify these details and streamline the text. To train the attempted copulation classifier, we combined datasets from assays with mated and decapitated virgin females, using manual annotations as ground truth. We also adapted MateBook’s persistence filters (Ribeiro et al., 2018) and defined thresholds explicitly: mounting lasting >45 s (>1350 frames at 30 fps) was defined as copulation, whereas abdominal curling without mounting, or mounting lasting 0.33–45 s, was defined as attempted copulation.
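
      The duration thresholds stated above can be written as a simple rule. The Python sketch below covers only the mounting-duration part of the definition (abdominal curling without mounting is not modeled), and it assumes that the 0.33 s and 45 s boundaries are inclusive for attempted copulation and that shorter events are left unclassified. The function name is hypothetical.

```python
FPS = 30  # frame rate stated in the response


def classify_mounting(duration_frames, fps=FPS):
    """Label a mounting event by its duration, following the stated thresholds.

    Returns "copulation" for mounting longer than 45 s, "attempted copulation"
    for mounting lasting 0.33-45 s, and None for shorter events (an assumption).
    """
    seconds = duration_frames / fps
    if seconds > 45:
        return "copulation"
    if seconds >= 0.33:
        return "attempted copulation"
    return None


print(classify_mounting(1351))  # copulation
print(classify_mounting(1350))  # attempted copulation
print(classify_mounting(9))     # None
```

      At 30 fps, 45 s corresponds to 1350 frames and 0.33 s to roughly 10 frames, which matches the frame counts given in the response.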

      (7) Figure 7F: this is the only case with a significant difference between the two setups. What explanations do the authors have for the discrepancy?

      We thank the reviewer for raising this point. After repeating the experiments, we no longer found a significant difference between the setups. Figure 7 and its legend have been updated to reflect these results.

      (8) Figure 2 - Supplement 1: I do not understand why the boxes for Observer 1 have different colors in different figures. Does this have a meaning?

      Thanks for pointing this out. The color differences had no intended meaning, and we have corrected the figure for consistency across panels.

      (9) P22, line 517ff: It would be interesting to know how frequently identity switches occurred. For large-scale, automatic behavioral screenings that step could be a crucial bottleneck.

      We thank the reviewer for this valuable suggestion. We analyzed identity switches using the FlyTracker “Visualizer” package, which flags frames with possible overlaps or jumps. Flagged intervals were manually verified, and we report these data in new Supplementary File 5. Identity switch rates were very low: 0.66% for high-resolution recordings and 1.9% for smartphone DANCE videos in the most challenging decapitated-virgin dataset. These findings demonstrate robust tracking performance under both setups.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      Biomolecular condensates are an essential part of cellular homeostatic regulation. In this manuscript, the authors develop a theoretical framework for the phase separation of membrane-bound proteins. They show the effect of non-dilute surface binding and phase separation on tight junction protein organization. 

      Strengths: 

      It is an important study, considering that the phase separation of membrane-bound molecules is taking the center stage of signaling, spanning from immune signaling to cell-cell adhesion. A theoretical framework will help biologists to quantitatively interpret their findings. 

      Weaknesses: 

      Understandably, the authors used one system to test their theory (ZO-1). However, to establish a theoretical framework, this is sufficient. 

      We acknowledge this limitation. While we agree that additional systems would strengthen the generality of our theory, we note that the focus of this work is to introduce and validate a theoretical framework. As the reviewer notes, this is sufficient for establishing the framework. Nonetheless, we are open to further collaborations or future studies to test the model with other systems.

      Reviewer #2 (Public review): 

      Summary: 

      The authors present a clear expansion of biophysical (thermodynamic) theory regarding the binding of proteins to membrane-bound receptors, accounting for higher local concentration effects of the protein. To partially test the expanded theory, the authors perform in vitro experiments on the binding of ZO1 proteins to Claudin2 C-terminal receptors anchored to a supported lipid bilayer, and capture the effects that surface phase separation of ZO1 has on its adsorption to the membrane. 

      Strengths: 

      (1) The derived theoretical framework is consistent and largely well-explained. 

      (2) The experimental and numerical methodologies are transparent. 

      (3) The comparison between the best parameterized non-dilute theory is in reasonable agreement with experiments. 

      Weaknesses: 

      (1) In the theoretical section, what has previously been known, compared to which equations are new, should be made more clear. 

      We have revised the theory section to clearly distinguish previously established formulations from novel contributions following equation (4), which is .

      (2) Some assumptions in the model are made purely for convenience and without sufficient accompanying physical justification. E.g., the authors should justify, on physical grounds, why binding rate effects are/could be larger than the other fluxes. 

      For our problem, binding is relevant together with diffusive transport in each phase. Each process is accompanied by kinetic coefficients that we estimate for the experimental system. For the considered biological systems (and related ones), it is difficult to determine whether other fluxes (see, e.g., Eq. 8(e)) have relaxed or not. We note that their effects are, of course, included in the kinetic model applied to the coarsening of ZO1 surface condensates as boundary conditions. But we cannot exclude that the corresponding kinetic coefficient in the actual biological system is large enough such that, e.g., Eq. (9e) does not vanish to zero “quasi-statically”. We have now added a sentence to the outlook highlighting the relevance of testing those flux-force relationships in biological systems. 

      (3) I feel that further mechanistic explanation as to why bulk phase separation widens the regime of surface phase separation is warranted.  

      We have discussed the mechanistic explanation related to bulk protein interaction strength in the manuscript in the section: “Effects of binding affinity and interactions on surface phase separation”. We explained how the bulk interaction parameter affects the binding equilibrium. 

      (4) The major advantage of the non-dilute theory as compared with a best parameterized dilute (or homogenous) theory requires further clarification/evidence with respect to capturing the experimental data. 

      We thank the reviewer for this helpful question. To address this point, we have added new paragraphs in the conclusion section, which explicitly discuss the necessity of employing the non-dilute theory for interpreting the experimental data.

      (5) Discrete (particle-based) molecular modelling could help to delineate the quantitative improvements that the non-dilute theory has over the previous state-of-the-art. Also, this could help test theoretical statements regarding the roles of bulk-phase separation, which were not explored experimentally.  

      We appreciate the suggestion and agree that such modeling would be valuable. However, this is beyond the scope of the current study. 

      (6) Discussion of the caveats and limitations of the theory and modelling is missing from the text. 

      We sincerely appreciate the reviewer’s helpful comment. We have added a discussion in the conclusion section outlining the caveats and limitations of our modeling approach.

      Reviewing Editor Comments: 

      Upon discussing with the reviewers, we feel that this manuscript could significantly be improved if testing the model with a different model system (beyond ZO1/tight junctions), in which case we foresee that we could enhance the strength of evidence from "compelling" to "exceptional". But of course, this is up to the authors to go for it or not, the paper is already very good. 

      Reviewer #2 (Recommendations for the authors): 

      (1) Lines 132-134: Re-word, the use of "complex" is confusing.

      We have rephrased the sentence for clarity. The revised version reads: “ṽ<sub>𝑃𝑅</sub> are the molecular volume and area of the protein-receptor complex ѵ<sub>𝑃𝑅</sub>, respectively”, and the changes have been made in the revised manuscript.

      (2) Line 154 use of "\nu" for volume and area could be avoided for better clarity.

      We thank the reviewer for this helpful suggestion. We have removed the statement involving "\nu" as these quantities have already been defined in the preceding context.

      (3) Line 158 the total "Helmholtz" free energy F... 

      We have added the word "Helmholtz" to the sentence.

      (4) Line 160 typo "In specific,..." 

      We carefully checked this sentence but could not identify a typo.  

      (5) For equation 5 explain the physical origins of each term, or provide a reference if this equation is explained elsewhere. 

      Thank you very much for your valuable suggestions. We have carefully rephrased Equation (5) and added a paragraph immediately afterward to provide a detailed explanation of its physical meaning.

      (6) Derivation on lines 163-174 is poorly written. Make the logical flow between the equations clearer. 

      We greatly appreciate your insightful suggestions. Equation (6) has been carefully revised for clarity, and the explanation has been rewritten to ensure better readability. All modifications are done.

      (7) Define bold "t" in Equation 6. 

      The variable “t” has been explicitly defined in the context for clarity.

      (8) In equations. 7b-7c the nablas (gradients) should be the 2D versions.  

      We have updated the gradient operators in Equations (7b) and (7c) [Eq. (9) in revised manuscript] to their 2D forms for consistency.

      (9) Line 190, avoid referring to the future Equation 14, and state in words what is meant by "thermodynamic equilibrium". 

      We have added an explanation of “thermodynamic equilibrium” and removed the reference to the equation accordingly.

      (10) In Equation 11 you don't explain what you are doing (which is a perturbation around the minimum of the free energy).

      We have revised the paragraph before equation (11) [Eq. (13) in revised manuscript] to clarify that the expression represents a perturbation around the minimum of the free energy.

      (11) In Equation 12, doesn't this also depend on how you have written equation 6 (not just equation 5)?

      Eq. (12) [Eq. (14) in revised manuscript] is derived directly from the variation of the total free energy F. In contrast, Eq. (6) contains the time derivative of free energies that were not written in their final form. In the revised version, we have now given the conjugate forces and fluxes in Eqs. (7) and (8) for clarity.

      (12) Line 206 specify the threshold of local concentration (or provide a reference). 

      We have specified the threshold of local concentration in the revised text, and the corresponding statement has been highlighted.

      (13) Line 223 is the deviation from ideality captured in a pair-wise fashion? I presume it does not account for N many-body interactions?  

      Yes, our model is formulated within a mean-field framework that incorporates pairwise (second-order) interaction coefficients. For example, 𝜒<sub>𝑃𝑅-𝑅</sub> characterizes the interaction between the complex 𝑃𝑅 and the free receptor 𝑅, 𝜒<sub>𝑅-𝐿</sub> the interaction between the free receptor 𝑅 and the free lipid 𝐿, and 𝜒<sub>𝑃𝑅-𝐿</sub> the interaction between the complex 𝑃𝑅 and the free lipid 𝐿. We have stressed this choice of free energy in the revised manuscript.
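
      To illustrate what a pairwise mean-field interaction term looks like, the Python sketch below sums 𝜒<sub>𝑖-𝑗</sub>𝜙<sub>𝑖</sub>𝜙<sub>𝑗</sub> over unordered pairs of surface components. This Flory-Huggins-like form is an assumption made for illustration; the exact free energy used in the manuscript is not reproduced here, and the function name, area fractions, and coefficient values are hypothetical.

```python
from itertools import combinations


def pairwise_interaction_energy(phi, chi):
    """Sum chi_ij * phi_i * phi_j over all unordered pairs of components.

    phi maps component names to area fractions; chi maps frozenset({i, j})
    to an interaction coefficient. Pairs missing from chi contribute zero.
    """
    return sum(
        chi.get(frozenset((a, b)), 0.0) * phi[a] * phi[b]
        for a, b in combinations(phi, 2)
    )


# Made-up area fractions for the complex PR, free receptor R, and free lipid L.
phi = {"PR": 0.2, "R": 0.3, "L": 0.5}
# Only the PR-L coefficient is non-zero in this toy example.
chi = {frozenset(("PR", "L")): 2.0}
print(pairwise_interaction_energy(phi, chi))  # 0.2
```

      No term in this sum involves three or more components at once, which is what restricting the model to pairwise coefficients means in practice.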

      (14) Line 274, how do the authors know the secondary effects (of which they should mention a few) do not significantly impact the observed behaviour?  

      We sincerely thank the reviewer for the helpful comment. Based on the experimental observations, the parameters 𝜒<sub>𝑅-𝐿</sub> and 𝜒<sub>𝑃𝑅-𝑅</sub> are not essential. For more information, please see our revised paragraph on the choice of the specific parameter values, which has been added following Eq. (21).

      (15) It's not clear how Figures 3 b and c are generated with reference to which parameters are changed to investigate with/without bulk phase separation. 

      To improve clarity, we have revised Figure 3 to display the corresponding parameter values directly in each panel. Figures 3b and 3c were generated by computing the surface binding curves (as shown in Fig. 2) for each binding affinity 𝜔<sub>𝑃𝑅</sub> and membrane-complex interaction strength 𝜒<sub>𝑃𝑅-𝐿</sub>, under different bulk interaction strengths 𝜒, to compare the cases with and without bulk phase separation.

      (16) The jump between theory and the "Mechanism in ..." section is too much. The authors should include the biological context of tight junctions and ZO1 in the main introduction. 

      We appreciate the reviewer’s suggestion. Following this comment, we have added an extended discussion in the main introduction to provide the necessary biological context of tight junctions and ZO1. In addition, we inserted new bridging paragraphs between the theoretical section and the section “Mechanism in tight junction formation” to create a smoother transition from theory to experiments. These revisions help to better connect the theoretical framework with the biological phenomena discussed in the later section.

    1. Author response:

      Reviewer #2

      We respectfully disagree with Reviewer 2’s critiques, upon which the eLife assessment of “incomplete evidence” is primarily based. We believe these critiques do not accurately reflect our study and are rooted in a misinterpretation of the evidence. Consequently, we suggest that the conclusion of “incomplete evidence” is not warranted.

      On the basis of Reviewer 2’s critiques, the eLife assessment states: “However, the evidence presented is incomplete and, in particular, does not distinguish whether this suppression is due to reduced contrast or due to masking.” We emphasize that the suppression we observed is a consequence of interocular masking, not contrast reduction. Reviewer 2 cites Yuval-Greenberg and Heeger (2013), which proposes that during CFS, the mask reduces the gain of neural responses in V1 in a manner analogous to reducing stimulus contrast. We agree that both CFS masking and contrast reduction can decrease signal-to-noise ratio and thereby reduce visibility. However, in our paradigm, the physical stimulus contrast was held constant, while suppression was induced by interocular competition under CFS. This is a fundamentally different mechanism from lowering stimulus contrast. Our results therefore reflect genuine masking-induced suppression, rather than the effect of physical contrast reduction.

      Furthermore, Reviewer 2 cited Yuval-Greenberg and Heeger’s discussion that null results can arise from insufficient data, and suggested that this applies to our study. This main critique from Reviewer 2 is misplaced for two reasons: First, our main result is not a null effect. A null effect would mean that CFS masking had no impact on population orientation responses. Instead, we observed significant suppression, including abolished tuning in some conditions, which clearly indicates a strong effect of masking. Second, our findings are based on large neural populations recorded using two-photon calcium imaging, providing extensive sampling and high statistical power. Thus, concerns about “insufficient data” do not apply to our study.

      Finally, we used machine learning approaches to examine the effects of CFS masking on orientation discrimination and recognition, providing new insight into the long-standing debate over whether the brain can perform high-level cognitive processing under CFS. Although it is, to some extent, a matter of personal judgment whether our work represents a theoretical advance, Reviewer 2 made no comment, positive or negative, on this major component of our study while forming his/her judgment. (In response to Reviewer 3’s main concern about the suitability of SVMs, we have now performed a multi-way classification analysis, which yielded results largely consistent with those obtained using the SVM approach in the original manuscript, confirming the robustness of our machine learning results.)

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this study, participants completed two different tasks: a perceptual choice task in which they compared the sizes of pairs of items, and a value-different task in which they identified the higher value option among pairs of items, with the two tasks involving the same stimuli. Based on previous fMRI research, the authors sought to determine whether the superior frontal sulcus (SFS) is involved in both perceptual and value-based decisions or just one or the other. Initial fMRI analyses were devised to isolate brain regions that were activated for both types of choices and also regions that were unique to each. Transcranial magnetic stimulation was applied to the SFS in between fMRI sessions and it was found to lead to a significant decrease in accuracy and RT on the perceptual choice task but only a decrease in RT on the value-different task. Hierarchical drift diffusion modelling of the data indicated that the TMS had led to a lowering of decision boundaries in the perceptual task and a lowering of nondecision times on the value-based task. Additional analyses show that SFS covaries with model-derived estimates of cumulative evidence, and that this relationship is weakened by TMS.

      Strengths:

      The paper has many strengths, including the rigorous multi-pronged approach of causal manipulation, fMRI and computational modelling, which offers a fresh perspective on the neural drivers of decision making. Some additional strengths include the careful paradigm design, which ensured that the two types of tasks were matched for their perceptual content while orthogonalizing trial-to-trial variations in choice difficulty. The paper also lays out a number of specific hypotheses at the outset regarding the behavioural outcomes that are tied to decision model parameters and well justified.

      We thank the reviewer for their thoughtful summary of the study and for highlighting these strengths. We are pleased that the multi-pronged approach combining causal manipulation, fMRI, and hierarchical drift–diffusion modelling, as well as the careful matching of perceptual content across the two tasks, came across clearly. We also appreciate the reviewer’s positive remarks on the specificity of our a priori hypotheses and their links to decision-model parameters. In revising the manuscript, we have aimed to further streamline the presentation of these hypotheses and to more explicitly connect the behavioural predictions, model parameters, and neural readouts throughout the Results and Discussion sections.

      Weaknesses:

      In my previous comments (1.3.1 and 1.3.2) I noted that key results could be potentially explained by cTBS leading to faster perceptual decision making in both the perceptual and value-based tasks. The authors responded that if this were the case then we would expect either a reduction in NDT in both tasks or a reduction in decision boundaries in both tasks (whereas they observed a lowering of boundaries in the perceptual task and a shortening of NDT in the value task). I disagree with this statement. First, it is important to note that the perceptual decision that must be completed before the value-based choice process can even be initiated (i.e. the identification of the two stimuli) is no less trivial than that involved in the perceptual choice task (comparison of stimulus size). Given that the perceptual choice must be completed before the value comparison can begin, it would be expected that the model would capture any variations in RT due to the perceptual choice in the NDT parameter and not as the authors suggest in the bound or drift rate parameters since they are designed to account for the strength and final quantity of value evidence specifically. If, in fact, cTBS causes a general lowering of decision boundaries for perceptual decisions (and hence speeding of RTs) then it would be predicted that this would manifest as a short NDT in the value task model, which is what the authors see.

      We thank the reviewer for raising these points and for the helpful clarification. We agree that, in principle, the architecture of the value-based task can be conceived as involving an upstream perceptual process that must be completed, to some degree, before value comparison can proceed. Under such a multistage framework, it is indeed possible that cTBS-induced changes in a perceptual decision stage could manifest as a reduction in boundary separation in the pure perceptual task, while the same perturbation appears as a shortening of non-decision time (NDT) when fitting a single-stage DDM to the value task. In this sense, our earlier statement that a “general speeding effect” would necessarily produce identical parameter changes (either NDT or boundaries) in both tasks was too strong, and we are grateful to the reviewer for pointing this out.

      At the same time, this alternative explanation remains fully compatible with our central claim that the left SFS plays a perceptual rather than value-based role. We agree with the reviewer that there must be a stimulus-related circuit (in visual and parietal regions) that encodes the physical attributes of the options, and that this upstream processing can influence both tasks. However, a large body of work suggests that left SFS is not part of this primary identification circuitry, but rather contributes specifically to the accumulation and comparison of sensory evidence (e.g., Heekeren et al., 2004, 2006), downstream from areas such as FFA, PPA, or MT/V5 that encode stimulus identity. In other words, stimulus identification (forming a representation of “what is where”) is anatomically and functionally distinct from the accumulation of evidence toward a perceptual decision. Within this framework, the reviewer’s proposal that cTBS speeds “perceptual decisions” across tasks can be understood as targeting precisely the evidence-accumulation stage we ascribe to SFS, with the value-comparison stage proper likely implemented in other regions (e.g., vmPFC and connected valuation circuitry).

      We therefore do not rely solely on the dissociation between boundary changes in the perceptual task and NDT changes in the value task as decisive evidence against a “general speeding” account. Instead, our interpretation is based on the convergence of behavioural, model-based, and neural results. First, in the perceptual task, cTBS to left SFS leads to a selective reduction in decision boundary and a concomitant change in trialwise BOLD activity within the stimulated region that covaries with perceptual choice behaviour and with the latent decision variable inferred from the HDDM. Second, in the value task, cTBS does not affect value sensitivity or accuracy, nor does it alter value-related drift or boundary parameters; the only robust HDDM effect is a modest shortening of NDT. Third, critically, left SFS BOLD activity is modulated by perceptual evidence and by cTBS in the perceptual task, but we observe no evidence that SFS activity encodes value evidence or shows value-related cTBS neuronal effects in the value task.

      Taken together, these findings indicate that the left SFS serves a causal role in the accumulation of perceptual evidence and in the setting of the choice criterion for perceptual decisions. The reviewer’s suggestion that cTBS may induce a general speeding of perceptual processes that also influences the value task is compatible with this conclusion, in the sense that any contribution of SFS to the value task is best understood as acting via a perceptual component that is upstream of value comparison, rather than via the value accumulation process itself. We have clarified this point in the Discussion of the revised manuscript and now explicitly acknowledge that our DDM dissociation alone does not exclude a general perceptual speeding account, but that the combination of task-specific neural effects in SFS, preserved value-based choice behaviour, and the absence of value-related BOLD changes in SFS strongly support a primarily perceptual role for this region.
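      The multistage account discussed above can be made concrete with a minimal sketch. It assumes a strictly serial model (a perceptual stage followed by a value-comparison stage), unit diffusion noise, unbiased starting points, and the standard closed-form expressions for mean decision time and accuracy of such a diffusion process. All parameter values and function names are hypothetical and are not taken from the manuscript or its fitted models.

```python
import math

def mean_decision_time(a, v):
    """Mean first-passage time of an unbiased DDM (unit noise, start at a/2)."""
    return (a / (2.0 * v)) * math.tanh(a * v / 2.0)

def accuracy(a, v):
    """Probability of terminating at the correct (upper) boundary."""
    return 1.0 / (1.0 + math.exp(-a * v))

# All values below are hypothetical and chosen only for illustration.
T_MOTOR = 0.30                          # residual sensory/motor time (s)
V_PERCEPT, V_VALUE, A_VALUE = 3.0, 1.5, 1.6

def value_task_prediction(a_percept):
    """Serial model: perceptual stage, then value comparison, then response."""
    t_percept = mean_decision_time(a_percept, V_PERCEPT)
    t_value = mean_decision_time(A_VALUE, V_VALUE)
    mean_rt = T_MOTOR + t_percept + t_value
    # A single-stage DDM fitted to the value choices would attribute
    # everything outside the value stage to its non-decision time.
    apparent_ndt = T_MOTOR + t_percept
    return mean_rt, apparent_ndt, accuracy(A_VALUE, V_VALUE)

rt_pre, ndt_pre, acc_pre = value_task_prediction(a_percept=1.2)
rt_post, ndt_post, acc_post = value_task_prediction(a_percept=0.8)  # lowered bound

print(f"mean RT shift: {rt_pre - rt_post:.3f} s")
print(f"apparent NDT shift: {ndt_pre - ndt_post:.3f} s")
print(f"value accuracy unchanged: {acc_pre == acc_post}")
```

      Under these assumptions, lowering the perceptual boundary shortens mean RT in the value task by exactly the amount that the apparent non-decision time shrinks, while value accuracy is untouched. The sketch addresses mean RT only; it treats perceptual identification errors as irrelevant to value accuracy, and it says nothing about full RT distributions.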

      Reviewer #2 (Public review):

      Summary:

      The authors set out to test whether a TMS-induced reduction in excitability of the left Superior Frontal Sulcus influenced evidence integration in perceptual and value-based decisions. They directly compared behaviour (including fits to a computational decision process model) and fMRI pre and post TMS in one of each type of decision-making task. Their goal was to test domain-specific theories of the prefrontal cortex by examining whether the proposed role of the SFS in evidence integration was selective for perceptual but not value-based evidence.

      Strengths:

      The paper presents multiple credible sources of evidence for the role of the left SFS in perceptual decision making, finding similar mechanisms to prior literature and a nuanced discussion of where they diverge from prior findings. The value-based and perceptual decision-making tasks were carefully matched in terms of stimulus display and motor response, making their comparison credible.

      We thank the reviewer for their clear summary of our aims and approach, and for highlighting these strengths. We are pleased that the convergence between causal TMS, fMRI, and hierarchical modelling comes across as providing credible evidence for the role of left SFS in perceptual decision-making, and that our attempt to link these results to the existing literature is seen as appropriately nuanced. We also appreciate the reviewer’s positive assessment of the task design, in particular the close matching of perceptual content and motor output across perceptual and value-based decisions, which was central to our goal of testing domain-specific theories of prefrontal function. In revising the manuscript, we have further clarified these design choices and their rationale, and we have streamlined the exposition of how the hypotheses, model parameters, and neural readouts are connected across the two decision domains.

      Weaknesses:

      I was confused about the model specification in terms of the relationship between evidence level and drift rate. The methods (and e.g. supplementary figure 3) specify a linear relationship between evidence level and drift rate, suggesting, unless I misunderstood, that only a single drift rate parameter (kappa) is fit. However, the drift rate parameter estimates in the supplementary tables (and response to reviewers) do not scale linearly with evidence level.

      We thank the reviewer for raising this point and appreciate the opportunity to clarify the model specification. In our hierarchical DDM, we did not fit separate, free drift parameters for each evidence level. As shown in Supplementary Fig. 3, the drift on each trial is specified as

      𝛿<sub>𝑐,𝑠,𝑖</sub> = κ<sub>𝑐,𝑠</sub> × 𝐸<sub>𝑐,𝑠,𝑖</sub>,

      where 𝐸<sub>𝑐,𝑠,𝑖</sub> is the trial-wise evidence (difference in size or value) and κ<sub>𝑐,𝑠</sub> is a single drift-scaling parameter per condition and session. Thus, the linear dependence of drift on evidence is implemented at the trial level via 𝜅; we do not estimate independent 𝛿 parameters for each evidence level.

      In Supplementary Tables 8 and 9 we report, for descriptive purposes, the posterior means of 𝛿 conditional on each evidence bin (levels 1–4), alongside the corresponding decision boundary and nondecision time summaries. These values are therefore derived quantities that reflect the combination of (i) the single κ<sub>𝑐,𝑠</sub> parameter, (ii) the empirical distribution of continuous evidence values 𝐸 within each bin, and (iii) hierarchical pooling across subjects and sessions. Consequently, they are expected to increase monotonically with evidence level—as they do in our data—but not to lie exactly on a straight line in the discrete level index, because the underlying evidence bins are not equally spaced in physical units and because of between-subject variability and posterior uncertainty.

      We will revise the text and table captions to make clear that the evidence-level entries are descriptive summaries of 𝛿 implied by the 𝜅×𝐸 formulation, rather than independently estimated drift parameters, in order to avoid this confusion.
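      The distinction between the single 𝜅 parameter and the binned summaries of 𝛿 can be illustrated with a small sketch. The drift-scaling value `kappa`, the unequally spaced `bin_edges`, and the uniform distribution of evidence are all hypothetical choices for illustration, not values from the study. The point is only that a strictly linear trial-level rule yields per-bin means that rise monotonically without being linear in the level index.

```python
import random
import statistics

random.seed(0)

kappa = 1.5  # hypothetical drift-scaling parameter (not a fitted value)
# Hypothetical, unequally spaced bin edges for continuous evidence E
bin_edges = [0.0, 0.1, 0.3, 0.7, 1.5]

def bin_index(e, edges):
    """Return the 1-based evidence level whose bin contains e."""
    for level in range(1, len(edges)):
        if e < edges[level]:
            return level
    return len(edges) - 1  # e sits exactly on the top edge

# Trial-wise evidence drawn uniformly over the full range
evidence = [random.uniform(bin_edges[0], bin_edges[-1]) for _ in range(20000)]
drift = [kappa * e for e in evidence]  # delta = kappa * E on every trial

# Descriptive summary: mean drift within each evidence level
mean_drift = {}
for level in range(1, len(bin_edges)):
    values = [d for d, e in zip(drift, evidence)
              if bin_index(e, bin_edges) == level]
    mean_drift[level] = statistics.mean(values)

# Increments between successive levels; equal increments would mean linearity
steps = [mean_drift[k + 1] - mean_drift[k] for k in range(1, 4)]
print(mean_drift)
print(steps)
```

      With these toy bins the increments between successive levels grow, even though every trial obeys the same linear rule. Between-subject variability and posterior uncertainty, which the response also mentions, are not modelled here.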

      - The fit quality for the value-based decision task is not as good as that for the PDM, and this would be worth commenting on in the paper.

      We agree that the HDDM fit for the value-based task is somewhat weaker than for the perceptual task. This is reflected in the somewhat higher DIC values for VDM compared with PDM and in slightly broader posterior-predictive distributions (Supplementary Tables 8–11 and Supplementary Figs. 11–16). We believe this difference primarily reflects the greater intrinsic variability of subjective value-based choices (e.g. trial-to-trial fluctuations in preferences, satiety, or attention), coupled with our decision to use the same relatively simple DDM architecture for both tasks to allow a principled cross-task comparison. Importantly, posterior-predictive checks show that, for VDM as well, the model adequately reproduces both accuracy and full RT distributions at the group and subject level (Supplementary Figs. 11–16), indicating that the fit quality is sufficient for our purposes. In the revised manuscript we now explicitly note that the model captures PDM behaviour more tightly than VDM and that this may reduce sensitivity to very small cTBS effects on value-based decision parameters, even though no systematic effects are evident in our data. Crucially, our central conclusion—that left SFS plays a domain-specific role in setting the decision boundary for perceptual evidence—relies on the robust behavioural, computational, and neural effects observed in PDM and does not depend on assuming a perfect model fit for VDM.

      - Supplementary Figure 3 specifies the distribution for kappa hyper-parameter twice.

      We thank the reviewer for spotting this typo. We have revised the legend of Supplementary Figure 3.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #2 (Public review):

      Summary:

      The current article presents a new type of analytical approach to the sequential organisation of whale song units.

      Strengths:

      The detailed description of the internal temporal structure of whale songs is something that has been thus far lacking.

      Weaknesses:

      The conceptual and terminological bases of the paper are problematical and hamper comparison with other taxa, including humans. According to signal theory, codas are indexical rather than symbolic. They signal an individual's group identity. Borrowing from humans and linguistics, coda inter-group variation represents a case of accents - phonologically different varieties of the same call - not dialects, confirming they are an index. This raises serious doubt about whether alleged "symbolism" and similarity between whale and human vocal behaviour is factual.

      We respect that the reviewer does not agree with describing codas as symbolic markers of cultural identity in sperm whales, but ultimately we find the quantitative evidence presented in Hersh et al. (2022) compelling, and stand by the framing of our manuscript, which builds on this foundation.

      The same applies to the difference between ICIs (inter-click intervals) and IOIs (inter-onset intervals). If the two are equivalent, variation in click duration needs to be shown to be so small that it can be considered negligible. This raises serious doubt about whether the alleged variation in whale codas is indeed rhythmic in nature and prevents future efforts for comparison with the vocal capacities of other species. The scope and relevance of this paper for the broader field is limited.

      We believe there has been a miscommunication. Coda inter-click intervals are calculated as the time between the onsets of sequential clicks within a coda. This is identical to definitions of inter-onset intervals in many publications, including:

      • Burchardt and Knörnschild (2020): “the duration between the beginning of one element and the next”

      • Friberg and Battel (2002): “the time interval between the onset of the tone and the onset of the immediately following tone”

      • De Gregorio et al. (2021): “the time between the onset of a note and the next one”

      In response to a comment from this reviewer in the first round of revisions, we made the point that we do not believe rhythm analyses need be restricted to inter-onset intervals alone. Regardless of that stance, we did analyze inter-onset intervals in this manuscript and accordingly are capturing aspects of rhythm in our analyses. We have removed a poorly worded sentence in our introduction and apologize for any confusion it caused. We have also made this explicit in lines 30–35: “This classification is based on the total number of clicks and their rhythm and tempo extrapolated from the time interval between the onsets of consecutive clicks: the inter-click interval (ICI) [15, 16] (Fig. 1A). This measure is equivalent to the inter-onset intervals (IOIs) often used in rhythm analyses [17, 18, 19] but for the sake of compatibility with studies on sperm whale acoustics, we use ICI terminology throughout this paper.”

      In our analyses, inter-click intervals and inter-onset intervals are equivalent measures.
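      The onset-to-onset definition quoted above can be sketched in a few lines. The click onset times and durations below are invented for illustration, and the function name is hypothetical. The sketch shows that an interval measured between successive onsets does not depend on click duration, whereas an offset-to-onset gap does.

```python
def inter_onset_intervals(onsets):
    """Time between the onset of each element and the onset of the next."""
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]

# Hypothetical click onset times (seconds) for one five-click coda
click_onsets = [0.00, 0.25, 0.50, 0.75, 1.10]
# Hypothetical click durations; they do not enter the onset-to-onset measure
click_durations = [0.010, 0.012, 0.009, 0.011, 0.010]

# Inter-click intervals computed exactly as inter-onset intervals
ici = inter_onset_intervals(click_onsets)

# Offset-to-onset gaps, by contrast, do depend on click duration
gaps = [nxt - (on + dur)
        for on, dur, nxt in zip(click_onsets, click_durations, click_onsets[1:])]

print(ici)
print(gaps)
```

      Changing any value in `click_durations` alters `gaps` but leaves `ici` untouched, which is why the two labels describe the same quantity when intervals are measured from onset to onset.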

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      My concerns regarding interdisciplinary terminology and methods remain unaddressed. The study's inaccurate terminology hinders reliable comparison with other taxa, including humans. Being "symbolic" bears no weight on the new method that the authors present, thus, the unwillingness for compatibility is limiting and perplexing. The authors state that codas have been previously described as being symbolic, but just because poor terminology has been used before doesn't justify perpetuating it, especially when it confounds and conflicts with broader comparative efforts.

      We agree that being symbolic bears no weight on the new method we present, but we believe it does bear weight on our interpretation of what our method reveals about patterns in sperm whale communication. For that reason, we have opted to maintain the current framing of our manuscript.

      The same applies to the difference between ICIs and IOIs. The authors resist amending terminology, even though they state the two represent the same measure. If so, what prevents the correct use of IOIs?

      We have opted to use ICI throughout the paper because it is standard terminology in sperm whale acoustics, but we have now made the ICI/IOI equivalence explicitly clear in the introduction.

      References:

      Burchardt LS, Knörnschild M. 2020. Comparison of methods for rhythm analysis of complex animals’ acoustic signals. PLoS Computational Biology 16. doi:10.1371/journal.pcbi.1007755

      De Gregorio C, Valente D, Raimondi T, Torti V, Miaretsoa L, Friard O, Giacoma C, Ravignani A, Gamba M. 2021. Categorical rhythms in a singing primate. Current Biology 31:R1379–R1380. doi:10.1016/j.cub.2021.09.032

      Friberg A, Battel GU. 2002. Structural communication. In: Parncutt R, McPherson G, editors. The Science & Psychology of Music Performance: Creative Strategies for Teaching and Learning. Oxford University Press. doi:10.1093/acprof:oso/9780195138108.001.0001

      Hersh TA, Gero S, Rendell L, Cantor M, Weilgart L, Amano M, Dawson SM, Slooten E, Johnson CM, Kerr I, Payne R, Rogan A, Andrews O, Ferguson EL, Hom-Weaver CA, Norris TF, Barkley YM, Merkens KP, Oleson EM, Doniol-Valcroze T, Pilkington J, Gordon J, Fernandes M, Guerra M, Hickmott L, Whitehead H. 2022. Evidence from sperm whale clans of symbolic marking in non-human cultures. Proceedings of the National Academy of Sciences 119:e2201692119. doi:10.1073/pnas.2201692119

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This manuscript uses adaptive sampling simulations to understand the impact of mutations on the specificity of the enzyme PDC-3 β-lactamase. The authors argue that mutations in the Ω-loop can expand the active site to accommodate larger substrates.

      Strengths:

      The authors simulate an array of variants and perform numerous analyses to support their conclusions. The use of constant pH simulations to connect structural differences with likely functional outcomes is a strength.

      Weaknesses:

      I would like to have seen more error bars on quantities reported (e.g., % populations reported in the text and Table 1).

      We appreciate this point. Here, the population we analyze is intended to showcase conformational differences across variants rather than to estimate equilibrium occupancies. Although each system includes 100 trajectories, they were generated using an adaptive-bandit protocol. The protocol deliberately guides sampling towards underexplored basins; therefore, conformational heterogeneity between trajectories is expected by design. For example, in E219K the MSM decomposition shows that in states 1, 6, and 7 the K67(NZ)–S64(OG) distance is almost entirely > 6 Å, whereas in states 2 and 3 it is almost entirely < 3.5 Å (Figure 5—figure supplement 12). These distances suggest that the hydrogen bond fraction is approximately zero in states 1, 6, and 7, and close to one in states 2 and 3. In addition, the mean first passage time of the Markov state models suggests that the formation and disruption of this hydrogen bond occur on the microsecond timescale, which is far longer than the length of each individual trajectory (300 ns). Consequently, across the 100 replicas, some trajectories exhibit very low fractions, while others display the opposite trend. Under such bimodal, protocol-induced heterogeneity, computing an error bar across trajectories mainly visualizes the protocol’s dispersion and risks being misread as thermodynamic uncertainty, which is not central to our aim of comparing conformational differences between wild-type PDC-3 and variants. We therefore do not include the error bars.
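      The bimodality argument can be illustrated with synthetic data. The sketch below assumes that each trajectory remains in the basin it starts in, uses invented basin centres and widths, and splits the replicas evenly between the two basins. None of these choices comes from the simulations themselves; only the 3.5 Å threshold and the count of 100 trajectories echo the text.

```python
import random
import statistics

random.seed(1)

CUTOFF = 3.5  # Å; distance below which the hydrogen bond is counted as formed
N_TRAJ, N_FRAMES = 100, 300  # toy sizes, far fewer frames than real trajectories

def toy_trajectory(bonded):
    """Synthetic K67(NZ)-S64(OG) distances for a trajectory stuck in one basin."""
    centre = 3.0 if bonded else 7.0  # hypothetical basin centres in Å
    return [random.gauss(centre, 0.2) for _ in range(N_FRAMES)]

# Assumption: transitions are slower than a single trajectory, so each
# replica samples only the basin it was seeded in (half bonded, half not).
trajectories = [toy_trajectory(bonded=(i % 2 == 0)) for i in range(N_TRAJ)]

def hbond_fraction(distances, cutoff=CUTOFF):
    """Fraction of frames in which the donor-acceptor distance is below cutoff."""
    return sum(d < cutoff for d in distances) / len(distances)

fractions = [hbond_fraction(t) for t in trajectories]
mean_fraction = statistics.mean(fractions)
spread = statistics.stdev(fractions)

print(f"mean = {mean_fraction:.2f}, SD across trajectories = {spread:.2f}")
```

      In this toy setting every per-trajectory fraction is close to either 0 or 1, so the standard deviation across trajectories is about as large as the mean. It reflects how the replicas were seeded rather than uncertainty in an equilibrium population.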

      Reviewer #2 (Public review):

      Summary:

      In the manuscript entitled "Ω-Loop mutations control dynamics of the active site by modulating the hydrogen-bonding network in PDC-3 β-lactamase", Chen and coworkers provide a computational investigation of the dynamics of the enzyme Pseudomonas-derived cephalosporinase 3 (PDC3) and some mutants associated with increased antibiotic resistance. After an initial analysis of the enzyme dynamics provided by RMSD/RMSF, the authors conclude that the mutations alter the local dynamics within the omega loop and the R2 loop. The authors show that the network of hydrogen bonds is disrupted in the mutants. Constant pH calculations showed that the mutations also change the pKa of the catalytic lysine 67, and pocket volume calculations showed that the mutations expand the catalytic pocket. Finally, time-independent component analysis (tiCA) showed different profiles for the mutant enzyme as compared to the wild type.

      Strengths:

      The scope of the manuscript is definitely relevant. Antibiotic resistance is an important problem, and, in particular, Pseudomonas aeruginosa resistance is associated with an increasing number of deaths. The choice of the computational methods is also something to highlight here. Although I am not familiar with Adaptive Bandit Molecular Dynamics (ABMD), the description provided in the manuscript suggests that this simulation strategy is well-suited for the problem under evaluation.

      Weaknesses:

      In the description of many of their results, the authors do not provide enough information for a deep understanding of the biochemistry/biophysics involved. Without these issues addressed, the strength of the evidence is of concern.

      We thank the reviewer for pointing out the need for deeper discussion of the biochemical and biophysical implications of our results. In our manuscript, we begin by examining basic structural metrics (e.g., RMSD and RMSF) which clearly indicate that the major conformational changes occur in the Ω-loop and the R2 loop. We have now added a paragraph to describe the importance of the Ω-loop and highlighted it in the revised manuscript on lines 142-166 of page 6. This observation guided our subsequent focus on these regions, as well as on the catalytic site. Our analysis revealed notable alterations in the hydrogen bonding network, especially in interactions involving the K67-S64, K67-N152, K67-G220, Y150-A292, and N287-N314 pairs. These observations led us to conclude that:

      (1) Mutations E219K and Y221A facilitate the proton transfer of catalytic residues. This is consistent with prior experimental data showing that these substitutions produce the most pronounced increase in sensitivity to cephalosporin antibiotics (lines 210-212 on page 8 of the revised manuscript).

      (2) Substitutions enlarge the active-site pocket to accommodate bulkier R1 and R2 groups of β-lactams. This is in line with MIC measurements reported by Barnes et al. (2018), which showed that mutants with larger active-site pockets exhibit markedly greater sensitivity to cephalosporins with bulky side chains than others (lines 249-259 on page 10).

      Furthermore, we applied Markov state models (MSMs) to explore the timescales of the transitions between these different conformational states. We believe that these methodological steps support our conclusions.

      Reviewer #3 (Public review):

      Summary:

      This manuscript aims to explore how mutations in the PDC-3 β-lactamase alter its ability to bind and catalyse reactions of antibiotic compounds. The topic is interesting, and the study uses MD simulations to provide hypotheses about how the size of the binding site is altered by mutations that change the conformation and flexibility of two loops that line the binding pocket. However, the study doesn't clearly describe the way the data is generated. While many results appear significant by eye, quantifying this and ensuring convergence would strengthen the conclusions.

      Strengths:

      The significance of the problem is clearly described, and the relationship to prior literature is discussed extensively.

      Weaknesses:

      The methods used to gain the results are not explained clearly, meaning it was hard to determine exactly how some data was obtained. The convergence and uncertainties in the data were not adequately quantified. The text is also a little long, which obscures the main findings.

      We thank the reviewer for the suggestion. We respectfully ask the reviewer to specify which aspects of the data-generation methods are unclear so that we can include the necessary details in the next revision. Moreover, all statistics that are reported in the manuscript are obtained from extensive analyses of 300,000 simulation frames. The Markov state models have been validated by the ITS plots and Chapman-Kolmogorov (CK) test. The two-sample t-tests were also carried out for the volume and SASA.

      Reviewer #2 (Recommendations for the authors):

      (1) Figure 1D focuses on the PDC3 catalytic site. However, the authors mentioned before that the enzyme has two domains, an alpha domain and an alpha/beta domain. The reader would benefit from a more detailed description of the enzyme, its active site, AND the location of the mutants under investigation in the figure.

      We have updated Figure 1D and marked the positions of all mutations (V211A/G, G214A/R, E219A/G/K and Y221A/H), which have now been highlighted as spheres.

      (2) Since, in the journal format, the results come before the methods, it would be interesting to add a brief description of where the results came from. For example, in the first section of the results, the authors describe the flexibility of the omega loop and the R2 loop. However, the reader won't know what kind of simulation was used and for how long, for example. A sentence would add the required context for a deeper understanding here.

      At the beginning of the Results and Discussion section we now state: “To investigate how the mutations in the Ω-loop affect PDC-3 dynamics, adaptive-bandit molecular dynamics (AB-MD) simulations were carried out for each system. 100 trajectories of 300 ns each (totaling 30 μs per system) were run.”

      (3) Still in the same section, the authors don't define what change in RMSF is considered significant. For example, I can't see a relevant change in the RMSF for the omega loop between the wt enzyme and the E219 mutants in Figure 2D. A more objective definition would be of benefit here.

      Our analysis reveals that while the wild-type PDC-3 and the G214A, G214R, E219G, and Y221A variants exhibit an average per-residue RMSF of around 4 Å in the Ω-loop, the V211A and V211G variants show markedly lower values (around 1.5 Å), and the E219K and Y221H variants exhibit intermediate values between 2 and 2.5 Å. In addition, the fluctuations around the binding site should be seen collectively along with the fluctuations in the R2-loop. Importantly, we urge the reviewer to focus on the MDLovofit analysis in Figure 2C, where the dynamic differences between the core and the fluctuating loops are clearly evident.
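      For readers less familiar with the metric, the per-residue RMSF values compared above can be computed as in the following sketch. It assumes the frames have already been superposed onto a common reference to remove global rotation and translation. The two-atom toy trajectory and the loop index list are invented for illustration, and the function names are hypothetical.

```python
import math

def rmsf(frames):
    """Per-atom RMSF from aligned frames; frames[t][i] is an (x, y, z) tuple."""
    n_frames, n_atoms = len(frames), len(frames[0])
    result = []
    for i in range(n_atoms):
        # Time-averaged position of atom i
        mean = [sum(f[i][k] for f in frames) / n_frames for k in range(3)]
        # Mean squared deviation from that average position
        msd = sum(
            sum((f[i][k] - mean[k]) ** 2 for k in range(3)) for f in frames
        ) / n_frames
        result.append(math.sqrt(msd))
    return result

def region_average(values, indices):
    """Average RMSF over a set of atom or residue indices (e.g. one loop)."""
    return sum(values[i] for i in indices) / len(indices)

# Toy trajectory: atom 0 is rigid, atom 1 swings by 1 Å either side of its mean
toy_frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
]
values = rmsf(toy_frames)
loop_indices = [1]  # hypothetical indices standing in for a flexible loop
print(values, region_average(values, loop_indices))
```

      Averaging `values` over the residues of one loop gives a single number per variant of the kind quoted in the response. Any threshold for a meaningful difference between variants remains a separate choice that this sketch does not address.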

      (4) In line 138, the authors state that "Therefore, the flexibility of these proteins is mainly caused by the fluctuations in the Ω-loops and R2-loop". This is quite a bold statement to be drawn at this point. First of all, there is no mention of it in the manuscript, but is there any domain movement? Figure 2C clearly shows that there is some mobility in omega and R2 loops. But there is no evidence shown in the manuscript that shows that "the flexibility of these proteins is mainly caused by the fluctuations in the" loops. Please consider rephrasing this sentence or adding more data, if available.

      We have revised the wording to take the reviewer’s concern into account. The sentence now states: “Therefore, flexibility of PDC-3 is predominantly localized to the Ω- and R2-loops, whereas the remainder of the structure is comparatively rigid.” To explain further, β-lactamase enzymes are fairly rigid structures in which no large-scale domain motions occur. Instead, the enzyme communicates structurally via cross-correlation of loop dynamics (https://doi.org/10.7554/eLife.66567).

      (5) I guess, the most relevant question for the scope of the paper is not answered in this section. The authors show that the mobility of the omega- and R2-loops is altered by some mutations. Why is that? I wish I could see a figure showing where the mutations are and where the loops are. This question will come back in other sections.

      We have updated Figure 1D to mark the positions of all mutations (V211A/G, G214A/R, E219A/G/K and Y221A/H) as spheres. The Ω- and R2-loops are also highlighted. All mutations map to the Ω-loop, indicating that these substitutions directly perturb this region. Notably, K67 forms a hydrogen bond with the backbone of G220 within the Ω-loop and another with the phenolic hydroxyl of Y150. Y150, in turn, hydrogen-bonds with A292 in the R2-loop. Together, the residue interaction network (G220–K67–Y150–A292) suggests a pathway by which Ω-loop mutations propagate their effects to the R2-loop.

      (6) The authors then analyze the network of polar residues in the active site and the hydrogen bonds observed there. For the K67-N152 hydrogen bond, for example, there is a reduction in the occupancy from ~70% in the wild-type enzyme to ~30% and 40% in the mutants E219K and Y221, respectively. This finding is interesting. The question that remains is "why is that"? From the structural point of view, how does the replacement of E219 with a Lysine alter the hydrogen bond formation between K67 and N152? Is it due to direct competition? Solvent rearrangement? The reader is left without a clue in this section. Also, Figure 3B won't help the reader, since the mutated residues are not shown there. Please consider adding some information about why the authors believe that the mutations are disrupting the active site hydrogen bond network and showing it in Figure 3B.

      We appreciate the comment and have updated Figures 1D and 3B to highlight the mutation sites. The change from ~70% in the wild type to ~30–40% in the E219K and Y221T variants reported in Table 1 refers to the S64–K67 hydrogen bond. In the wild type, K67 forms an additional hydrogen bond with G220 on the Ω-loop, which helps anchor the K67 side chain in a geometry that favors the S64–K67 interaction. In the variants, the mutations reshape the Ω-loop and frequently disrupt the K67–G220 contact. The loss of this local anchor increases the conformational dispersion of K67, which is consistent with the observed reduction of the S64–K67 occupancy. Furthermore, our observation that the mutations are disrupting the active-site hydrogen-bond network is a data-driven conclusion rather than a subjective inference. Across ten systems, our AB-MD simulations provided 30 µs of sampling per system. Saving one frame every nanosecond yielded 30,000 conformations per system and 300,000 in total. All hydrogen-bond and salt-bridge statistics were computed over this full ensemble. Thus, the conclusion that the mutations disrupt the active-site hydrogen-bond network follows directly from these ensemble statistics. 
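
      The sampling figures quoted above can be checked with a few lines of arithmetic, and the same ensemble can be reduced to a hydrogen-bond occupancy. The following is a minimal sketch, not the authors' analysis pipeline: the 3.5 Å distance cutoff, the `occupancy` helper, and the toy distances are assumptions made for illustration (the response does not state the hydrogen-bond criteria, which often also include an angle term).

```python
# Sampling budget stated in the response: 100 trajectories x 300 ns per system,
# ten systems, one saved frame per nanosecond.
N_TRAJ = 100
TRAJ_NS = 300
N_SYSTEMS = 10
FRAME_STRIDE_NS = 1

ns_per_system = N_TRAJ * TRAJ_NS                      # 30,000 ns = 30 microseconds
frames_per_system = ns_per_system // FRAME_STRIDE_NS  # 30,000 conformations
frames_total = frames_per_system * N_SYSTEMS          # 300,000 conformations


def occupancy(distances, cutoff=3.5):
    """Fraction of frames whose donor-acceptor distance (angstroms) is within the cutoff.

    The 3.5 angstrom default is an assumed, commonly used value, not one
    taken from the manuscript.
    """
    if not distances:
        raise ValueError("no frames supplied")
    return sum(d <= cutoff for d in distances) / len(distances)


# Hypothetical donor-acceptor distances for one hydrogen bond over ten frames.
toy_distances = [2.9, 3.1, 4.2, 3.0, 5.1, 2.8, 3.3, 4.8, 3.4, 6.0]
toy_occupancy = occupancy(toy_distances)  # 6 of 10 frames are within the cutoff
```

      In a real analysis the distance list would hold one value per saved frame, so each occupancy would be an average over 30,000 frames per system.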

      (7) The pKa calculations and the pocket volume calculations show that the mutations expand the volume of the catalytic site and alter the microenvironment. Is there any change in the solvation associated with these changes? If the volume expands and the environment becomes more acidic, are there more water molecules in the mutants as compared to the wt enzyme? If so, can changes in solvation be associated with the changes in the hydrogen bond network? Would a simulation in the presence of a substrate be meaningful here? ( I guess it would!).

      Regarding solvation, we observe a modest increase in transient water occupancy associated with the increase in volume of the pocket. The conserved deacylation water molecule is the most important and is always present throughout the simulation. Additional waters enter and leave the pocket but do not form persistent interactions that measurably perturb the hydrogen-bond network of the Ω- and R2-loops. We agree that simulations with a bound substrate would be informative. However, our study focuses on how Ω-loop mutations modulate the active site of apo PDC-3 and its variants. Within this scope, we find: (i) Amino acid substitutions change the flexibility of Ω-loops and R2-loops; (ii) E219K and Y221A mutations facilitate the proton transfer; (iii) Substitutions enlarge the active-site pocket to accommodate bulkier R1 and R2 groups of β-lactams.

      (8) I have some concerns regarding the Markov State Modeling as shown here. After a time-lagged independent component analysis (tICA), the authors show the projections on the components, which differ between the wild-type enzyme and the mutants, and draw some conclusions from these changes. For example, the authors state that "From the metastable state results, we observe that E219K adopts a highly stable conformation in which all the tridentate hydrogen-bonding interactions (K67(NZ)-S64(OG), K67(NZ)-N152(OD1) and K67(NZ)-G220(O)) mentioned above are broken". This conclusion is very difficult to draw from Figure 5 alone. Unless the macrostates observed in the MSM can be shown (their structures) and can confirm the broken interactions, I really don't believe that the reader can come to the same conclusion as the authors here. I would recommend that the authors map the macrostates back to the coordinates and show them (what structure corresponds to what macrostate). After showing that, it makes sense to discuss what macrostate is being favored by what mutation. Drawing conclusions from tICA projections alone is not recommended. I very strongly suggest that the authors revisit this entire section, adding more context so that the reader can draw conclusions from the data that are shown.

      We appreciate the reviewer’s concern. In the Markov state modeling section, our objective is to quantify the timescales (via mean first passage times) associated with the formation and disruption of the critical hydrogen bonds (K67(NZ)-S64(OG), K67(NZ)-N152(OD1), K67(NZ)-G220(O), Y150(N)-A292(O), N287(ND2)-N314(OD1)) mentioned above. Representative structures illustrating these interactions are shown in Figures 3B and 4A. We agree that the main Figure 5 alone does not convey structural information. Accordingly, we provide Figure 5—figure supplements 12–16. Together, Figure 5B and Figure 5—figure supplements 12–16 map structures to metastable states, whereas Figures 3B and 4A supply atomistic detail of the interactions. Author response image 1 presents selected subplots from Figure 5—figure supplements 12–14. Together with the free-energy landscape in Figure 5A, these data indicate that E219K adopts a highly stable conformation in which all three K67-centered hydrogen bonds (K67(NZ)–S64(OG), K67(NZ)–N152(OD1), and K67(NZ)–G220(O)) are broken.

      Author response image 1.

      The TICA plot illustrates the distribution of E219K conformations, with the colour indicating the K67(NZ)-S64(OG), K67(NZ)-N152(OD1) and K67(NZ)-G220(O) distances.
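
      To make concrete what a mean first passage time (MFPT) from a Markov state model measures, the sketch below computes MFPTs on a toy three-state chain. It is an illustration under stated assumptions, not the authors' workflow: the transition matrix, the state labels, the 1 ns lag time, and the helper `mfpt_to_target` are all hypothetical. The MFPT m_i to a target state satisfies m_target = 0 and m_i = 1 + Σ_j P_ij m_j (in units of the lag time), which is solved here by fixed-point iteration.

```python
def mfpt_to_target(P, target, lag_time=1.0, tol=1e-12, max_iter=1_000_000):
    """Mean first passage times from every state to `target`.

    P is a row-stochastic transition matrix estimated at some lag time.
    Solves m_i = 1 + sum_j P[i][j] * m_j for i != target, with m_target = 0,
    by repeated substitution, then converts steps to time units.
    """
    n = len(P)
    m = [0.0] * n
    for _ in range(max_iter):
        new = [
            0.0 if i == target
            else 1.0 + sum(P[i][j] * m[j] for j in range(n) if j != target)
            for i in range(n)
        ]
        delta = max(abs(a - b) for a, b in zip(new, m))
        m = new
        if delta < tol:
            break
    return [lag_time * steps for steps in m]


# Hypothetical 3-state model: 0 = bond formed, 1 = intermediate, 2 = bond broken.
P = [
    [0.90, 0.10, 0.00],
    [0.05, 0.90, 0.05],
    [0.00, 0.10, 0.90],
]

# With an assumed lag time of 1 ns, the results are in nanoseconds.
mfpt = mfpt_to_target(P, target=2, lag_time=1.0)
```

      For this toy matrix the linear equations can be solved by hand, giving 40 lag times from state 0 and 30 from state 1, which the iteration reproduces.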

      (9) As a very minor issue, there are a few typos in the manuscript text. The authors might want to take some time to revisit their entire text. Examples in lines 70, 197, etc.

      Thank you for your comment. We have corrected these typos.

      Reviewer #3 (Recommendations for the authors):

      This manuscript aims to explore how mutations in the PDC-3 β-lactamase alter its ability to bind and catalyse reactions of antibiotic compounds. The topic is interesting, and the study uses MD simulations to provide hypotheses about how the size of the binding site is altered by mutations that change the conformation and flexibility of two loops that line the binding pocket.

      However, the study doesn't clearly describe the way the data is generated and potentially lacks statistical rigour, which makes it uncertain if the key results are significant. As such, it is difficult to judge if the conclusions made are supported by data.

      All necessary data-acquisition methods are described in the Methods section. The Markov state models have been validated by the implied timescales (ITS) plot and the Chapman-Kolmogorov (CK) test (Figure 5—figure supplements 2–11). Two-sample t-tests were also carried out for the volume and SASA (Table 2).

      The results section jumps straight to reporting RMSD and RMSF values; however, it is not clear what simulations are used to generate this information. Indeed, the main text does not mention the simulations themselves at all. The methods section mentions that 10 independent MD simulations were set up for each system, but no information is given as to how long these were run or the equilibration protocol used. Then it says that AB-MD simulations were run, but it is not clear what starting coordinates were used for this or how the 10 replicates were fed into these simulations. Most importantly, are the RMSD and RMSF calculations and later distance distribution information derived from the equilibrium MD runs or from the AB-MD simulations?

      Thank you for pointing this out. We have added “To investigate how the mutations in the Ω-loop affect PDC-3 dynamics, adaptive-bandit molecular dynamics (AB-MD) simulations were carried out for each system. 100 trajectories of 300 ns each (totaling 30 μs per system) were run.” to the Results and Discussion section. We did not run 10 independent MD simulations per system. We regret the typo in the Methods section that confused the reviewer. The sentence should have read: ‘All-atom MD simulations of wild-type PDC-3 and its variants were performed.’ Each system was equilibrated for 5 ns at 1 atm using the Berendsen barostat. AB-MD simulations were initiated from these equilibrated structures. All analyses, apart from CpHMD, are based on the AB-MD trajectories.

      If these are taken from the equilibrium simulations, then it is critical that the reproducibility and statistical significance of the simulations is established. This can be done by calculating the RMSD and RMSF values independently for each replicate and determining the error bars. From this, the significance of differences between WT and mutant simulations can be determined. Without this, I have no data to judge if the main conclusions are supported or not. If these are derived from the AB-MD simulations, then I want to know how the independent simulations were combined and reweighted to generate overall RMSD, RMSF, and distance distributions. Unless I misunderstand the approach, the individual simulations no longer sample all regions of conformational space the same relative amount you would see in a standard MD simulation - specific conformational regions are intentionally run more to enhance sampling, then the overall conformational distributions cannot be obtained from these simulations without some form of reweighting scheme. But no such scheme is described. In addition, convergence of the data is required to ensure that the RMSD, RMSF, and distances have reached stable values. It is possible that I am misunderstanding the approach here. But in that case, I hope the authors can clarify the method and provide a means of ensuring that the data presented is converged. Many of the differences are clear by eye, but it is important to know they are not random differences between simulations and rather reflect differences between them.

      Thank you for raising this important point. In our AB-MD workflow, the adaptive bandit is used only for starting-structure selection (adaptive seeding). After each epoch, it chooses new starting snapshots from previously sampled conformations and launches the next runs. Each trajectory itself is standard, unbiased MD with no biasing potentials and no modification of the Hamiltonian. In other words, AB decides where we start, but does not alter the physics or sampling dynamics within an individual trajectory. In addition, our goal in this work is to compare variants under the same adaptive-bandit (AB) protocol, rather than to estimate equilibrium (Boltzmann) populations. Hence, we did not apply equilibrium reweighting to RMSD, RMSF, or distance distributions. However, the MSM section provides reweighted reference results based on the MSM stationary distribution.
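
      The difference between pooled frame counts from adaptively seeded runs and populations reweighted through an MSM stationary distribution can be illustrated on a toy example. This is a minimal sketch under stated assumptions, not the authors' MSM pipeline: the three short discretized trajectories, the two-state discretization, and the helper names are hypothetical, and real models need far more data plus the validation steps (ITS, CK test) mentioned in the response.

```python
def transition_matrix(trajectories, n_states, lag=1):
    """Row-normalized transition counts at the given lag (in frames, lag >= 1)."""
    counts = [[0.0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        for a, b in zip(traj[:-lag], traj[lag:]):
            counts[a][b] += 1.0
    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0.0:
            raise ValueError("a state has no outgoing transitions")
        matrix.append([c / total for c in row])
    return matrix


def stationary_distribution(P, tol=1e-12, max_iter=100_000):
    """Power iteration for pi satisfying pi = pi P (assumes an ergodic chain)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        converged = max(abs(a - b) for a, b in zip(new, pi)) < tol
        pi = new
        if converged:
            break
    return pi


# Hypothetical discretized trajectories; two of the three were seeded in state 1,
# mimicking adaptive seeding that oversamples a rarely visited region.
trajs = [[0, 0, 0, 0, 1], [1, 1, 0, 0, 0], [1, 1, 1, 0, 0]]

n_frames = sum(len(t) for t in trajs)
raw_fractions = [sum(t.count(s) for t in trajs) / n_frames for s in range(2)]

P = transition_matrix(trajs, n_states=2)
pi = stationary_distribution(P)
```

      Here the pooled frames put 60% of the weight on state 0, whereas the stationary distribution of the estimated transition matrix assigns it 14/19 (about 74%), because the seeding choice inflates the raw count of state 1 while the within-trajectory transitions remain unbiased.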

      In the response to reviews, the authors state that the "RMSF is a statistical quantity derived from averaging the time series of atomic displacements, resulting in a fixed value without an inherent error bar." But normally we would run multiple replicates and get an error bar from the different values in each. To dismiss the request for uncertainties and error bars seems to miss the point. I strongly agree with the prior reviewer that comparisons between RMSF or other values should be accompanied by uncertainties and estimates of statistical significance.

      Regarding the reviewers’ suggestion to present the data as a bar graph with error bars, we would like to note that RMSF is calculated as the time average of the fluctuations of each residue’s Cα atom over the entire simulation. As such, RMSF is a statistical quantity derived from averaging the time series of atomic displacements, resulting in a fixed value without an inherent error bar. We believe that our current presentation clearly and accurately reflects the local flexibility differences among the variants. Nearly all published studies report RMSF in this way, as indicated by the following examples:

      Figure 3a in DOI: https://doi.org/10.1021/jacsau.2c00077

      Figure 2 in DOI: https://doi.org/10.1021/acs.jcim.4c00089

      Supplementary Fig. 1, 2, 5, 9, 12, 20, 22, 24, and 26 in DOI: https://doi.org/10.1038/s41467-022-293313

      However, in response to the reviewers’ strong request, we present RMSF plots with error bars in our response letter. 

      Author response image 2.

      The root-mean-square fluctuation (RMSF) profiles of wild-type PDC-3 and its variants. Blue lines show the mean RMSF across 100 independent MD trajectories for each system; red translucent bands denote the standard deviation across trajectories. The Ω-loop (residues G183 to S226) is highlighted in yellow, and the R2-loop (residues L280 to Q310) is highlighted in blue.
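
      The mean and standard-deviation band described in this caption can be obtained by computing the RMSF separately for each trajectory and then summarizing across trajectories. The sketch below shows that idea for a single atom. It is an assumption-laden illustration rather than the authors' script: the coordinates are toy values, frames are assumed to be already aligned to a common reference, and the sample standard deviation is an assumed choice.

```python
import math
import statistics


def rmsf(positions):
    """RMSF of one atom from (x, y, z) positions in pre-aligned frames."""
    n = len(positions)
    mean = [sum(p[k] for p in positions) / n for k in range(3)]
    msd = sum(
        sum((p[k] - mean[k]) ** 2 for k in range(3)) for p in positions
    ) / n
    return math.sqrt(msd)


def rmsf_mean_sd(trajectories):
    """Mean and sample standard deviation of per-trajectory RMSF values."""
    values = [rmsf(traj) for traj in trajectories]
    return statistics.mean(values), statistics.stdev(values)


# Three hypothetical two-frame trajectories of a single C-alpha atom (angstroms).
toy_trajectories = [
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],  # RMSF = 1
    [(0.0, 0.0, 0.0), (0.0, 4.0, 0.0)],  # RMSF = 2
    [(0.0, 0.0, 0.0), (0.0, 0.0, 6.0)],  # RMSF = 3
]

mean_rmsf, sd_rmsf = rmsf_mean_sd(toy_trajectories)
```

      Applied per residue over 100 trajectories, the same two numbers give the line and the translucent band of a plot such as the one described above.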

      It was good to see that convergence of the constant-pH simulations was shown. While it can be challenging to get absolute pH values from the implicit solvent-based simulations, the differences between the systems are large and the trends appear significant. I was not clear how the starting coordinates were chosen for these simulations. Is the end point of the classical simulations, or is a representative snapshot chosen somehow?

      To ensure comparability, all systems used the X-ray crystal structure (PDB ID: 4HEF) with the T79A substitution as the initial structure. The E219K and Y221A mutants were generated in silico using the ICM mutagenesis module. We have added this clarification to the Methods section: “The starting structures were identical to those used for AB-MD.”

      Significant figures: Throughout the text and tables, the authors present data with more figures than are significant. 1071.81 ± 157.55 should be reported as 1100 ± 160 or 1070 ± 160. See the eLife guidelines for advice on this.

      Thank you for your suggestion. We have amended these now. 
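
      One common convention for the rounding requested here is to keep a fixed number of significant figures in the uncertainty and round the value to the same decimal place. The sketch below applies that convention; the helper names and the default of two significant figures are assumptions for illustration and are not taken from the eLife guidelines.

```python
import math


def round_with_uncertainty(value, uncertainty, sig_figs=2):
    """Round the uncertainty to `sig_figs` figures and the value to the same place."""
    if uncertainty <= 0:
        raise ValueError("uncertainty must be positive")
    exponent = math.floor(math.log10(uncertainty))
    # Digits kept after the decimal point; a negative number rounds to tens, hundreds, ...
    decimals = sig_figs - 1 - exponent
    return round(value, decimals), round(uncertainty, decimals), decimals


def format_with_uncertainty(value, uncertainty, sig_figs=2):
    """Format as 'value ± uncertainty' with matching decimal places."""
    v, u, decimals = round_with_uncertainty(value, uncertainty, sig_figs)
    shown = max(decimals, 0)
    return f"{v:.{shown}f} ± {u:.{shown}f}"


# The example quoted in the review.
pocket_volume = format_with_uncertainty(1071.81, 157.55)

# A second, hypothetical example with a sub-unit uncertainty.
small_value = format_with_uncertainty(3.14159, 0.0271)
```

      With two significant figures in the uncertainty, the reviewer's example becomes 1070 ± 160, one of the two forms suggested in the comment.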

      The manuscript is very long for the results presented, and I feel that a clearer story would come across if the authors shortened the text so that the main conclusions and results were not lost.

      We appreciate the suggestion. We examined the twenty most recent research articles published in eLife and found that they are either longer than or comparable in length to our manuscript.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review) :

      Comments on revisions:

      The revised manuscript has responded to the previous concerns of the reviewers, albeit modestly. The overemphasis on hypoxic adaptation of the clinical isolates persists as a key concern in the paper. The authors have compared the growth curves of each of the clinical and ATCC strains under normal and hypoxic conditions (Fig. 8), but don't show how mutations in some of the genes identified in Tn-seq would impact the growth phenotype under hypoxia. They largely base their arguments on previously published results.

      As I mentioned previously, the paper will be better without over-interpreting the TnSeq data in the context of hypoxia.

      Thank you for the comment on the issue of not determining the impact of individual gene mutations identified in TnSeq on the growth phenotypes under hypoxia.

      We agree that the lack of validation of the TnSeq results is a limitation of this study. Without evidence of the growth pattern of each gene-deletion mutant under hypoxia, there is a risk of over-interpreting the data, even though the data are carefully interpreted based on previous reports. We consider it necessary to confirm the phenomenon using knockout mutants.

      We have recently succeeded in constructing the vector plasmids for making knockout mutants of M. intracellulare (Tateishi et al., Microbiol Immunol, 2024). We will proceed to validate the TnSeq-hit genes by constructing knockout mutants. We have already mentioned this point as a limitation of this study in the Discussion (pages 35-36, lines 630-640 in the revised manuscript).

      Reference.

      Tateishi, Y., Nishiyama, A., Ozeki, Y. & Matsumoto, S. Construction of knockout mutants in Mycobacterium intracellulare ATCC13950 strain using a thermosensitive plasmid containing negative selection marker rpsL+. Microbiol Immunol 68, 339-347 (2024).

      Other points:

      The y-axis legends of plots in Fig.8c are illegible.

      Following the comment, we have corrected Figure 8c and checked the uploaded PDF.

      The statements in lines 376-389 are convoluted and need some explanation. If the clinical strains enter the log phase sooner than ATCC strain under hypoxia, then how come their growth rates (fig. 8c) are lower? Aren't they expected to grow faster?

      Thank you for the comment on the interpretation of the difference in bacterial growth under hypoxia between MAC-PD strains and the ATCC type strain. A growth curve is characterized by the onset of logarithmic growth and by the speed of that growth. In this study, we evaluated the former as the timing of the midpoint and the latter as the growth rate at the midpoint. These are independent parameters: early entry into log phase does not imply a fast growth rate at the midpoint.

      Our results demonstrated that 5 (M.i.198, M.i.27, M003, M019 and M021) out of 8 clinical MAC-PD strains entered log phase early and continued to grow logarithmically for a long time (slow growth). These data suggest the capacity of MAC-PD strains to continue replicating for a long time under hypoxic conditions. By contrast, the ATCC type strain showed a delayed onset of logarithmic growth caused by a prolonged lag phase. Once logarithmic growth started, its duration was short, and the log phase soon transitioned to the stationary phase. These data suggest a lower capacity of the ATCC strain to continue replicating under hypoxic conditions.

      Following the comment, we have added the interpretation of the growth curve pattern as follows (page 22, lines 379-392 in the revised manuscript): “The growth rate at midpoint under hypoxic conditions was significantly lower in these 5 clinical MAC-PD strains than in ATCC13950. The early entry to log phase followed by long-term logarithmic growth (slow growth rate at midpoint) suggests the capacity of these 5 clinical MAC-PD strains to continue replication for a long time under hypoxic conditions. On the other hand, the remaining 3 clinical MAC-PD strains (M018, M001 and MOTT64) did not show a significant change in the growth rate between aerobic and hypoxic conditions, suggesting that there are different levels of capacity for maintaining long-term replication under hypoxia among clinical MAC-PD strains. In ATCC13950, the entry to log phase was significantly delayed under 5% oxygen compared to aerobic conditions, and the growth rate at midpoint was significantly increased under hypoxic conditions compared to aerobic conditions. Such a long-term lag phase followed by a short-term log phase suggests a lower capacity of ATCC13950 to continue replication under hypoxic conditions compared to clinical MAC-PD strains.”
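
      The point that the timing of the midpoint and the growth rate at the midpoint are separate parameters can be seen in a simple logistic growth model. The sketch below is only an illustration: this excerpt does not state which curve model the authors fitted, so the logistic form, the parameter values, and the two strain labels are assumptions. For N(t) = K / (1 + exp(-r (t - t_mid))), the slope at the midpoint is K r / 4 and the time taken to rise from 10% to 90% of K is 2 ln(9) / r, neither of which depends on t_mid.

```python
import math


def logistic(t, K, r, t_mid):
    """Logistic growth curve with plateau K, steepness r and midpoint time t_mid."""
    return K / (1.0 + math.exp(-r * (t - t_mid)))


def rate_at_midpoint(K, r):
    """Slope dN/dt evaluated at t = t_mid."""
    return K * r / 4.0


def rise_time_10_to_90(r):
    """Time needed to grow from 10% to 90% of the plateau."""
    return 2.0 * math.log(9.0) / r


# Hypothetical parameters (time in days, plateau normalized to 1).
early_slow = {"K": 1.0, "r": 0.4, "t_mid": 5.0}   # early entry, prolonged log phase
late_fast = {"K": 1.0, "r": 1.2, "t_mid": 10.0}   # delayed entry, brief log phase

early_rate = rate_at_midpoint(early_slow["K"], early_slow["r"])
late_rate = rate_at_midpoint(late_fast["K"], late_fast["r"])
early_duration = rise_time_10_to_90(early_slow["r"])
late_duration = rise_time_10_to_90(late_fast["r"])
```

      In this toy comparison the curve with the earlier midpoint has the lower midpoint growth rate and a log phase about three times longer, which is the pattern the response describes for the five clinical strains relative to the type strain.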

      Reviewer #4 (Public review):

      Comments on revisions:

      The revised version has satisfactorily addressed my initial comments in the discussion section.

      The authors thank the Reviewer for understanding our reply.

      Reviewer #5 (Public review):

      Comments on revisions:

      There is quite a lot of data and this could have been a really impactful study if the authors had channelized the Tn mutagenesis by focusing on one pathway or network. It looks scattered. However, from the previous version, the authors have made significant improvements to the manuscript and have provided comments that fairly address my questions.

      The authors thank the Reviewer for understanding our reply and for the comments suggesting future TnSeq studies that focus on one pathway or network.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review)

      (1) This manuscript addresses an important problem of the uncoupling of oxidative phosphorylation due to hypoxia-ischemia injury of the neonatal brain and provides insight into the neuroprotective mechanisms of hypothermia treatment.

      The authors used a combination of in vivo imaging of awake P10 mice and experiments on isolated mitochondria to assess various key parameters of the brain metabolism during hypoxia-ischemia with and without hypothermia treatment. This unique approach resulted in a comprehensive data set that provides solid evidence for the derived conclusions.

      We thank the reviewer for the positive feedback.

      (2) The experiments were performed acutely on the same day when the surgery was performed. There is a possibility that the physiology of mice at the time of imaging was still affected by the previously applied anesthesia. This is particularly of concern since the duration of anesthesia was relatively long. Is it possible that the observed relatively low baseline OEF (~20%) and trends of increased OEF and CBF over several hours after the imaging start were partially due to slow recovery from prolonged anesthesia? The potential effects of long exposure to anesthesia before imaging experiments were not discussed.

      We thank the reviewer for this important comment and for pointing out the potential influence of anesthesia on the physiological state of the animals. We apologize for any confusion. To clarify, all PAM imaging experiments were conducted in awake animals. Isoflurane anesthesia was used only during two brief surgical procedures: (1) the installation of the head-restraint plastic head plate and (2) the right common carotid artery (CCA) ligation. Each anesthesia session lasted less than 20 minutes.

      We have revised the Methods section to provide additional details:

      For the subsection Procedures for PAM Imaging on page 17, we clarified the sequence of procedures during the head plate installation, as well as the corresponding anesthesia duration:

      “After the applied glue was solidified (~20 min), the animal was first returned to its cage for full recovery from anesthesia, and then carefully moved to the treadmill and secured to the metal arm-piece with two #4–40 screws for awake PAM imaging. The total duration of anesthesia, including preparation and glue solidification, was approximately 20 minutes.”

      For the subsection Neonatal Cerebral HI and Hypothermia Treatment on page 19, we also clarified the CCA ligation procedure:

      “Briefly, P10 mice of both sexes anesthetized with 2% isoflurane were subjected to the right CCA-ligation. To manage pain, 0.25% Bupivacaine was administered locally prior to the surgical procedures, which took less than 10 minutes. After a recovery period for one hour, awake mice were exposed to 10% O<sub>2</sub> for 40 minutes in a hypoxic chamber at 37 °C.”

      Regarding the reviewer’s concern about the observed trends in OEF and CBF, we agree that residual effects of anesthesia could, in principle, influence physiological parameters. However, we believe this is unlikely in this study for the following reasons. First, all imaging was conducted in awake animals after a clearly defined recovery period. Second, the trend of increasing OEF and CBF over time was consistent across animals and aligned with expected physiological responses following hypoxic-ischemic injury. In particular, the relatively low baseline OEF (0.21 at 37°C) is consistent with our previous study (0.25; (Cao et al., 2018)). The gradual increase in CBF and OEF reflects metabolic compensation and reperfusion following hypoxia-ischemia, as previously described (Lin and Powers, 2018). Therefore, we believe the observed changes are of physiological origin rather than anesthesia-related artifacts.
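
      For readers less familiar with these quantities, the sketch below shows how OEF follows from arterial and venous oxygen saturation and how, through the Fick principle, a rise in CBF and OEF translates into a rise in oxygen consumption. It is a generic illustration, not the authors' CMRO<sub>2</sub> algorithm: the saturation and flow values are hypothetical, and hemoglobin concentration and oxygen-binding capacity are assumed constant so that only a relative CMRO<sub>2</sub> is computed.

```python
def oxygen_extraction_fraction(sa_o2, sv_o2):
    """OEF = (SaO2 - SvO2) / SaO2, with saturations given as fractions."""
    if not 0.0 < sa_o2 <= 1.0 or not 0.0 <= sv_o2 <= sa_o2:
        raise ValueError("saturations must satisfy 0 <= SvO2 <= SaO2 <= 1")
    return (sa_o2 - sv_o2) / sa_o2


def relative_cmro2(cbf, oef, sa_o2):
    """Quantity proportional to CMRO2 via the Fick principle (CBF x CaO2 x OEF).

    Arterial oxygen content is taken as proportional to SaO2; hemoglobin
    concentration and binding capacity are assumed constant and omitted.
    """
    return cbf * sa_o2 * oef


# Hypothetical baseline: normalized CBF of 1.0, SaO2 = 0.95, SvO2 = 0.75.
baseline_oef = oxygen_extraction_fraction(0.95, 0.75)
baseline_cmro2 = relative_cmro2(1.0, baseline_oef, 0.95)

# Hypothetical later time point: CBF up 20%, venous saturation down to 0.65.
later_oef = oxygen_extraction_fraction(0.95, 0.65)
later_cmro2 = relative_cmro2(1.2, later_oef, 0.95)

cmro2_ratio = later_cmro2 / baseline_cmro2
```

      With these illustrative numbers the baseline OEF is about 0.21, and the combined increase in CBF and OEF raises the relative CMRO<sub>2</sub> by a factor of 1.8.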

      (3) The Methods Section does not provide information about drugs administered to reduce the pain. If pain was not managed, mice could be experiencing significant pain during experiments in the awake state after the surgery. Since the imaging sessions were long (my impression based on information from the manuscript is that imaging sessions were ~4 hours long or even longer), the level of pain was also likely to change during the experiments. It was not discussed how significant and potentially evolving pain during imaging sessions could have affected the measurements (e.g., blood flow and CMRO<sub>2</sub>). If mice received pain management during experiments, then it was not discussed if there are known effects of used drugs on CBF, CMRO<sub>2</sub>, and lesion size after 24 hr.

      We thank the reviewer for this valuable comment regarding pain management. We confirm that local analgesia was administered to all animals prior to surgical procedures. Specifically, 0.25% Bupivacaine was applied locally before both the head-restraint plate installation and the CCA ligation. These details have now been clarified in the Methods section:

      For the subsection Procedures for PAM Imaging on page 16, we added:

      “To manage pain, 0.25% Bupivacaine was administered locally prior to the surgical procedures.”

      For the subsection Neonatal Cerebral HI and Hypothermia Treatment on page 18, we added:

      “To manage pain, 0.25% Bupivacaine was administered locally prior to the surgical procedures, which took less than 10 minutes.”

      To our knowledge, Bupivacaine has minimal systemic effects at the dose used and is unlikely to significantly alter CBF, CMRO<sub>2</sub>, or lesion development (Greenberg et al., 1998). No other analgesics (e.g., NSAIDs or opioids) were administered unless distress symptoms were observed—which did not occur in this study.

      Additionally, although imaging sessions were extended (up to 2 hours), animals remained calm and showed no signs of pain or distress during or after the procedures. Throughout the experimental period (up to 24 hours post-surgery), animals were monitored for signs of discomfort (e.g., abnormal activity, breathing, or weight gain), but no additional analgesia was required. The neonatal HI procedures are considered minimally invasive, and based on our protocol and prior experience, local Bupivacaine provides effective analgesia during and after the brief surgeries. We have added a corresponding note in the Discussion section (newly added subsection: Limitations in this study, the last paragraph) on page 15:

      “We observed no signs of distress or pain and did not use stress- or pain-reducing drugs during imaging. However, potential effects of stress or residual pain on CBF and CMRO<sub>2</sub> cannot be fully ruled out. Future studies could incorporate more detailed pain assessment and stress-mitigation strategies to further enhance physiological reliability.”

      (4) Animals were imaged in the awake state, but they were not previously trained for the imaging procedure with head restraint. Did animals receive any drugs to reduce stress? Our experience with well-trained young-adult as well as old mice is that they can typically endure 2 and sometimes up to 3 hours of head-restrained awake imaging with intermittent breaks for receiving the rewards before showing signs of anxiety. We do not have experience with imaging P10 mice in the awake state. Is it possible that P10 mice were significantly stressed during imaging and that their stress level changed during the imaging session? This concern about the potential effects of stress on the various measured parameters was not discussed.

      We thank the reviewer for this important comment regarding the potential effects of stress during awake imaging. The neonatal mice used in our study were P10, a stage at which animals are still physiologically immature and relatively inactive. Due to their small size and limited mobility, these animals did not struggle or show signs of distress during the imaging sessions. All animals remained calm and stable throughout the procedure, and no stress-reducing drugs were administered.

      We agree that, unlike older animals, P10 mice are not amenable to prior behavioral training. However, their underdeveloped motor activity and natural docility at this stage allowed for stable head-restrained imaging without inducing overt stress responses. Although no behavioral signs of stress were observed, we acknowledge that subtle physiological effects cannot be entirely excluded. We have added a brief discussion in the Discussion section (newly added subsection: Limitations in this study, the last paragraph) on page 15:

      “Lastly, for awake imaging, the small size of neonatal mice at P10 aids stability during awake PAM imaging, though it limits the feasibility of prior training, which is typically possible in older animals.”

      (5) The temperature of the skull was measured during the hypothermia experiment by lowering the water temperature in the water bath above the animal's head. Considering high metabolism and blood flow in the cortex, it could be challenging to predict cortical temperature based on the skull temperature, particularly in the deeper part of the cortex.

      We thank the reviewer for this helpful comment and for highlighting an important technical consideration. We acknowledge that we did not directly measure intracortical tissue temperature during the hypothermia experiments. While we recognize that relying on skull temperature may have limitations—particularly in reflecting temperature changes in deeper cortical regions—this approach is consistent with clinical practice, where intracortical temperature is typically not measured. Moreover, prior studies have shown that skull or brain surface temperature generally reflects cortical thermal dynamics to a reasonable extent under controlled conditions (Kiyatkin, 2007). We have added the following note in the Discussion section (newly added subsection: Limitations in this study, the 2<sup>nd</sup> paragraph) on page 14:

      “A technical limitation is the absence of direct intracortical temperature measurements during hypothermia; we relied on skull temperature, which may not fully capture temperature dynamics in deeper cortical layers. However, this approach aligns with clinical practice, where intracortical temperature is not typically measured. Future studies could benefit from more precise intracortical assessments.”

      (6) The map of estimated CMRO<sub>2</sub> (Fig. 4B) looks very heterogeneous across the brain surface. Is it a coincidence that the highest CMRO<sub>2</sub> is observed within the central part of the field of view? Is there previous evidence that CMRO<sub>2</sub> in these parts of the mouse cortex could vary a few folds over a 1-2 mm distance?

      We appreciate the reviewer’s insightful observation regarding the spatial heterogeneity observed in the estimated CMRO<sub>2</sub> map (Fig. 4B). This heterogeneity is not a result of scanning bias, as uniform contour scanning was performed across the entire field of view. The higher CMRO<sub>2</sub> values observed in the central region are unlikely to be artifacts and more likely reflect underlying physiological variability.

      Our CMRO<sub>2</sub> estimation is based on an algorithm we previously developed and validated in other tissues. Specifically, we have successfully applied this algorithm to assess oxygen metabolism in the mouse kidney (Sun et al., 2021) and to monitor vascular adaptation and tissue oxygen metabolism during cutaneous wound healing (Sun et al., 2022). These studies demonstrated the algorithm's capability to capture spatial variations in oxygen metabolism. Although the current application to the brain is novel, the algorithm has been validated in controlled experimental settings and shown to produce consistent results. We acknowledge that the observed range of CMRO<sub>2</sub> appears relatively broad across a 1–2 mm distance; however, such heterogeneity may arise from local differences in vascular density, metabolic demand, or tissue oxygenation — all of which can vary across cortical regions, even within small spatial scales. We have added a brief note in the Discussion (Subsection: Optical CMRO<sub>2</sub> detection in neonatal care) on page 13 to acknowledge this point:

      “Additionally, the spatial heterogeneity in estimated CMRO<sub>2</sub> observed in our data may reflect underlying physiological variability, including differences in vascular structure or metabolic demand across cortical regions. Future studies will aim to further validate and interpret these spatial patterns.”

      (7) The justification for using P10 mice in the experiments has not been well presented in the manuscript.

      We thank the reviewer for pointing out the need to clarify our choice of developmental stage. We chose P10 mice for our hypoxia-ischemia injury model because this stage is widely recognized as developmentally comparable to human term infants in terms of brain maturation. This approach has been validated by several previous studies (Clancy et al., 2007; Mallard and Vexler, 2015; Sheldon et al., 2018). We have added the following clarification to the Methods section (Subsection: Neonatal Cerebral HI and Hypothermia Treatment) on page 18:

      “P10 mice were chosen for our experiments as they are widely used to model near-term infants in humans. At this developmental stage, the brain maturation in mice closely parallels that of near-term infants, making them an appropriate model for studying neonatal brain injury and therapeutic interventions (Clancy et al., 2007; Mallard and Vexler, 2015; Sheldon et al., 2018).”

      (8) It was not discussed how the observations made in this manuscript could be affected by the potential discrepancy between the developmental stages of P10 mice and human babies regarding cellular metabolism and neurovascular coupling.

      We thank the reviewer for raising this important point regarding developmental differences between P10 mice and human infants. We have discussed this issue by adding the following statement to the Discussion section (newly added subsection: Limitations in this study, the 1<sup>st</sup> paragraph) on page 15, where we summarize the overall study design and model selection:

      “While P10 mice are widely used to model near-term human infants, developmental differences in cellular metabolism and neurovascular coupling may affect the observed outcomes and limit direct clinical translation (Clancy et al., 2007; Mallard and Vexler, 2015; Sheldon et al., 2018). Nevertheless, the P10 model remains a valuable and widely accepted tool for studying neonatal hypoxia-ischemia mechanisms and evaluating therapeutic interventions.”

      (9) Regarding the brain temperature measurements, the authors should use a new cohort of mice, implant the miniature thermocouples 1 mm, 0.5 mm, and immediately below the skull in different mice, and verify the temperature in the brain cortex under conditions applied in the experiments. The same approach could be applied to a few mice undergoing 4-hr-long hypothermia treatment in a chamber, which will provide information about the brain temperature that resulted in observed protection from the injury.

      We thank the reviewer for this helpful recommendation. We fully agree that direct intracortical temperature measurement would provide more accurate insight into thermal dynamics during hypothermia treatment. However, the primary aim of this study was not to characterize the precise intracortical temperature response under hypothermic conditions, but rather to examine the effects of hypothermia on CMRO<sub>2</sub> and mitochondrial function. Due to the substantial time and resources required to perform direct intracortical temperature monitoring—and considering the technical focus of the current work—we respectfully suggest reserving such investigations for a future study specifically focused on thermal dynamics in hypoxia-ischemia models.

      We have acknowledged this limitation in the subsection Limitations in this study of the Discussion on page 15, noting that skull temperature was used as an approximation of brain temperature and that this approach is consistent with clinical practice, where intracortical temperature is typically not measured. We also note that future studies may benefit from more precise assessments using intracortical probes.

      (10) The mean values presented in Fig. 4G are much lower than the peak values in the 2D panels and potentially were calculated as the average values over the entire field of view. Please provide more details on how CMRO<sub>2</sub> was estimated and if the validity of the measurements is expected across the entire field of view. If there are parts of the field of view where the estimation of CMRO<sub>2</sub> is more reliable for technical reasons, maybe one way to compute the mean values is to restrict the usable data to the more centralized part of the field of view.

      We thank the reviewer for this thoughtful comment. We confirm that CMRO<sub>2</sub> values shown in Figure 4G were calculated as spatial averages over the entire field of view (FOV; ~5 × 3 mm<sup>2</sup>) encompassing both hemicortices, as shown in Figure 1C. Regarding the observed CMRO<sub>2</sub> values, the apparent difference likely reflects a comparison between two different post-HI time points. Specifically, the ~0.5 value shown for the 37°C ipsilateral group in Figure 4G reflects the average CMRO<sub>2</sub> measured 24 hours after HI, while the ~1.5 value in Figure 2D (red line) corresponds to CMRO<sub>2</sub> during the early 0–2 hour post-HI period. The temporal difference accounts for the apparent discrepancy in magnitude. We understand the importance of consistency across the field of view and have clarified this point in the subsection Procedures for PAM Imaging in the Methods on page 17: “For the imaging field covering both hemicortices between the Bregma and Lambda of the neonatal mouse (5 × 3 mm<sup>2</sup> as shown in Figure 1C, with each hemicortex measuring 2.5 × 3 mm<sup>2</sup>)”, as well as in the Figure 4 legend on page 34: “Correlation of CMRO<sub>2</sub> and post-HI brain infarction in mouse neonates at 24 hours”.

      In our model and setup, CMRO<sub>2</sub> estimation is spatially robust across the FOV under standard imaging conditions. We recognize, however, that certain peripheral regions may be more prone to signal attenuation. Future refinement of region selection could further improve spatial averaging strategies. For the current study, full-FOV averaging was used consistently across all groups to maintain comparability.
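
      As a rough illustration of why a full-FOV mean can sit well below the peak values seen in a 2D map, the following minimal Python sketch compares a full-FOV average with a central-region average. The map values, the function names `fov_mean` and `central_roi_mean`, and the `fraction` parameter are all hypothetical and not taken from the study; the central-region variant only mirrors the reviewer's suggestion and was not used in the analysis described above.

```python
def fov_mean(cmro2_map):
    """Average every pixel of a 2D map (list of rows), i.e. a full-FOV mean."""
    values = [v for row in cmro2_map for v in row]
    return sum(values) / len(values)


def central_roi_mean(cmro2_map, fraction=0.5):
    """Average only a centered sub-window covering `fraction` of each axis.

    Hypothetical alternative to full-FOV averaging; `fraction` is an
    illustrative parameter, not a value from the manuscript.
    """
    rows, cols = len(cmro2_map), len(cmro2_map[0])
    keep_r = max(1, round(rows * fraction))
    keep_c = max(1, round(cols * fraction))
    r0 = (rows - keep_r) // 2
    c0 = (cols - keep_c) // 2
    roi = [row[c0:c0 + keep_c] for row in cmro2_map[r0:r0 + keep_r]]
    return fov_mean(roi)


# Toy map with made-up numbers: a high-valued center and a low-valued periphery.
toy_map = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 2.0, 2.0, 0.5],
    [0.5, 2.0, 2.0, 0.5],
    [0.5, 0.5, 0.5, 0.5],
]

peak = max(v for row in toy_map for v in row)
full = fov_mean(toy_map)
center = central_roi_mean(toy_map, fraction=0.5)
print(peak, full, center)
```

      In this toy case the full-FOV mean is pulled down by the periphery, whereas the central-region mean equals the peak. Whichever window is chosen, applying it identically to all groups keeps the comparison consistent.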

      (11) Minor: Results presented in Supplementary Tables have too many significant digits.

      Thank you for the helpful suggestion. We have revised Supplementary Tables S1 and S2 to reduce the number of significant digits and improve clarity.

      Reviewer #2 (Public review)

      (1) In this study, the authors have hypothesized that mitochondrial injury in HIE is caused by OXPHOS-uncoupling, which is the cause of secondary energy failure in HI. In addition, therapeutic hypothermia rescues secondary energy failure. The methodologies used are state-of-the-art and include the PAM technique in live animals, bioenergetic studies in isolated mitochondria, and others.

      The study is comprehensive and impressive. The article is well written and statistical analyses are appropriate.

      We thank the reviewer for the positive feedback.

      (2) The manuscript does not discuss the limitation of this animal model study in view of the clinical scenario of neonatal hypoxia-ischemia.

      We thank the reviewer for this valuable feedback. In response, we have added a dedicated “Limitations in this study” subsection in the Discussion, where we address the potential limitations of this animal model in the context of the clinical scenario of neonatal hypoxia-ischemia in the first paragraph on page 14, including the developmental differences between P10 mice and human infants.

      (3) I see many studies on Pubmed on bioenergetics and HI. Hence, it is unclear what is novel and what is known.

      We thank the reviewer for this important comment regarding the novelty of our study in the context of existing research on bioenergetics and hypoxia-ischemia (HI). To better clarify the novel aspects of our work, we have highlighted the relevant content in the Abstract (page 4) and Introduction (page 5). Specifically, while many studies have explored HI-related bioenergetic dysfunction, the mechanisms by which therapeutic hypothermia modulates CMRO<sub>2</sub> and mitochondrial function post-HI remain poorly understood.

      Abstract on page 4: “However, it is unclear how post-HI hypothermia helps to restore the balance, as cooling reduces CMRO<sub>2</sub>. Also, how transient HI leads to secondary energy failure (SEF) in neonatal brains remains elusive. Using photoacoustic microscopy, we examined the effects of HI on CMRO<sub>2</sub> in awake 10-day-old mice, supplemented by bioenergetic analysis of purified cortical mitochondria.”

      Introduction on page 5: “The use of awake mouse neonates avoided the confounding effects of anesthesia on CBF and CMRO<sub>2</sub> (Cao et al., 2017; Gao et al., 2017; Sciortino et al., 2021; Slupe and Kirsch, 2018). In addition, we measured the oxygen consumption rate (OCR), reactive oxygen species (ROS), and the membrane potential of mitochondria that were immediately purified from the same cortical area imaged by PAM. This dual-modal analysis enabled a direct comparison of cerebral oxygen metabolism and cortical mitochondrial respiration in the same animal. Moreover, we compared the effects of therapeutic hypothermia on oxygen metabolism and mitochondrial respiration, and correlated the extent of CMRO<sub>2</sub>-reduction with the severity of infarction at 24 hours after HI. Our results suggest that blocking HI-induced OXPHOS-uncoupling is an acute effect of hypothermia and that optical detection of CMRO<sub>2</sub> may have clinical applications in HIE.”

      In this study, we propose that uncoupled oxidative phosphorylation (OXPHOS) underlies the secondary energy failure observed after HI, and we demonstrate that hypothermia suppresses this pathological CMRO<sub>2</sub> surge, thereby protecting mitochondrial integrity and preventing injury. Additionally, our use of photoacoustic microscopy (PAM) in awake neonatal mice represents a novel, non-invasive approach to track cerebral oxygen metabolism, with potential clinical relevance for guiding hypothermia therapy.

      (4) What are the limitations of ex-vivo mitochondrial studies?

      We thank the reviewer for this insightful comment. We acknowledge that ex-vivo mitochondrial assays do not fully replicate in vivo physiological conditions, as they lack systemic factors such as blood flow, cellular interactions, and intact tissue architecture. However, these assays are well-established and widely accepted in the field for evaluating mitochondrial function under controlled conditions (Caspersen et al., 2008; Niatsetskaya et al., 2012). Despite their limitations, they enable direct comparisons of mitochondrial activity across experimental groups and provide valuable mechanistic insights that complement in vivo observations.

      (5) PAM technique limits the resolution of the image beyond 500-750 micron depth. Assessing basal ganglia may not be possible with this approach?

      We thank the reviewer for this important comment. We agree that the imaging depth of PAM is limited and may not allow assessment of deeper brain structures such as the basal ganglia. However, in our neonatal HI model—as in many clinical cases of HIE—cortical injury is typically more severe and represents a major focus for mechanistic and therapeutic investigations. The cortical regions assessed with PAM are thus highly relevant to the pathophysiology of neonatal HI. We have now acknowledged this depth limitation in the third paragraph of the newly added Limitations in this study subsection of the Discussion on page 15:

      “Another limitation of this study is the restricted imaging depth of the PAM technique, which is typically less than 1 mm and therefore does not allow assessment of deeper brain structures such as the basal ganglia. However, in both our neonatal HI model and most clinical cases of neonatal hypoxia-ischemia, cortical injury tends to be more prominent and functionally significant. As such, our cortical measurements remain highly relevant for investigating the mechanisms of injury and evaluating therapeutic interventions.”

      (6) Hypothermia in the present study reduces the brain temperature from 37 to 29–32 degrees centigrade. In the clinical setup, head temperature is reduced to 33–34.5 in neonatal hypoxia-ischemia. Hence, a drop in temperature to 29 degrees is much lower relative to clinical practice. How can the present study, with its greater drop in head temperature, be interpreted for understanding the pathophysiology of therapeutic hypothermia in neonatal HIE? Moreover, in the HIE model, using a higher temperature of 37 and dropping to 29 seems to be much different from the clinical scenario. Please discuss.

      We thank the reviewer for raising this important point regarding temperature ranges in our study. In Figure 1, we used a broader temperature range (down to 29°C) to explore the general relationship between temperature and CMRO<sub>2</sub> in uninjured neonatal mice. This experiment was not intended to model therapeutic hypothermia directly, but rather to characterize the baseline physiological responses.

      For all experiments involving hypothermia as a therapeutic intervention following HI, we consistently maintained a brain temperature of 32°C, which is just below the clinically accepted mild hypothermia range for neonatal HIE (typically 33–34.5°C). We believe this temperature closely mimics clinical practice and supports the translational relevance of our findings.

      (7) NMR was assessed ex-vivo. How does it relate to in vivo assessment? Infants admitted to the neonatal intensive care unit frequently get MRI with spectroscopy. How do the MRS findings in human newborns with HIE correlate with the ex-vivo evaluation of metabolites?

      We thank the reviewer for this insightful question. While our study assessed brain metabolites ex vivo, similar metabolic changes have been observed in vivo using proton magnetic resonance spectroscopy (¹H-MRS) in infants with HIE. Specifically, reductions in N-acetylaspartate (NAA) — a marker of neuronal integrity — have been reported in neonates with severe brain injury, aligning with our ex vivo findings. This correlation between in vivo and ex vivo assessments supports the translational relevance of our model for studying metabolic disruption in neonatal HIE. We have added this point to the subsection Using Optically Measured CMRO<sub>2</sub> to Detect Neonatal HI Brain Injury of the Results on page 8, along with a supporting clinical reference (Lally et al., 2019):

      “In addition, in vivo proton MRS in infants with HIE has also shown a reduction in NAA, particularly in cases of severe injury (Lally et al., 2019). This reduction in NAA, observed in neonatal intensive care settings, reflects neuronal and axonal loss or dysfunction and serves as a biomarker for injury severity. The alignment between our ex vivo observations and in vivo MRS findings in clinical studies reinforces the translational relevance of our model for investigating metabolic disturbances in neonatal HIE.”

      Reviewer #3 (Public review)

      (1) Sun et al. present a comprehensive study using a novel photoacoustic microscopy setup and mitochondrial analysis to investigate the impact of hypoxia-ischemia (HI) on brain metabolism and the protective role of therapeutic hypothermia. The authors elegantly demonstrate three connected findings: (1) HI initially suppresses brain metabolism, (2) subsequently triggers a metabolic surge linked to oxidative phosphorylation uncoupling and brain damage, and (3) therapeutic hypothermia mitigates HI-induced damage by blocking this surge and reducing mitochondrial stress.

      The study's design and execution are great, with a clear presentation of results and methods. Data is nicely presented, and methodological details are thorough.

      We thank the reviewer for the positive feedback.

      (2) However, a minor concern is the extensive use of abbreviations, which can hinder readability. Although all the abbreviations are introduced in the text, their overuse may render the text hard to read for non-specialist audiences. Additionally, sharing the custom MATLAB and other software scripts online, particularly those used for blood vessel segmentation, would be a valuable resource for the scientific community. In addition, while the study focuses on the short-term effects of HI, exploring the long-term consequences and definitively elucidating HI's impact on mitochondria would further strengthen the manuscript's impact.

      We thank the reviewer for these valuable suggestions. Please find our point-by-point responses below:

      Abbreviations: To improve readability, we have added a List of Abbreviations on page 3 to help readers, especially non-specialists, navigate the terminology more easily.

      MATLAB Code Availability: The methodology for blood vessel segmentation was described in detail in our previous publication (Sun et al., 2020). We have now updated the subsection Quantification of Cerebral Hemodynamics and Oxygen Metabolism by PAM of the Methods on page 18 to provide additional details and have indicated that the MATLAB scripts are available upon request.

      “Briefly, this process involves generating a vascular map using signal amplitude from the Hilbert transformation, selecting a region slightly larger than the vessel of interest, and applying Otsu’s thresholding method to remove background pixels. Isolated or spurious boundary fragments are then removed to improve boundary smoothness. The customized MATLAB code used for vessel segmentation is available upon request.”
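
      The segmentation steps quoted above can be sketched roughly as follows. This is a minimal, dependency-free Python illustration and not the authors' MATLAB code (which is available from them upon request): it assumes the amplitude map has already been computed and cropped to a region around the vessel, implements a plain histogram-based Otsu threshold, and treats "isolated or spurious fragments" as 4-connected components smaller than a hypothetical `min_pixels` cutoff. The function names, the bin count, and the toy image are all assumptions made for illustration.

```python
from collections import deque


def otsu_threshold(values, bins=64):
    """Return the threshold that maximizes between-class variance (Otsu's method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo  # flat image: no meaningful split
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    total = len(values)
    sum_all = sum(c * h for c, h in zip(centers, hist))
    best_score, best_t = -1.0, lo
    w0, sum0 = 0, 0.0
    for i in range(bins - 1):
        w0 += hist[i]
        sum0 += centers[i] * hist[i]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        score = w0 * w1 * (mu0 - mu1) ** 2  # proportional to between-class variance
        if score > best_score:
            best_score, best_t = score, lo + (i + 1) * width  # upper edge of bin i
    return best_t


def segment_vessel(amplitude, min_pixels=3):
    """Threshold a cropped amplitude map and drop small isolated fragments."""
    rows, cols = len(amplitude), len(amplitude[0])
    t = otsu_threshold([v for row in amplitude for v in row])
    mask = [[amplitude[r][c] > t for c in range(cols)] for r in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Collect one 4-connected component with a breadth-first search.
            component, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) < min_pixels:  # hypothetical size cutoff
                for y, x in component:
                    mask[y][x] = False
    return mask


# Toy cropped region: a bright vertical "vessel" plus one spurious bright pixel.
toy = [[1.0] * 5 for _ in range(5)]
for r in range(5):
    toy[r][2] = 10.0
toy[0][0] = 10.0
vessel_mask = segment_vessel(toy, min_pixels=3)
```

      In the toy example, the bright column is retained as the vessel while the single bright pixel in the corner is discarded as a spurious fragment.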

      Long-Term Effects of Hypothermia: We agree that exploring long-term outcomes would enhance the broader impact of this research. While our study focuses on the acute phase following HI, prior studies have shown long-term neuroprotective benefits of therapeutic hypothermia, such as enhanced white matter development (Koo et al., 2017). We have added this point to the fourth paragraph in the subsection Limitations in this study of the Discussion on page 15:

      “While our study focuses on the acute effects of hypothermia, previous research has shown long-term neuroprotective benefits, including improved white matter development post-injury (Koo et al., 2017). These findings highlight hypothermia's potential for both immediate and extended recovery, warranting further study of long-term outcomes.”

      (3) Extensive use of abbreviations.

      Thank you for the helpful suggestion. To improve readability for a broader audience, we have added a List of Abbreviations on page 3 of the manuscript to assist readers in navigating terminology used throughout the text. This has been included as Response #2 to Reviewer #3.

      (4) Share code used to conduct the study.

      Thank you for the suggestion. The methodology for vessel segmentation was previously published (Sun et al., 2020), and we have noted in the subsection Quantification of Cerebral Hemodynamics and Oxygen Metabolism by PAM of the Methods on page 18 that the MATLAB code is available upon request. This has also been included as Response #2 to Reviewer #3.

      References:

      Cao R, Li J, Kharel Y, Zhang C, Morris E, Santos WL, Lynch KR, Zuo Z, Hu S. 2018. Photoacoustic microscopy reveals the hemodynamic basis of sphingosine 1-phosphate-induced neuroprotection against ischemic stroke. Theranostics 8:6111–6120. doi:10.7150/thno.29435

      Caspersen CS, Sosunov A, Utkina-Sosunova I, Ratner VI, Starkov AA, Ten VS. 2008. An Isolation Method for Assessment of Brain Mitochondria Function in Neonatal Mice with Hypoxic-Ischemic Brain Injury. Developmental Neuroscience 30:319–324. doi:10.1159/000121416

      Clancy B, Kersh B, Hyde J, Darlington RB, Anand KJS, Finlay BL. 2007. Web-based method for translating neurodevelopment from laboratory species to humans. Neuroinformatics 5:79–94. doi:10.1385/ni:5:1:79

      Greenberg RS, Zahurak M, Belden C, Tunkel DE. 1998. Assessment of oropharyngeal distance in children using magnetic resonance imaging. Anesth Analg 87:1048–1051. doi:10.1097/00000539-199811000-00014

      Kiyatkin EA. 2007. Brain temperature fluctuations during physiological and pathological conditions. Eur J Appl Physiol 101:3–17. doi:10.1007/s00421-007-0450-7

      Koo E, Sheldon RA, Lee BS, Vexler ZS, Ferriero DM. 2017. Effects of therapeutic hypothermia on white matter injury from murine neonatal hypoxia-ischemia. Pediatr Res 82:518–526. doi:10.1038/pr.2017.75

      Lally PJ, Montaldo P, Oliveira V, Soe A, Swamy R, Bassett P, Mendoza J, Atreja G, Kariholu U, Pattnayak S, Sashikumar P, Harizaj H, Mitchell M, Ganesh V, Harigopal S, Dixon J, English P, Clarke P, Muthukumar P, Satodia P, Wayte S, Abernethy LJ, Yajamanyam K, Bainbridge A, Price D, Huertas A, Sharp DJ, Kalra V, Chawla S, Shankaran S, Thayyil S, MARBLE consortium. 2019. Magnetic resonance spectroscopy assessment of brain injury after moderate hypothermia in neonatal encephalopathy: a prospective multicentre cohort study. Lancet Neurol 18:35–45. doi:10.1016/S1474-4422(18)30325-9

      Lin W, Powers WJ. 2018. Oxygen metabolism in acute ischemic stroke. J Cereb Blood Flow Metab 38:1481–1499. doi:10.1177/0271678X17722095

      Mallard C, Vexler Z. 2015. Modeling ischemia in the immature brain: how translational are animal models? Stroke 46:3006–3011. doi:10.1161/STROKEAHA.115.007776

      Niatsetskaya ZV, Sosunov SA, Matsiukevich D, Utkina-Sosunova IV, Ratner VI, Starkov AA, Ten VS. 2012. The Oxygen Free Radicals Originating from Mitochondrial Complex I Contribute to Oxidative Brain Injury Following Hypoxia–Ischemia in Neonatal Mice. J Neurosci 32:3235–3244. doi:10.1523/JNEUROSCI.6303-11.2012

      Sheldon RA, Windsor C, Ferriero DM. 2018. Strain-Related Differences in Mouse Neonatal Hypoxia-Ischemia. Dev Neurosci 40:490–496. doi:10.1159/000495880

      Sun N, Bruce AC, Ning B, Cao R, Wang Y, Zhong F, Peirce SM, Hu S. 2022. Photoacoustic microscopy of vascular adaptation and tissue oxygen metabolism during cutaneous wound healing. Biomed Opt Express 13:2695–2706. doi:10.1364/BOE.456198

      Sun N, Ning B, Bruce AC, Cao R, Seaman SA, Wang T, Fritsche-Danielson R, Carlsson LG, Peirce SM, Hu S. 2020. In vivo imaging of hemodynamic redistribution and arteriogenesis across microvascular network. Microcirculation 27:e12598. doi:10.1111/micc.12598

      Sun N, Zheng S, Rosin DL, Poudel N, Yao J, Perry HM, Cao R, Okusa MD, Hu S. 2021. Development of a photoacoustic microscopy technique to assess peritubular capillary function and oxygen metabolism in the mouse kidney. Kidney International 100:613–620. doi:10.1016/j.kint.2021.06.018

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      In this study, the authors identified an insect salivary protein, LssaCA, that participates in initial viral infection in the plant host. LssaCA directly bound to the RSV nucleocapsid protein and then interacted with a rice OsTLP that possesses endo-β-1,3-glucanase activity to enhance OsTLP enzymatic activity and degrade callose induced by insect feeding. The manuscript suffers from fundamental logical issues, making its central narrative highly unconvincing.

      (1) These results suggested that LssaCA promoted RSV infection through a mechanism occurring not in insects or during early stages of viral entry in plants, but in planta after viral inoculation. Since callose deposition is known to affect the feeding of piercing-sucking insects and viral entry, this contradicts the results in Fig. S4 and Fig. 2. It is difficult to understand how callose functioned in virus reproduction at 3 days post virus inoculation, and the authors also avoided explaining this mechanism.

      We appreciate your insightful comment and acknowledge that our initial description may not have been sufficiently clear.

      (1) Based on the EPG results, we found that LssaCA deficiency did not significantly affect total feeding time, time to first non-phloem phase, or time to first phloem feeding (Fig. S8A-D in the revised manuscript). However, the continuity of sap ingestion was disturbed: the N4 waveform of dsLssaCA SBPHs was occasionally interrupted for brief periods (newly added Fig. S8E in the revised manuscript), likely due to phloem blockage. In the revised manuscript, we have added this analysis to the Results section (Lines 285-291 and 578-587) and provided the EPG procedure in the Materials and Methods section (Lines 670-680).

      (2) We assessed RSV titers immediately post-feeding to confirm the inoculation viral loads (Fig. 2G) and at 3 dpf (Fig. 2H-I) to assess the in-planta effects following viral inoculation. This does not mean that callose functions in virus reproduction at 3 days post viral inoculation. Rather, callose deposition typically occurs immediately in response to insect feeding and virus inoculation. When measuring callose deposition, we allowed insects to feed for 24 h and quantified the callose levels immediately post-feeding. The EPG results showed that sap ingestion continuity was disrupted: the N4 waveform of dsLssaCA-treated SBPHs was occasionally interrupted for brief periods (newly added Fig. S8E in the revised manuscript), likely due to phloem blockage. We have reorganized the description to avoid confusion. Please see Lines 139-144 and Fig. S8E for details.

      (1) Significant data are missing, for example, the phenotypes of the transgenic plants and the RSV titers in the transgenic plants (OsTLP OE, ostlp). The staining of callose deposition was also unconvincing. The evidence that the RSV NP-LssaCA-OsTLP tripartite interaction enhances OsTLP enzymatic activity is insufficient.

      We thank the reviewer for this insightful comment.

      (1) We constructed OsTLP overexpression and mutant transgenic plants (OsTLP OE and ostlp) and assessed their phenotypes regarding RSV infection levels. Compared with wild-type plants, OsTLP OE plants exhibited accelerated growth, while ostlp plants showed growth inhibition. Following feeding by viruliferous L. striatellus, OsTLP OE plants had significantly higher RSV titers compared with wild-type plants, whereas ostlp mutant plants exhibited significantly lower RSV titers (Lines 221-228 and new Fig. 3I). These results indicate that OsTLP facilitates RSV infection in planta.

      (2) The images showing callose deposition staining are representative of 15 images from 3 independent insect treatments. In addition to the staining images, we quantified fluorescence intensity and measured callose concentration by ELISA.

      (2) In Figure 4a, there was an LssaCA signal in the fourth lane of the pull-down data. Did MBP also bind LssaCA? The description of the pull-down methods was somewhat rough. The GST pull-down and MBP pull-down methods should be described in more detail.

      We thank the reviewer for this helpful comment. MBP did not bind LssaCA. We have repeated the pull-down experiment and provide a clearer figure with improved results. We have also revised the GST pull-down and MBP pull-down methods and described them in more detail. Please refer to Lines 744-774 and Figure 4A for details.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review): 

      The medicinal leech preparation is an amenable system in which to understand how the underlying cellular networks for locomotion function. A previously identified non-spiking neuron (NS) was studied and found to alter the mean firing frequency of a crawl-related motoneuron (DE-3), which fires during the contraction phase of crawling. The data are mostly solid. Identifying upstream neurons responsible for crawl motor patterning is essential for understanding how rhythmic behavior is controlled.

      Review of Revision: 

      On a positive note, the rationale for the study is clearer to me now after reading the authors' responses to both reviewers, but that information, as described in the authors' responses, is minimally incorporated into the current revised paper. Incorporating a discussion of previous work on the NS cell has, indeed, improved the paper. 

      I suggested earlier that the paper be edited for clarity but not much text has been changed since the first draft. I will provide an example of the types of sentences that are confusing. The title of the paper is: "Phase-specific premotor inhibition modulates leech rhythmic motor output". Are the authors referring to the inhibition created by premotor neurons (e.g., on to the motoneurons) or the inhibition that the premotor neurons receive? 

      In this case, this is an interesting ambiguity: NS is inhibited, and that inhibition is directly transmitted to the motoneurons because both cells are electrically coupled. We believe that the title does not disguise the findings conveyed by the manuscript.

      I also find the paper still confusing with regard to the suggested "functional homology" with the vertebrate Renshaw cells. When the authors set up this expectation of homology (should be analogy) in the introduction and other sections of the paper, one would assume that the NS cell would be directly receiving excitation from a motoneuron (like DE-3) and, in turn, the motoneuron would then receive some sort of inhibitory input to regulate its firing frequency. Essentially, I have always viewed the Renshaw cells as nature's clever way to monitor the ongoing activity of a motoneuron while also providing recurrent feedback or "recurrent inhibition" to modify that cell's excitatory state. The authors present their initial idea below on line 62. Authors write: "These neurons are present as bilateral pairs in each segmental ganglion and are functional homologs of the mammalian Renshaw cells (Szczupak, 2014). These spinal cord cells receive excitatory inputs from motoneurons and, in turn, transmit inhibitory signals to the motoneurons (Alvarez and Fyffe, 2007)." 

      We agree with Reviewer #2: the correct term is "analogous," not "homologous." Thanks for pointing this out. We changed the term throughout the text.

      The Reviewer is also right in the appreciation of the role of Renshaw cells. NS plays exactly the role that the Reviewer describes. The ONLY difference is that NS is inhibited by the motoneurons and, in turn, transmits this inhibition to the motoneurons via the rectifying electrical junctions. To address the confusion that our description caused, we have modified the cited sentence accordingly (now lines 65-67).

      Minor note:

      I suggest re-writing this last sentence as "these" is confusing. Change to: 'In the spinal cord, Renshaw interneurons receive excitatory inputs from motoneurons and, in turn, transmit inhibitory signals to them (Alvarez and Fyffe, 2007).'

      Please see the changes mentioned above.

      Furthermore, the authors note that (line 69 on): "In the context of this circuit the activity of excitatory motoneurons evokes chemically mediated inhibitory synaptic potentials in NS. Additionally, the NS neurons are electrically coupled......In physiological conditions this coupling favors the transmission of inhibitory signals from NS to motoneurons." Based on what is being conveyed here, I see a disconnect with the "functional homology" being presented earlier. I may be missing something, but the Renshaw analogy seems to be quite different compared to what looks like reciprocal inhibition in the leech. If the authors want to make the analogy to Renshaw cells clearer, then they should make a simple ball and stick diagram of the leech system and visually compare it to the Renshaw/motoneuron circuit with regard to functionality. This simple addition would help many readers. 

      We have simplified the description regarding the Renshaw cell (lines 65-67) to avoid the “details” of the connectivity between the two circuits.

      This report focuses on NS neurons and their role in crawling; we mention the analogy with Renshaw cells to widen the interest of the results. We do not think that making a special diagram to compare how the two neurons play a similar role via different connections among the players is useful in the context of this manuscript.

      The Abstract, Authors write (line 19), "Specifically, we analyzed how electrophysiological manipulation of a premotor nonspiking (NS) neuron, that forms a recurrent inhibitory circuit (homologous to vertebrate Renshaw cells)...."

      First, a circuit would not be homologous to a cell, and the term homology implies a strict developmental/evolutionary commonality. At best, I would use the term functionally analogous but even then I am still not sure that they are functionally that similar (see comments above). 

      Reviewer #2 is right. We changed the sentence in line 20.

      Line 22: "The study included a quantitative analysis of motor units active throughout the fictive crawling cycle that shows that the rhythmic motor output in isolated ganglia mirrors the phase relationships observed in vivo." This sentence must be revised to indicate that not all of the extracellular units were demonstrated to be motor units. Revise to: 'The study included a quantitative analysis of identified and putative motor units active throughout the fictive crawling cycle that shows.....'

      Line 187 regarding identifying units as motoneurons: Authors write, "While multiple extracellular recordings have been performed previously (Eisenhart et al., 2000), these results (Figure 4) present the first quantitative analysis of motor units activated throughout the crawling cycle in this type of recordings." The authors cannot assume that the units in the recorded nerves belong only to motoneurons. Based on their first rebuttal, the authors seem to be reluctant to accept the idea that the extracellularly recorded units might represent a different class of neurons. They admit that some sensory neurons (with somata located centrally) do, indeed, travel out the same nerves recorded, but go on to explain why they would not be active. 

      The leech has a variety of sensory organs that are located in the periphery, and some of these sensory neurons do show rhythmic activity correlated with locomotor activity (see Blackshaw's early work). The numerous stretch receptors, in fact, have very large axons that pass through all the nerves recorded in the current paper. 

      In Fig. 4, it is interesting that the waveforms of all the units recorded in the PP nerve exhibit a reversal in waveform as compared to those in the DP nerve, which might indicate (based on bipolar differential recording) that the units in the PP nerve are being propagated in the opposite direction (i.e., are perhaps afferent). Rhythmic presynaptic inhibition and excitation is commonly seen for stretch receptors within the CNS (see the work of Burrows) and many such cells are under modulatory control. 

      Most likely, the majority of the units are from motoneurons, but we do not really know at this point. The authors should reframe their statements throughout the paper as: 'While multiple extracellular recordings have been performed previously (Eisenhart et al., 2000), these results (Figure 4) present the first quantitative analysis of multiple extracellular units, using spike sorting methods, which are activated throughout the crawling cycle.' In cases where the identity of the unit is known, then it is fine to state that, but when the identity of the unit is not known, then there should be some qualification and stated as 'putative motor units' 

      We understand the concern of Reviewer #2 regarding the type of neurons active during dopamine-induced crawling in isolated ganglia. However, we believe there is sufficient evidence to support that the recorded spikes originate from motoneurons. As readers may share the same concern, we have added a paragraph explaining why spikes from somatic sensory neurons such as P or T cells, or from stretch receptors, are unlikely to contribute (lines 206-214). We included the term putative in the abstract.

      The Methods section:

      Needs to include the full parameters that were used to assess whether bursting activity was qualified in ways to be considered crawling activity or not. Typically, crawl-like burst periods of no more than 25 seconds have been the limit for their qualification as crawling activity. In Fig 2F, for example, the inter-burst period is over 35 seconds; that coupled with an average 5 second burst duration would bring the burst period to 40 seconds, which is substantially out of range for there to be bursting relevant to crawl activity. Simply put, long DE-3 burst periods are often observed but may not be indicative of a crawl state as the CV motoneurons are no longer out of phase with DE-3. A number of papers have adopted this criterion. 

      We now indicate in the methods the range of period values measured in our experiments. For the Reviewer's information, we show here histograms depicting the variability of period and duty cycle values recorded in our experiments (control conditions). The Reviewer can see that the bursting activity of DE-3 falls within what has been published.

      Author response image 1.

      Crawling in isolated ganglia. A. Histogram of end-to-end periods during crawling in isolated ganglia. The dotted line indicates the mean obtained from the averages of all experiments. The solid black line represents the mean of all cycles across all experiments. B. As in A, for the duty cycle calculated using end-to-end periods. (n = 210 cycles from 45 ganglia obtained from 32 leeches in all cases.)
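      The period check discussed above can be illustrated with a minimal Python sketch. It is not the authors' analysis code. It assumes that an end-to-end period is the time between the ends of two consecutive bursts, and that the duty cycle is the duration of the burst closing that cycle divided by the period. The function name `crawl_cycle_stats`, the input format, and the 25 s default (taken from the reviewer's comment) are hypothetical choices made for illustration.

```python
def crawl_cycle_stats(bursts, max_period_s=25.0):
    """Per-cycle period and duty cycle from burst times.

    bursts: list of (onset_s, offset_s) tuples, sorted in time.
    max_period_s: assumed upper limit for a crawl-like period
                  (25 s, as quoted by the reviewer).
    """
    cycles = []
    for (_, off_prev), (on_cur, off_cur) in zip(bursts, bursts[1:]):
        # End-to-end period: end of previous burst to end of current burst.
        period = off_cur - off_prev
        # Duty cycle: fraction of the cycle occupied by the current burst.
        duty = (off_cur - on_cur) / period
        cycles.append({
            "period_s": period,
            "duty_cycle": duty,
            "within_limit": period <= max_period_s,
        })
    return cycles


if __name__ == "__main__":
    # Reviewer's Fig. 2F example: 5 s bursts separated by 35 s gaps.
    for cycle in crawl_cycle_stats([(0.0, 5.0), (40.0, 45.0), (80.0, 85.0)]):
        print(cycle)
```

      With 5 s bursts and 35 s inter-burst intervals, each cycle has a 40 s period and is flagged as outside the assumed 25 s limit, which matches the arithmetic in the reviewer's comment.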

      Reviewer #1 (Recommendations for the authors): 

      Minor comments-

      Line 100: "In the frame of the recurrent inhibitory circuit, NS is the target of inhibitory signals". Suggestion: 'Within the framework of the recurrent inhibitory circuit, NS is the target of inhibitory signals.' 

      Changed as suggested (line 107).

      Line 163: "This series of experiments proves that, as predicted based on the known circuit (Figure 1C), inhibitory signals onto NS premotor neurons were transmitted to DE-3 motoneurons and counteracted their excitatory drive during crawling, limiting their firing frequency". I think this sentence is too strong plus needs some editing. Suggestion: 'As predicted based on the known circuit (Figure 1C), this series of experiments indicates that inhibitory signals onto NS premotor neurons are transmitted to DE-3 motoneurons, thus limiting their firing frequency and counteracting their excitatory drive during crawling.'

      Changed as suggested.

      Lines 86, 292 and 304 and Fig 4 legend: "Different from DE-3, In-Phase units showed a marked decrease in the maximum bFF along time." Suggestion: Replace the word "along" with 'across' time. Also replace those words in the Fig 4 legend and Line 80...."along" (replace with 'across') the different stages of crawling. 

      Changed as suggested.

      Line 311: "bursts and a concurrent inhibitory input via NS (Figure 7). Coherent with this interpretation, the activity level of the Anti- Phase units was not influenced by these inhibitory signals". Suggestion: Replace the word "coherent" with 'consistent'. 

      Changed as suggested.

      Line 332: "...offer the particular advantage of allowing electrical manipulation of individual neurons in wildtype adults," I am unsure what the authors are attempting to convey. Not sure what they mean by "wildtype" in this context and why that would matter. 

      The term “wildtype” was removed.

      We thank Reviewer #2 for the suggested edits to the text.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This study advances the lab's growing body of evidence exploring higher-order learning and its neural mechanisms. They recently found that NMDA receptor activity in the perirhinal cortex was necessary for integrating stimulus-stimulus associations with stimulus-shock associations (mediated learning) to produce preconditioned fear, but it was not necessary for forming stimulus-shock associations. On the other hand, basolateral amygdala NMDA receptor activity is required for forming stimulus-shock memories. Based on these facts, the authors assessed: (1) why the perirhinal cortex is necessary for mediated learning but not direct fear learning, and (2) the determinants of perirhinal cortex versus basolateral amygdala necessity for forming direct versus indirect fear memories. The authors used standard sensory preconditioning and variants designed to manipulate the novelty and temporal relationship between stimuli and shock and, therefore, the attentional state under which associative information might be processed. Under experimental conditions where information would presumably be processed primarily in the periphery of attention (temporal distance between stimulus/shock or stimulus pre-exposure), perirhinal cortex NMDA receptor activation was required for learning indirect associations. On the other hand, when information would likely be processed in focal attention (novel stimulus contiguous with shock), basolateral amygdala NMDA activity was required for learning direct associations. Together, the findings indicate that the perirhinal cortex and basolateral amygdala subserve peripheral and focal attention, respectively. The authors provide support for their conclusions using careful, hypothesis-driven experimental design, rigorous methods, and integrating their findings with the relevant literature on learning theory, information processing, and neurobiology. Therefore, this work will be highly interesting to several fields.

      Strengths:

      (1) The experiments were carefully constructed and designed to test hypotheses that were rooted in the lab's previous work, in addition to established learning theory and information processing background literature.

      (2) There are clear predictions and alternative outcomes. The provided table does an excellent job of condensing and enhancing the readability of a large amount of data.

      (3) In a broad sense, attention states are a component of nearly every behavioral experiment. Therefore, identifying their engagement by dissociable brain areas and under different learning conditions is an important area of research.

      (4) The authors clearly note where they replicated their own findings, report full statistical measures, effect sizes, and confidence intervals, indicating the level of scientific rigor.

      (5) The findings raise questions for future experiments that will further test the authors' hypotheses; this is well discussed.

      Weaknesses:

      As a reader, it is difficult to interpret how first-order fear could be impaired while preconditioned fear is intact; it requires a bit of "reading between the lines".

      We appreciate the Reviewer’s point and have attempted to address it on lines 55-63 of the revised paper: “In a recent pair of studies, we extended these findings in two ways. First, we showed that S1 does not just form an association with shock in stage 2; it also mediates an association between S2 and the shock. Thus, S2 enters testing in stage 3 already conditioned, able to elicit fear responses (Wong et al., 2019). Second, we showed that this mediated S2-shock association requires NMDAR-activation in the PRh, as well as communication between the PRh and BLA (Wong et al., 2025). These findings raise two critical questions: 1) why is the PRh engaged for mediated conditioning of S2 but not for direct conditioning of S1; and 2) more generally, what determines whether the BLA and/or PRh is engaged for conditioning of the S1 and/or S2?”

      Reviewer #2 (Public review):

      Summary:

      This paper continues the authors' research on the roles of the basolateral amygdala (BLA) and the perirhinal cortex (PRh) in sensory preconditioning (SPC) and second-order conditioning (SOC). In this manuscript, the authors explore how prior exposure to stimuli may influence which regions are necessary for conditioning to the second-order cue (S2). The authors perform a series of experiments which first confirm prior results shown by the authors - that NMDA receptors in the PRh are necessary in SPC during conditioning of the first-order cue (S1) with shock to allow for freezing to S2 at test; and that NMDA receptors in the BLA are necessary for S1 conditioning during the S1-shock pairings. The authors then set out to test the hypothesis that the PRh encodes associations in a peripheral state of attention, whereas the BLA encodes associations in a focal state of attention, similar to the A1 and A2 states in Wagner's theory of SOP. To do this, they show that BLA is necessary for conditioning to S2 when the S2 is first exposed during a serial compound procedure - S2-S1-shock. To determine whether pre-exposure of S2 will shift S2 to a peripheral focal state, the authors run a design in which S2-S1 presentations are given prior to the serial compound phase. The authors show that this restores NMDA receptor activity within the PRh as necessary for the fear response to S2 at test. They then test whether the presence of S1 during the serial compound conditioning allows the PRh to support the fear responses to S2 by introducing a delay conditioning paradigm in which S1 is no longer present. The authors find that PRh is no longer required and suggest that this is due to S2 remaining in the primary focal state.

      Strengths:

      As with their earlier work, the authors have performed a rigorous series of experiments to better understand the roles of the BLA and PRh in the learning of first- and second-order stimuli. The experiments are well-designed and clearly presented, and the results show definitive differences in functionality between the PRh and BLA. The first experiment confirms earlier findings from the lab (and others), and the authors then build on their previous work to more deeply reveal how these regions differ in how they encode associations between stimuli. The authors have done a commendable job of pursuing these questions.

      Table 1 is an excellent way to highlight the results and provide the reader with a quick look-up table of the findings.

      Weaknesses:

      The authors have attempted to resolve the question of the roles of the PRh and BLA in SPC and SOC, which the authors have explored in previous papers. Laudably, the authors have produced substantial results indicating how these two regions function in the learning of first- and second-order cues, providing an opportunity to narrow in on possible theories for their functionality. Yet the authors have framed this experiment in terms of an attentional framework and have argued that the results support this particular framework and hypothesis - that the PRh encodes peripheral and the BLA encodes focal states of learning. This certainly seems like a viable and exciting hypothesis, yet I don't see why the results have been completely framed and interpreted this way. It seems to me that there are still some alternative interpretations that are plausible and should be included in the paper.

      We appreciate the Reviewer’s point and have attempted to address it on lines 566-594 of the Discussion: “An additional point to consider in relation to Experiments 3A, 3B, 4A and 4B is the level of surprise that rats experienced following presentations of the familiar S2 in stage 2. Specifically, in Experiments 3A and 3B, S2 was followed by the expected S1 (low surprise) and its conditioning required activation of NMDA receptors in the PRh and not the BLA. By contrast, in Experiments 4A and 4B, S2 was followed by omission of the expected S1 (high surprise) and its conditioning required activation of NMDA receptors in the BLA and not the PRh. This raises the possibility that surprise, or prediction error, also influences the way that S2 is processed in focal and peripheral states of attention. When prediction error is low, S2 is processed in the peripheral state of attention: hence, learning under these circumstances requires NMDA receptor activation in the PRh and not the BLA. By contrast, when prediction error is high, S2 is preserved in the focal state of attention: hence, learning under these circumstances requires NMDA receptor activation in the BLA and not the PRh. The impact of prediction error on the processing of S2 could be assessed using two types of designs. In the first design, rats are pre-exposed to S2-S1 pairings in stage 1 and this is followed by S2-S3-shock pairings in stage 2. The important feature of this design is that, in stage 2, the S2 is followed by surprise in omission of S1 and presentation of S3. Thus, if a large prediction error maintains processing of the familiar S2 in the BLA, we might expect that its conditioning in this design would require NMDA receptor activation in the BLA (in contrast to the results of Experiment 3B) and no longer require NMDA receptor activation in the PRh (in contrast to the results of Experiment 3A). 
      In the second design, rats are pre-exposed to S2 alone in stage 1 and this is followed by S2-[trace]-shock pairings in stage 2. The important feature of this design is that, in stage 2, the S2 is not followed by the surprising omission of any stimulus. Thus, if a small prediction error shifts processing of the familiar S2 to the PRh, we might expect that its conditioning in this design would no longer require NMDA receptor activation in the BLA (in contrast to the results of Experiment 4B) but, instead, require NMDA receptor activation in the PRh (in contrast to the results of Experiment 4A). Future studies will use both designs to determine whether prediction error influences the processing of S2 in the focus versus periphery of attention and, thereby, whether learning about this stimulus requires NMDA receptor activation in the BLA or PRh.”

      Reviewer #3 (Public review):

      Summary:

      This manuscript presents a series of experiments that further investigate the roles of the BLA and PRH in sensory preconditioning, with a particular focus on understanding their differential involvement in the association of S1 and S2 with shock.

      Strengths:

      The motivation for the study is clearly articulated, and the experimental designs are thoughtfully constructed. I especially appreciate the inclusion of Table 1, which makes the designs easy to follow. The results are clearly presented, and the statistical analyses are rigorous. My comments below mainly concern areas where the writing could be improved to help readers more easily grasp the logic behind the experiments.

      Weaknesses:

      (1) Lines 56-58: The two previous findings should be more clearly summarized. Specifically, it's unclear whether the "mediated S2-shock" association occurred during Stage 2 or Stage 3. I assume the authors mean Stage 2, but Stage 2 alone would not yet involve "fear of S2," making this expression a bit confusing.

      We apologise for the confusion and have revised the summary of our previous findings on lines 55-63. The revised text now states: “In a recent pair of studies, we extended these findings in two ways. First, we showed that S1 does not just form an association with shock in stage 2; it also mediates an association between S2 and the shock. Thus, S2 enters testing in stage 3 already conditioned, able to elicit fear responses (Wong et al., 2019). Second, we showed that this mediated S2-shock association requires NMDAR-activation in the PRh, as well as communication between the PRh and BLA (Wong et al., 2025). These findings raise two critical questions: 1) why is the PRh engaged for mediated conditioning of S2 but not for direct conditioning of S1; and 2) more generally, what determines whether the BLA and/or PRh is engaged for conditioning of the S1 and/or S2?”

      (2) Line 61: The phrase "Pavlovian fear conditioning" is ambiguous in this context. I assume it refers to S1-shock or S2-shock conditioning. If so, it would be clearer to state this explicitly.

      Apologies for the ambiguity - we have omitted the term “Pavlovian”, which may have been the source of confusion. The revised text on lines 60-63 now states: “These findings raise two critical questions: 1) why is the PRh engaged for mediated conditioning of S2 but not for direct conditioning of S1; and 2) more generally, what determines whether the BLA and/or PRh is engaged for conditioning of the S1 and/or S2?”

      (3) Regarding the distinction between having or not having Stage 1 S2-S1 pairings, is "novel vs. familiar" the most accurate way to frame this? This terminology could be misleading, especially since one might wonder why S2 couldn't just be presented alone on Stage 1 if novelty is the critical factor. Would "outcome relevance" or "predictability" be more appropriate descriptors? If the authors choose to retain the "novel vs. familiar" framing, I suggest providing a clear explanation of this rationale before introducing the predictions around Line 118.

      We have incorporated the suggestion regarding “predictability” while also retaining “novelty” as follows. 

      L76-85: “For example, different types of arrangements may influence the substrates of conditioning to S2 by influencing its novelty and/or its predictive value at the time of the shock, on the supposition that familiar stimuli are processed in the periphery of attention and, thereby, the PRh (Bogacz & Brown, 2003; Brown & Banks, 2015; Brown & Bashir, 2002; Martin et al., 2013; McClelland et al., 2014; Morillas et al., 2017; Murray & Wise, 2012; Robinson et al., 2010; Suzuki & Naya, 2014; Voss et al., 2009; Yang et al., 2023) whereas novel stimuli are processed in the focus of attention and, thereby, the amygdala (Holmes et al., 2018; Qureshi et al., 2023; Roozendaal et al., 2006; Rutishauser et al., 2006; Schomaker & Meeter, 2015; Wright et al., 2003).”

      L116-120: “Subsequent experiments then used variations of this protocol to examine whether the engagement of NMDAR in the PRh or BLA for Pavlovian fear conditioning is influenced by the novelty/predictive value of the stimuli at the time of the shock (second implication of theory) as well as their distance or separation from the shock (third implication of theory; Table 1).”

      (4) Line 121: This statement should refer to S1, not S2.

      (5) Line 124: This one should refer to S2, not S1.

      We have checked the text on these lines for errors and confirmed that the statements are correct. The lines encompassing this text (L121-130) are reproduced here for convenience:

      (1) When rats are exposed to novel S2-S1-shock sequences, conditioning of S2 and S1 will be disrupted by a DAP5 infusion into the BLA but not into the PRh (Experiments 2A and 2B);

      (2) When rats are exposed to S2-S1 pairings and then to S2-S1-shock sequences, conditioning of S2 will be disrupted by a DAP5 infusion into the PRh but not the BLA whereas conditioning of S1 will be disrupted by a DAP5 infusion into the BLA not the PRh (Experiments 3A and 3B);

      (3) When rats are exposed to S2-S1 pairings and then to S2 (trace)-shock pairings, conditioning of S2 will be disrupted by a DAP5 infusion into the BLA not the PRh (Experiments 4A and 4B).

      (6) Additionally, the rationale for Experiment 4 is not introduced before the Results section. While it is understandable that Experiment 4 functions as a follow-up to Experiment 3, it would be helpful to briefly explain the reasoning behind its inclusion.

      Experiment 4 follows from the results obtained in Experiment 3; and, as noted, the reasoning for its inclusion is provided locally in its introduction. We attempted to flag this experiment earlier in the general introduction to the paper; but this came at the cost of clarity to the overall story. As such, our revised paper retains the local introduction to this experiment. It is reproduced here for convenience:

      “In Experiments 3A and 3B, conditioning of the pre-exposed S1 required NMDAR-activation in the BLA and not the PRh; whereas conditioning of the pre-exposed S2 required NMDAR-activation in the PRh and not the BLA. We attributed these findings to the fact that the pre-exposed S2 was separated from the shock by S1 during conditioning of the S2-S1-shock sequences in stage 2: hence, at the time of the shock, S2 was no longer processed in the focal state of attention supported by the BLA; instead, it was processed in the peripheral state of attention supported by the PRh.

      “Experiments 4A and 4B employed a modification of the protocol used in Experiments 3A and 3B to examine whether a pre-exposed S1 influences the processing of a pre-exposed S2 across conditioning with S2-S1-shock sequences. The design of these experiments is shown in Figure 4A. Briefly, in each experiment, two groups of rats were exposed to a session of S2-S1 pairings in stage 1 and, 24 hours later, a session of S2-[trace]-shock pairings in stage 2, where the duration of the trace interval was equivalent to that of S1 in the preceding experiments. Immediately prior to the trace conditioning session in stage 2, one group in each experiment received an infusion of DAP5 or vehicle only into either the PRh (Experiment 4A) or BLA (Experiment 4B). Finally, all rats were tested with presentations of the S2 alone in stage 3. If the substrates of conditioning to S2 are determined only by the amount of time between presentations of this stimulus and foot shock in stage 2, the results obtained in Experiments 4A and 4B should be the same as those obtained in Experiments 3A and 3B: acquisition of freezing to S2 will require activation of NMDARs in the PRh and not the BLA. If, however, the presence of S1 in the preceding experiments (Experiments 3A and 3B) accelerated the rate at which processing of S2 transitioned from the focus of attention to its periphery, the results obtained in Experiments 4A and 4B will differ from those obtained in Experiments 3A and 3B. That is, in contrast to the preceding experiments where acquisition of freezing to S2 required NMDAR-activation in the PRh and not the BLA, here acquisition of freezing to S2 should require NMDAR-activation in the BLA but not the PRh.”

      Reviewer #1 (Recommendations for the authors):

      I greatly enjoyed reading and reviewing this manuscript, and so I only have boilerplate recommendations.

      (1) I might add a couple of sentences discussing how/why preconditioned fear could be intact while first-order fear is impaired. Of course, if I am interpreting the provided interpretation correctly, the reason is that peripheral processing is still intact even when BLA NMDA receptors are blocked, and so mediated conditioning still occurs. Does this mean that mediated conditioning does not require learning the first-order relationship, and that they occur in parallel? Perhaps I just missed this, but I cannot help but wonder whether/how the psychological processes at play might change when first-order learning is impaired, so this would be greatly appreciated.

      As noted above, we have revised the general introduction (around lines 55-59) to clarify that the direct S1-shock and mediated S2-shock associations form in parallel. Hence, manipulations that disrupt first-order fear to the S1 (such as a BLA infusion of the NMDA receptor antagonist, DAP5) do not automatically disrupt the expression of sensory preconditioned fear to the S2.

      (2) Adding to the above - does the SOP or another theory predict serial vs parallel information flow from focal state to peripheral, or perhaps it is both to some extent?

      SOP predicts both serial and parallel processing of information in its focal and peripheral states. That is, some proportion of the elements that comprise a stimulus may decay from the focal state of attention to the periphery (serial processing); hence, at any given moment, the elements that comprise a stimulus can be represented in both focal and peripheral states (parallel processing).

      Given the nature of the designs and tools used in the present study (between-subject assessment of a DAP5 effect in the BLA or PRh), we selected parameters that would maximize the processing of the S2 and S1 stimuli in one or the other state of activation; hence the results of the present study. We are currently examining the joint processing of stimulus elements across focal and peripheral states using simultaneous recordings of activity in the BLA and PRh. These recordings are collected from rats trained in the different stages of a within-subject sensory preconditioning protocol. The present study created the basis for this work, which will be published separately in due course.

      (3) The organization of PRh vs BLA is nice and consistent across each figure, but I would suggest adding any kind of additional demarcation beyond the colors and text, maybe just more space between AB / CD. The figure text indicating PRh/BLA is a bit small.

      Thank you for the suggestion – we have added more space between the top and bottom panels of the figure.

      (4) Line 496 typo ..."in the BLA but not the BLA".

      Apologies for the typo - this has been corrected.

      Reviewer #2 (Recommendations for the authors):

      I found the experiments to be extremely well-designed and the results convincing and exciting. The hypothesis of the focal and peripheral states of attention being encoded by BLA and PRh respectively, is enticing, yet as indicated in the public review, this does not seem to be the only possible interpretation. This is my only serious comment for the authors.

      (1) I think it would be worth reframing the article slightly to give credence to alternative hypotheses. Not to say that the authors' intriguing hypothesis shouldn't be an integral part of the introduction, but no alternatives are mentioned. In experiment 2, could the fact that S2 is already a predictor of S1 not block new learning to S2? In the framework of stimulus-stimulus associations, there would be no surprise in the serial-compound stage of conditioning at the onset of S1. This may prevent direct learning of the S2-shock association within the BLA. This type of association may as well fall under the peripheral/focal theory, but I don't think it's necessary to frame this possibility in terms of a peripheral/focal theory. To build on this alternative interpretation, the absence of S1 in experiment 4 may induce a prediction error (S2 predicts S1, but it's omitted), which could support learning by S2. The peripheral and focal states appear to correspond to A2 and A1 in SOP extremely well, and I think it would potentially add interest and support. If the authors do intend to make the paper a strong argument for their hypothesis, perhaps a few additional experiments may be introduced. If the novelty of S2 is critical for S2 not to be processed in a focal state during the serial compound stage, could pre-exposure of S2 alone allow for dependence of S2-shock on the PRh? Assuming this is what the authors would predict, this might disentangle the S-S theory mentioned above from the peripheral/focal theory. Or perhaps run an experiment S2-X in stage 1 and S2-S1-shock in stage 2? This said, I think the experiments are more than sufficient for an exciting paper as is, and I don't think running additional experiments is necessary. I would only argue for this if the authors make a hard claim about the peripheral/focal theory, as is the case for the way the paper is currently written.

      We appreciate the reviewer’s excellent point and suggestions. We have included an additional paragraph in the Discussion on page 24 (lines 566-594).  “An additional point to consider in relation to Experiments 3A, 3B, 4A and 4B is the level of surprise that rats experienced following presentations of the familiar S2 in stage 2. Specifically, in Experiments 3A and 3B, S2 was followed by the expected S1 (low surprise) and its conditioning required activation of NMDA receptors in the PRh and not the BLA. By contrast, in Experiments 4A and 4B, S2 was followed by omission of the expected S1 (high surprise) and its conditioning required activation of NMDA receptors in the BLA and not the PRh. This raises the possibility that surprise, or prediction error, also influences the way that S2 is processed in focal and peripheral states of attention. When prediction error is low, S2 is processed in the peripheral state of attention: hence, learning under these circumstances requires NMDA receptor activation in the PRh and not the BLA. By contrast, when prediction error is high, S2 is preserved in the focal state of attention: hence, learning under these circumstances requires NMDA receptor activation in the BLA and not the PRh. The impact of prediction error on the processing of S2 could be assessed using two types of designs. In the first design, rats are pre-exposed to S2-S1 pairings in stage 1 and this is followed by S2-S3-shock pairings in stage 2. The important feature of this design is that, in stage 2, the S2 is followed by surprise in omission of S1 and presentation of S3. Thus, if a large prediction error maintains processing of the familiar S2 in the BLA, we might expect that its conditioning in this design would require NMDA receptor activation in the BLA (in contrast to the results of Experiment 3B) and no longer require NMDA receptor activation in the PRh (in contrast to the results of Experiment 3A). 
In the second design, rats are pre-exposed to S2 alone in stage 1 and this is followed by S2-[trace]-shock pairings in stage 2. The important feature of this design is that, in stage 2, the S2 is not followed by the surprising omission of any stimulus. Thus, if a small prediction error shifts processing of the familiar S2 to the PRh, we might expect that its conditioning in this design would no longer require NMDA receptor activation in the BLA (in contrast to the results of Experiment 4B) but, instead, require NMDA receptor activation in the PRh (in contrast to the results of Experiment 4A). Future studies will use both designs to determine whether prediction error influences the processing of S2 in the focus versus periphery of attention and, thereby, whether learning about this stimulus requires NMDA receptor activation in the BLA or PRh.”

      (3) I was surprised the authors didn't frame their hypothesis more in terms of Wagner's SOP model. It was only minimally mentioned, and I think the authors' theory would benefit if it were included more in the introduction. I was wondering whether the authors may have avoided this framing to avoid an expectation for modeling SOP in their design. If this were the case, I think the paper stands on its own without modeling, and at least for myself, a comparison to SOP would not require modeling of SOP. If this was the authors' concern for avoiding it, I would suggest to the authors that they need not be concerned about it.

      We appreciate the endorsement of Wagner’s SOP theory as a nice way of framing our results. We are currently working on a paper in which we use simulations to show how Wagner’s theory can accommodate the present findings as well as others in the literature on sensory preconditioning. For this reason, we have not changed the current paper in relation to this point.

    1. Author response:

      Reviewer #1 (Public review)

      I have to preface my evaluation with a disclosure that I lack the mathematical expertise to fully assess what seems to be the authors' main theoretical contribution. I am providing this assessment to the best of my ability, but I cannot substitute for a reviewer with more advanced mathematical/physical training.

      Summary:

      This paper describes a new theoretical framework for measuring parsimony preferences in human judgments. The authors derive four metrics that they associate with parsimony (dimensionality, boundary, volume, and robustness) and measure whether human adults are sensitive to these metrics. In two tasks, adults had to choose one of two flower beds which a statistical sample was generated from, with or without explicit instruction to choose the flower bed perceptually closest to the sample. The authors conduct extensive statistical analyses showing that humans are sensitive to most of the derived quantities, even when the instructions encouraged participants to choose only based on perceptual distance. The authors complement their study with a computational neural network model that learns to make judgments about the same stimuli with feedback. They show that the computational model is sensitive to the tasks communicated by feedback and only uses the parsimony-associated metrics when feedback trains it to do so.

      Strengths:

      (1)  The paper derives and applies new mathematical quantities associated with parsimony. The mathematical rigor is very impressive and is much more extensive than in most other work in the field, where studies often adopt only one metric (such as the number of causes or parameters). These formal metrics can be very useful for the field.

      (2)  The studies are preregistered, and the statistical analyses are strong.

      (3)  The computational model complements the behavioral findings, showing that the derived quantities are not simply equivalent to maximum-likelihood inference in the task.

      (4)  The speculations in the discussion section (e.g., the idea that human sensitivity is driven by the computational demands each metric requires) are intriguing and could usefully guide future work.

      Weaknesses:

      (1) The paper is very hard to understand. Many of the key details of the derived metrics are in the appendix, with very little accessible explanation in the main text. The figures helped me understand the metrics somewhat, although I am still not sure how some of them (such as boundary or robustness as measured here) are linked to parsimony. I understand that this is addressed by the derivations in the appendix, but as a computational cognitive scientist, I would have benefited from more accessible explanations. Important aspects of the human studies are also missing from the main text, such as the sample size for Experiment 2.

      (2) It is not fully clear whether the sensitivity of human participants to some of the quantities convincingly reported here actually means that participants preferred shapes according to the corresponding aspect of parsimony. The title and framing suggest that parsimony "guides" human decision-making, which may lead readers to conclude that humans prefer more parsimonious shapes. I am not sure the sensitivity findings alone support this framing, but it might just be my misunderstanding of the analyses.

      (3) The stimulus set included only four combinations of shapes, each designed to diagnostically target one of the theoretical quantities. It is unclear whether the results are robust or specific to these particular 4 stimuli.

      (4) The study is framed as measuring "decision-making," but the task resembles statistical inference (e.g., which shape generated the data) or perceptual judgment. This is a minor point since "decision-making" is not well defined in the literature, yet the current framing in the title gave me the initial impression that humans would be making preference choices and learning about them over time with feedback.

      We are grateful for the supportive comments highlighting the rigor of our experimental design and data analysis. The Reviewer lists four points under “weaknesses”, to which we reply below. 

      (1)  The paper is very hard to understand

      In the revised version of the paper, we will expand the main text to include a more detailed and intuitive description of the terms of the Fisher Information Approximation, in particular clarifying the interpretation of robustness and boundary as parsimony. We also will include more details that are now given only in Methods, such as the sample size for the second experiment. 

      (2) Sensitivity of human participants 

      We do argue, and believe, that our data show that people tend to prefer simpler shapes. However, giving a well-posed definition of "preference" in this context turns out to be nontrivial.

      At the very least, any statement such as "people prefer shape A over B" should be qualified with something like “when the distance of the data from both shapes is the same.” In other words, one should control for goodness-of-fit. Even before making any reference to our behavioral model, this phenomenon (a preference for the simpler model when goodness of fit is matched between models) is visible in Figure 3a, where the effective decision boundary used by human participants is closer to the more complex model than the cyan line representing the locus of points with equal goodness of fit under the two models (or equivalently, with the same Euclidean distance from the two shapes). The goal of our theory and our behavioral model is precisely to systematize this sort of control, extending it beyond just goodness-of-fit and allowing us to control simultaneously for multiple features of model complexity that may affect human behavior in different ways. In other words, it allows us not only to ask whether people prefer shape A over B after controlling for the distance of the data to the shapes, but also to understand to what extent this preference is driven by important geometrical features such as dimensionality, volume, curvature, and boundaries of the shapes. More specifically, and importantly, our theory makes it possible to measure the strength of the preference, rather than merely asserting its existence. In our modeling framework, the existence of a preference for simpler shapes is captured by the fact that the estimated sensitivities to the complexity penalties are positive (and although they differ in magnitude, all are statistically reliable).
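
      The logic of this control can be illustrated with a minimal sketch. It is not the behavioral model fitted in the paper: it assumes a simple logistic choice rule in which the log-odds of choosing shape A over shape B combine the difference in goodness of fit with the difference in a single complexity penalty. The function name `p_choose_a`, the weights `w_fit` and `w_penalty`, and the numerical values are hypothetical and chosen only for illustration.

```python
import math

def p_choose_a(fit_a, fit_b, penalty_a, penalty_b, w_fit=1.0, w_penalty=1.0):
    """Hypothetical logistic choice rule (illustrative, not the paper's model).

    fit_a, fit_b         : goodness of fit of the data under each shape (higher is better)
    penalty_a, penalty_b : complexity penalty of each shape (higher is more complex)
    w_fit                : assumed sensitivity to goodness of fit
    w_penalty            : assumed sensitivity to the complexity penalty
    """
    # Better fit raises the log-odds of choosing A; higher complexity lowers them.
    log_odds = w_fit * (fit_a - fit_b) - w_penalty * (penalty_a - penalty_b)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Goodness of fit is matched and A is the simpler shape:
# a positive penalty sensitivity pushes the choice probability above 0.5.
p_matched = p_choose_a(fit_a=-2.0, fit_b=-2.0, penalty_a=0.5, penalty_b=1.5)

# With zero penalty sensitivity the same trial yields indifference.
p_indifferent = p_choose_a(-2.0, -2.0, 0.5, 1.5, w_penalty=0.0)
```

      Under these assumptions, a positive `w_penalty` corresponds to a preference for the simpler shape once goodness of fit is controlled for, and its magnitude sets how far the effective decision boundary shifts toward the more complex shape.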

      (3) Generalization to different shapes  

      Thank you for bringing up this important topic. First, note that while dimensionality and volume are global properties of models and only take two possible values in our human tasks, the boundary and robustness penalties depend on the model and on the data and therefore assume a continuum of values through the tasks (note also that the boundary penalty is relevant for all task types, not just the one designed specifically to study it, because all models except the zero-dimensional dot have boundaries). Therefore, our experimental setting is less restrictive than it may seem, because it explores a range of possible values for two of the four model features. However, we agree that it would be interesting to repeat our experiment with a broader range of models, perhaps allowing their dimensionality and volume to vary more. In the same spirit, it would be interesting to study the dependence of human behavior on the amount of available data. We believe that these are all excellent ideas for further study that exceed the scope of the present paper. We will include these important points in a revised Discussion.

      (4) Usage of “decision making” vs “perceptual judgment”

      Thank you. We will clarify better in the text that our usage of “decision making” overlaps with the idea of a perceptual judgment and that our experiments do not tackle sequential aspects of repeated decisions. 

      Reviewer #2 (Public review):

      This manuscript presents a sophisticated investigation into the computational mechanisms underlying human decision-making, and it presents evidence for a preference for simpler explanations (Occam's razor). The authors dissect the simplicity bias into four different components, and they design experiments to target each of them by presenting choices whose underlying models differ only in one of these components. In the learning tasks, participants must infer a "law" (a logical rule) from observed data in a way that operationalizes the process of scientific reasoning in a controlled laboratory setting. The tasks are complex enough to be engaging but simple enough to allow for precise computational modeling.

      As a further novel feature, the authors derive an additional term in the expansion of the log-evidence, which arises from boundary terms. This is combined with a choice model, which is the one that is tested in experiments. Experiments are run both with humans and with artificial intelligence agents, showing that humans have an enhanced preference for simplicity as compared to artificial neural networks.

      Overall, the work is well written, interesting, and timely, bridging concepts in statistical inference and human decision making. Although technical details are rather elaborate, my understanding is that they represent the state of the art.

      I have only one main comment that I think deserves further discussion. Computing the complexity penalty of models may be hard. It is unlikely that humans can perform such a calculation on the fly. As the authors discuss in the final section, while the dimensionality term may be easier to compute, others (e.g., the volume term, which requires an integral) may be considerably harder to compute (it is true that they should be computed once and for all for each task, but still...). I wonder whether the sensitivity of human decision making with reference to the different terms is so different, and in particular whether it aligns with computational simplicity, or with the possibility of approximating each term by simple heuristics. Indeed, the sensitivity to the volume term is significantly and systematically lower than that of other terms. I wonder whether this relation could be made more quantitative using neural networks, using as a proxy of computational hardness the number of samples needed to reach a given error level in learning each of these terms.

      Thank you. The computational complexity associated with calculating the different terms and its potential connection to human sensitivity to the terms is an intriguing topic. As we hinted at in the discussion, we agree with the reviewer that this is a natural candidate for further research, which likely deserves its own study and exceeds the scope of the present paper. 

      As a minor aside, at least for the present task the volume term may not be that hard to compute, because it can be expressed with the number of distinguishable probability distributions in the model (Balasubramanian 1996). Given the nature of our task, where noise is Gaussian, isotropic and with known variance, the geometry of the model is actually the Euclidean geometry of the plane, and the volume is simply the (log of the) length of the line that represents the one-dimensional models, measured in units of the standard deviation of the noise.
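
      As a rough illustration of this point, the sketch below computes the log of the length of a one-dimensional model, represented here as a polyline in the plane, in units of the noise standard deviation. The function name `volume_term`, the polyline representation, and the example coordinates are assumptions made for illustration; additive constants that depend on the convention used for the approximation are ignored.

```python
import math

def volume_term(points, sigma):
    """Log of a polyline's Euclidean length, in units of the noise SD.

    points : list of (x, y) vertices describing a one-dimensional model (assumed input format)
    sigma  : standard deviation of the isotropic Gaussian noise
    """
    # Sum the lengths of consecutive segments to get the total length of the line.
    length = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    return math.log(length / sigma)

# Hypothetical example: a straight segment of length 5 with unit noise.
v_unit = volume_term([(0.0, 0.0), (3.0, 4.0)], sigma=1.0)

# Doubling the noise SD halves the length in noise units,
# so the term drops by log(2).
v_noisy = volume_term([(0.0, 0.0), (3.0, 4.0)], sigma=2.0)
```

      In this setting, computing the term only requires measuring a length and knowing the noise scale, which is what makes it comparatively cheap for the present task.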

      Reviewer #3 (Public review):

      Summary:

      This is a very interesting paper that documents how humans use a variety of factors that penalize model complexity and integrate over a possible set of parameters within each model. By comparison, trained neural networks also use these biases, but only on tasks where model selection was part of the reward structure. In the situation where training emphasizes maximum-likelihood decisions, only neural networks, but not humans, were able to adapt their decision-making. Humans continue to use model integration simplicity biases.

      Strengths:

      This study used a pre-registered plan for analyzing human data, which exceeds the standards compared to other current studies.

      The results are technically correct.

      Weaknesses:

      The presentation of the results could be improved.

      We thank the reviewer for their appreciation of our experimental design and methodology, and for pointing out (in the separate "recommendations to authors") a few passages of the paper where the presentation could be improved. We will clarify these passages in the revision.

    1. Author response:

      We thank the reviewers for their thoughtful and constructive comments. We are pleased that they found the study technically strong and the integration of EEG decoding, immersive VR, and eye tracking valuable.

      Across all three reviews, several points of clarification emerged. In our revision, we will focus on:

      (1) Improving clarity and structure of the manuscript (Reviewer #1).

      We will strengthen the flow between the Methods and Results subsections and include explicit concluding statements for the single results.

      (2) Emphasizing the methodological scope and limitations in terms of stimulus set and generalizability (Reviewers #2 and #3).

      We will further emphasize that a key objective was to establish, for the first time, the methodological feasibility of decoding facial features (especially emotional expressions) under VR conditions, and that our stimulus set (consisting of facial expressions that were easy to distinguish) limits (a) the task-relevance (and thus possibly the neural integration) of depth information and (b) the generalizability to less easily distinguishable settings. We appreciate the suggestion of an inverted-face control to further investigate the extent to which the decoding results were based on low-level features; however, we do not plan a follow-up experiment at this stage; instead, we will discuss this limitation more explicitly.

      We believe these revisions will substantially strengthen the manuscript and further highlight its methodological focus.

    1. Author response:

      Thanks for these insightful reviews and your summary assessment. We certainly agree that ours was a laboratory study with a single specialized insect, and both mixture types had all five compounds (controlling for total toxin concentration). Thus, our conclusion that naturally occurring toxins (within the cardenolide class) have non-additive combined effects on the specialized, sequestering monarch is constrained by our experimental conditions. In our assay we used two mixture types, equimolar and “natural” proportions. We acknowledge that the natural proportions will vary with plant age, damage history, etc. of the host plant, Asclepias curassavica. Our proportions were based on growing the plants a few different times under variable conditions. Although we did not conduct these experiments on non-adapted insects, we discuss a related experiment that was conducted with wild-type and genetically engineered Drosophila (Lopez-Goldar et al. 2024, PNAS). In sum, we appreciate the reviewers’ comments.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Faiz et al. investigate small molecule-driven direct lineage reprogramming of postnatal mouse astrocytes to oligodendrocyte lineage cells (OLCs). They use a combination of in vitro, in vivo, and computational approaches to confirm lineage conversion and to examine the key underlying transcription factors and signaling pathways. Lentiviral delivery of transcription factors previously reported to be essential in OLC fate determination (Sox10, Olig2, and Nkx2.2) to astrocytes allows for lineage tracing. They found that these transcription factors are sufficient in reprogramming astrocytes to iOLCs, but that the OLCs range in maturity level depending on which factor they are transfected with. They followed up with scRNA-seq analysis of transfected and control cultures 14DPT, confirming that TF-induced astrocytes take on canonical OLC gene signatures. By performing astrocyte lineage fate mapping, they further confirmed that TF-induced astrocytes give rise to iOLCs. Finally, they examined the distinct genetic drivers of this fate conversion using scRNA-seq and deep learning models of Sox10- astrocytes at multiple time points throughout the reprogramming. These findings are certainly relevant to diseases characterized by the perturbation of OLC maturation and/or myelination, such as Multiple Sclerosis and Alzheimer's Disease. Their application of such a wide array of experimental approaches gives more weight to their findings and allows for the identification of additional genetic drivers of astrocyte to iOLC conversion that could be explored in future studies. Overall, I find this manuscript thoughtfully constructed and only have a few questions to be addressed.

      (1) The authors suggest that Sox10- and Olig2- transduced astrocytes result in distinct subpopulations of iOLCs. Considering it was discussed in the introduction that these TFs cyclically regulate one another throughout differentiation, could they speculate as to why such varying iOLCs resulted from the induction of these two TFs?

      We thank the Reviewer for the opportunity to speculate. We hypothesize that Sox10 and Olig2 may induce different OLCs as a result of differential activation of downstream genes within the gene regulatory network, which are important for OPC, committed OLC and mature OL identity [1]. In support of this, we found different expression levels of genes involved in downstream OLC specification networks [1], including Sox6, Tcf7l2 and Myrf, at D14 (Author response image 1), following further analysis of our RNA-seq data.

      Author response image 1.

      Expression of OLC regulatory network genes in Sox10- and Olig2- cultures. Violin plots show gene expression levels (log-normalized) of downstream OLC regulatory genes (Sox6, Zeb2, Tcf7l2, Myrf, Zfp488, Nfatc2, Hes5, Id2) between Sox10 and Olig2 treated OLCs at 14 days post transduction. Analysis was performed on oligodendrocyte progenitor and mature oligodendrocyte clusters (from Manuscript Figure 1D, clusters 3 and 8).

      (2) In Figure 1B it appears that the number of Sox10- MBP+ tdTomato+ cells decreases from D12 to D14. Does this make sense considering MBP is a marker of more mature OLCs?

      Thank you for this comment. To address this, we compared the number of MBP+tdTomato+ Sox10 cells across reprogramming timepoints. We saw no difference between the number of MBP+tdTomato+ OLs at D12 and D14 (Author response image 2, p = 0.2314). However, we do see a [nonsignificant] decrease in MBP+tdTomato+ Sox10 cells from D12 to D22 (Manuscript Supplementary Figure 3B, Author response image 2, p = 0.0543), which suggests that culture conditions are not optimal for longer-term cell survival [2], [3], [4].

      Author response image 2.

      Comparison of Sox10- induced MBP+tdTomato+ iOLCs over time. Quantification of MBP<sup>+</sup>tdTomato<sup>+</sup> iOLs in Sox10 cultures at D8 (n=5), D10 (n=5), D12 (n=5), D14 (n=7) and D22 (n=3) post transduction. Data are presented as mean ± SEM, each data point represents one individual cell culture experiment, Brown-Forsythe and Welch ANOVA on transformed percentages with Dunnett’s T3 multiple comparisons test (*= p<0.05).  

      (3) Previous studies have shown that MBP expression and myelination in vitro occur at the earliest around 4-6 weeks of culturing. When assessing whether further maturation would increase MBP positivity, authors only cultured cells up to 22 DPT and saw no significant increase. Has a lengthier culture timeline been attempted?

      We agree with the Reviewer that previous studies of pluripotent stem cell-derived (hESC or iPSC) cultures have shown MBP+ OLCs in vitro around 4-6 weeks [5], [6], [7]. However, studies of neural stem cell [8] or fibroblast [9] conversion show OLC appearance after 7 and 24 days, respectively, demonstrating that OLCs can be generated in vitro within 1-3 weeks of plating. Moreover, as noted above in response to #2, we see fewer MBP+ cells at 22 DPT, suggesting that extended time in culture may require additional factors for support. Therefore, we did not attempt longer timepoints.

      (4) Figure S4D is described as "examples of tdTomato<sup>neg</sup>zsGreen<sup>+</sup>OLCmarker<sup>+</sup> cells that arose from a tdTomato<sup>neg</sup> cell with an astrocyte morphology." The zsGreen+ tdTomato- cell is not convincingly of "astrocyte morphology"; it could be a bipolar OLC. To strengthen the conclusions and remove this subjectivity, more extensive characterizations of astrocyte versus OLC morphology in the introduction or results are warranted. This would make this observation more convincing since there is clearly an overlap in the characteristics of these cell types.

      We thank the reviewer for this excellent suggestion. To assess astrocyte morphology, we measured the cell size, nucleus size, number of branches and branch thickness of 70 Aldh1l1+tdTomato+ astrocytes in tamoxifen-labelled Aldh1l1-CreERT2;Ai14 cultures (new Supplemental Table 1). To assess OPC morphology, we performed IHC for PDGFRa in iOLC cultures and measured the same parameters in 70 PDGFRa+ OPCs (new Supplemental Table 1). We found that astrocytes were characterized by larger branch thickness, cell length and nucleus size, while OPCs showed a larger number of branches (new Supplemental Figure 1, and Author response image 3 below). Based on this framework, the AAV9-GFAP::zsGreen<sup>pos</sup>Aldh1l1-tdTomato<sup>neg</sup> and AAV9-GFAP::zsGreen<sup>pos</sup>Aldh1l1-tdTomato<sup>pos</sup> starting cells tracked fall within the bounds of ‘astrocytes’. We have revised the manuscript to include this more rigorous characterization (Line 119-124, Page 4; Line 307-312, Page 9; Line 323-326, Page 9). We also demonstrate (below) that the GFAP::zsGreen<sup>pos</sup>Aldh1l1-tdTomato<sup>pos</sup> and GFAP::zsGreen<sup>pos</sup>Aldh1l1-tdTomato<sup>neg</sup> starting cells depicted in Figure 2G and Supplemental Figure 5D, respectively, are consistent with astrocyte morphology (Author response image 3).

      Author response image 3.

      Morphological characterization of astrocytes, oligodendrocyte lineage cells, and starting cells. Quantification of the (A) cell length, (B) nucleus size, (C) number of branches, and (D) branch thickness in Aldh1l1+tdTomato+ astrocytes and PDGFRα+ OPCs (n = 70 per cell type, data are presented as mean ± SEM). Orange line indicates parameter value for GFAP::zsGreen<sup>pos</sup>Aldh1l1-tdTomato<sup>pos</sup> starting cell in Figure 2G. Green line indicates parameter value for GFAP::zsGreen<sup>pos</sup>Aldh1l1-tdTomato<sup>neg</sup> starting cell in Supplemental Figure 5D.

      Reviewer #2 (Public Review):             

      The study by Bajohr investigates the important question of whether astrocytes can generate oligodendrocytes by direct lineage conversion (DLR). The authors ectopically express three transcription factors - Sox10, Olig2 and Nkx6.2 - in cultured postnatal mouse astrocytes and use a combination of Aldh1l1-astrocyte fate mapping and live cell imaging to demonstrate that Sox10 converts astrocytes to MBP+ oligodendrocytes, whereas Olig2 expression converts astrocytes to PDGFRalpha+ oligodendrocyte progenitor cells. Nkx6.2 does not induce lineage conversion. The authors use single-cell RNAseq over 14 days post-transduction to uncover molecular signatures of newly generated iOLs.

      The potential to convert astrocytes to oligodendrocytes has been previously analyzed and demonstrated. Despite the extensive molecular characterization of the direct astrocyte-oligodendrocyte lineage conversion, the paper by Bajohr et al. does not represent significant progress. The entire study is performed in cultured cells, and it is not demonstrated whether this lineage conversion can be induced in astrocytes in vivo, particularly at which developmental stage (postnatal, adult?) and in which brain region. The authors also state that generating oligodendrocytes from astrocytes could be relevant for oligodendrocyte regeneration and myelin repair, but they don't demonstrate that lineage conversion can be induced under pathological conditions, particularly after white matter demyelination. Specific issues are outlined below.

      We thank the reviewer for this summary. We agree that there are a handful of reports of astrocyte-like cell to OLC conversion [10], [11]. However, our study is the first to confirm bona fide astrocyte to OLC conversion, which is important given the recent controversy in the field of in vivo astrocyte to neuron reprogramming [12]. In addition, the extensive characterization of the molecular timeline of reprogramming highlights that although conversion of astrocytes is possible by ectopic expression of any of the three factors, the subtypes of astrocytes converted and maturity of OLCs produced may vary depending on the choice of TF delivered. Our findings will inform future in vivo studies of iOLC generation that aim to understand the impact of brain region, age, pathology, and sex, which are especially important given the diversity of astrocyte responses to disease [13], [14], [15].

      (1) The authors perform an extensive characterization of Sox10-mediated DLR by scRNAseq and demonstrate a clear trajectory of lineage conversion from astrocytes to terminally differentiated MBP+ iOLCs. A similar type of analysis should be performed after Olig2 transduction, to determine whether the transcriptomics of Olig2 conversion overlaps with any phase of Sox10 conversion.

      We thank the Reviewer for this excellent comment. We chose to include an in-depth analysis of Sox10 in the manuscript, as Sox10-transduced cultures showed a higher percentage of mature iOLCs compared to Olig2 in our studies. We have added this specific rationale to the manuscript (Line 329-330, Page 9).

      Nonetheless, we also agree that understanding the underpinnings of Olig2-mediated conversion is important. Therefore, we used Cell Oracle [16] to understand the regulation of cell identity by Olig2. In silico overexpression of Olig2 in our control time course dataset (D0, D3, D8 and D14) showed cell movement from cluster 1, characterized by astrocyte genes [Mmd2[17], Entpd2[18], H2-D1[19]], towards cluster 5, characterized by OPC genes [Pdgfra[20], Myt1[21]], validating astrocyte to OLC conversion by Olig2 (Author response image 4).

      We hypothesize that reprogramming via Sox10 and Olig2 take different conversion paths to oligodendrocytes for the following reasons. 

      (1) Differential astrocyte gene expression at D14 when cells are exposed to Sox10 and Olig2 (Manuscript Figure 1D-E [Sox10 characterized by Lcn2[19], C3[19]; Olig2 characterized by Slc6a11[22], Slc1a2[23]]).

      (2) Differential expression of key OLC gene regulatory network genes at D14 between cells treated with Sox10 and Olig2 (Author response image 1). 

      Author response image 4.

      In silico modeling of Olig2 reprogramming. (A) UMAP clustering of Cre control treated cells from 0, 3, 8, and 14 days post transduction (DPT). (B) UMAP clustering from (A) overlaid with timepoint and treatment group. (C) Cell Oracle modeling of predicted cell trajectories following Olig2 knock in (KI), overlaid onto UMAP plot. Arrows indicate cell movement prediction with Olig2 KI perturbation.

      (2) A complete immunohistochemical characterization of the cultures should be performed at different time points after Sox10 and Olig2 transduction to confirm OL lineage cell phenotypes. 

      We performed a complete immunohistochemical characterization of Ai14 cultures transduced with GFAP::Sox10-Cre and GFAP::Olig2-Cre. This system allows permanent labelling and therefore enabled the tracking of transduced cells through the process of DLR, which we believe is the most appropriate way to characterize iOLC conversion efficiencies. We then confirmed the conversion of Aldh1l1+ astrocytes in Aldh1l1-CreERT2;Ai14 cultures transduced with GFAP::Sox10-zsGreen and GFAP::Olig2-zsGreen. In this system, GFAP drives the expression of zsGreen and therefore may not faithfully track all cells, which could lead to an underestimate of the number of converted cells (for example, by missing iOLCs from Aldh1l1<sup>neg</sup> astrocytes or iOLCs that have lost zsGreen expression following conversion). Therefore, we use this system only to confirm astrocyte origin.

      Nonetheless, we appreciate this comment and recognize that there may be differences in conversion efficiencies when analyzing Aldh1l1+ astrocytes versus all transduced cells. Therefore, we have softened the language in the manuscript (see below) regarding Olig2 and Sox10 generating different OLC phenotypes and now claim iOLC generation from both Sox10 and Olig2. We thank the Reviewer for this comment, and believe it has strengthened the discussion. 

      Line 240, Page 7

      Line 261-263, Page 8

      Line 304-307, Page 8/9

      Line 413-414, Page 11

      References

      (1) E. Sock and M. Wegner, “Using the lineage determinants Olig2 and Sox10 to explore transcriptional regulation of oligodendrocyte development,” Dev Neurobiol, vol. 81, no. 7, pp. 892–901, Oct. 2021, doi: 10.1002/dneu.22849.

      (2) B. A. Barres, M. D. Jacobson, R. Schmid, M. Sendtner, and M. C. Raff, “Does oligodendrocyte survival depend on axons?,” Current Biology, vol. 3, no. 8, pp. 489–497, Aug. 1993, doi: 10.1016/0960-9822(93)90039-Q.

      (3) A.-N. Cho et al., “Aligned Brain Extracellular Matrix Promotes Differentiation and Myelination of Human-Induced Pluripotent Stem Cell-Derived Oligodendrocytes,” ACS Appl. Mater. Interfaces, vol. 11, no. 17, pp. 15344–15353, May 2019, doi: 10.1021/acsami.9b03242.

      (4) E. G. Hughes and M. E. Stockton, “Premyelinating Oligodendrocytes: Mechanisms Underlying Cell Survival and Integration,” Front. Cell Dev. Biol., vol. 9, Jul. 2021, doi: 10.3389/fcell.2021.714169.

      (5) M. Ehrlich et al., “Rapid and efficient generation of oligodendrocytes from human induced pluripotent stem cells using transcription factors,” Proc Natl Acad Sci U S A, vol. 114, no. 11, pp. E2243–E2252, Mar. 2017, doi: 10.1073/pnas.1614412114.

      (6) Y. Liu, P. Jiang, and W. Deng, “OLIG gene targeting in human pluripotent stem cells for motor neuron and oligodendrocyte differentiation,” Nat Protoc, vol. 6, no. 5, pp. 640–655, May 2011, doi: 10.1038/nprot.2011.310.

      (7) S. A. Goldman and N. J. Kuypers, “How to make an oligodendrocyte,” Development, vol. 142, no. 23, pp. 3983–3995, Dec. 2015, doi: 10.1242/dev.126409.

      (8) M. Faiz, N. Sachewsky, S. Gascón, K. W. A. Bang, C. M. Morshead, and A. Nagy, “Adult Neural Stem Cells from the Subventricular Zone Give Rise to Reactive Astrocytes in the Cortex after Stroke,” Cell Stem Cell, vol. 17, no. 5, pp. 624–634, Nov. 2015, doi: 10.1016/j.stem.2015.08.002.

      (9) F. J. Najm et al., “Transcription factor–mediated reprogramming of fibroblasts to expandable, myelinogenic oligodendrocyte progenitor cells,” Nat Biotechnol, vol. 31, no. 5, pp. 426–433, May 2013, doi: 10.1038/nbt.2561.

      (10) A. Mokhtarzadeh Khanghahi, L. Satarian, W. Deng, H. Baharvand, and M. Javan, “In vivo conversion of astrocytes into oligodendrocyte lineage cells with transcription factor Sox10; Promise for myelin repair in multiple sclerosis,” PLoS One, vol. 13, no. 9, p. e0203785, Sep. 2018, doi: 10.1371/journal.pone.0203785.

      (11) S. Farhangi, S. Dehghan, M. Totonchi, and M. Javan, “In vivo conversion of astrocytes to oligodendrocyte lineage cells in adult mice demyelinated brains by Sox2,” Mult Scler Relat Disord, vol. 28, pp. 263–272, Feb. 2019, doi: 10.1016/j.msard.2018.12.041.

      (12) L.-L. Wang, C. Serrano, X. Zhong, S. Ma, Y. Zou, and C.-L. Zhang, “Revisiting astrocyte to neuron conversion with lineage tracing in vivo,” Cell, vol. 184, no. 21, pp. 5465-5481.e16, Oct. 2021, doi: 10.1016/j.cell.2021.09.005.

      (13) I. Matias, J. Morgado, and F. C. A. Gomes, “Astrocyte Heterogeneity: Impact to Brain Aging and Disease,” Front. Aging Neurosci., vol. 11, Mar. 2019, doi: 10.3389/fnagi.2019.00059.

      (14) N. Habib et al., “Disease-associated astrocytes in Alzheimer’s disease and aging,” Nat Neurosci, vol. 23, no. 6, pp. 701–706, Jun. 2020, doi: 10.1038/s41593-020-0624-8.

      (15) M. A. Wheeler et al., “MAFG-driven astrocytes promote CNS inflammation,” Nature, vol. 578, no. 7796, pp. 593–599, Feb. 2020, doi: 10.1038/s41586-020-1999-0.

      (16) K. Kamimoto, B. Stringa, C. M. Hoffmann, K. Jindal, L. Solnica-Krezel, and S. A. Morris, “Dissecting cell identity via network inference and in silico gene perturbation,” Nature, vol. 614, no. 7949, pp. 742–751, Feb. 2023, doi: 10.1038/s41586-022-05688-9.

      (17) P. Kang et al., “Sox9 and NFIA coordinate a transcriptional regulatory cascade during the initiation of gliogenesis,” Neuron, vol. 74, no. 1, pp. 79–94, Apr. 2012, doi: 10.1016/j.neuron.2012.01.024.

      (18) K. Saito et al., “Microglia sense astrocyte dysfunction and prevent disease progression in an Alexander disease model,” Brain, vol. 147, no. 2, pp. 698–716, Nov. 2023, doi: 10.1093/brain/awad358.

      (19) S. A. Liddelow et al., “Neurotoxic reactive astrocytes are induced by activated microglia,” Nature, vol. 541, no. 7638, pp. 481–487, Jan. 2017, doi: 10.1038/nature21029.

      (20) Q. Zhu et al., “Genetic evidence that Nkx2.2 and Pdgfra are major determinants of the timing of oligodendrocyte differentiation in the developing CNS,” Development, vol. 141, no. 3, pp. 548–555, Feb. 2014, doi: 10.1242/dev.095323.

      (21) J. A. Nielsen, J. A. Berndt, L. D. Hudson, and R. C. Armstrong, “Myelin transcription factor 1 (Myt1) modulates the proliferation and differentiation of oligodendrocyte lineage cells,” Mol Cell Neurosci, vol. 25, no. 1, pp. 111–123, Jan. 2004, doi: 10.1016/j.mcn.2003.10.001.

      (22) J. Liu, X. Feng, Y. Wang, X. Xia, and J. C. Zheng, “Astrocytes: GABAceptive and GABAergic Cells in the Brain,” Front. Cell. Neurosci., vol. 16, Jun. 2022, doi: 10.3389/fncel.2022.892497.

      (23) A. Sharma et al., “Divergent roles of astrocytic versus neuronal EAAT2 deficiency on cognition and overlap with aging and Alzheimer’s molecular signatures,” Proceedings of the National Academy of Sciences, vol. 116, no. 43, pp. 21800–21811, Oct. 2019, doi: 10.1073/pnas.1903566116.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) How is this simplified model representative of what is observed biologically? A bump model does not naturally produce oscillations. How would the dynamics of a rhythm generator interact with this simplistic model?

      Bump models naturally produce sequential activity, and can be engineered to repeat this sequential activity periodically (Zhang, 1996; Samsonovich and McNaughton, 1997; Murray and Escola, 2017). This is the basis for the oscillatory behavior in the model presented here. As we describe in our paper, such a model is consistent with numerous neurobiological observations about cell-type-specific connectivity patterns. The reviewer is, however, correct to point out that our model does not incorporate other key neurobiological features--in particular, intracellular dynamical properties--that have been shown to play important roles in rhythm generation. Our aim in this work is to establish a circuit-level mechanism for rhythm generation, complementary to classical models that rely on intracellular dynamics for rhythm generation. Whether and how these mechanisms work together is something that we plan to explore in future work, and we have added a sentence to the Discussion to this effect.

      (2) Would this theoretical construct survive being expressed in a biophysical model? It seems that it should, but even a simple biological model with the basic patterns of connectivity shown here would greatly increase confidence in the biological plausibility of the theory.

      We thank the reviewer for pointing out this way to strengthen our paper. We implemented the connectivity developed in the rate models in a spiking neuron model that used EI-balanced Poisson noise as input drive. We found that we could reproduce all the main results of our analysis. In particular, with a realistic number of neurons, we observed swimming activity characterized by (i) left-right alternation, (ii) rostral-caudal propagation, and (iii) variable speed control with constant phase lag. The spiking model demonstrates that the connectivity-motif-based mechanisms for rhythmogenesis that we propose are robust in a biophysical setting.

      We included these results in the updated manuscript in a new Results subsection titled “Robustness in a biophysical model.”

      (3) How stable is this model in its output patterns? Is it robust to noise? Does noise, in fact, smooth out the abrupt transitions in frequency in the middle range?

      The newly added spiking model implementation of the network demonstrates that the core mechanisms of our models are robust to noise, since the connectivity is randomly chosen and the input drive is Poisson noise.

      To test the effect of noise as it is parametrically varied, we also added noise directly to the rate models in the form of white noise input to each unit. Namely, the rate model was adapted to obey the stochastic differential equation

      \[
      \tau_i \frac{dr_i(t)}{dt} = -r_i(t) + \left[ \sum_j W_{ij} r_j(t - \Delta_{ij}) + D_i + \sigma\xi_t \right]_+
      \]

      Here $\xi_t$ is standard Gaussian white noise and $\sigma$ sets the strength of the noise. We found that the swimming patterns were robust at all frequencies up to $\sigma = 0.05$. Above this level, coherent oscillations started to break down for some swim frequencies. To investigate whether the noise smoothed out abrupt transitions, we swept through different values of noise and modularity of excitatory connections. The results showed only a very minor improvement in controllability (see figure below), which was not substantial enough to include in the manuscript.

      Author response image 1.
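
      As a rough illustration of the noise test described above, the stochastic rate equation can be integrated with a simple Euler scheme. The sketch below is not the authors' code: the function name, the single uniform delay (the equation allows pair-specific delays $\Delta_{ij}$), the step size, and the choice to draw one standard normal sample per unit per time step are all assumptions made for illustration.

```python
import random


def simulate_noisy_rates(W, D, tau, sigma, delay_steps=0, dt=0.001,
                         n_steps=1000, seed=0):
    """Euler integration of
    tau_i dr_i/dt = -r_i + [sum_j W_ij r_j(t - delay) + D_i + sigma * xi]_+

    Assumptions (not from the source): one uniform delay for all pairs,
    and an independent N(0, 1) noise draw per unit per time step.
    """
    rng = random.Random(seed)
    n = len(D)
    history = [[0.0] * n]  # history[k] is the rate vector at step k
    for _ in range(n_steps):
        r_now = history[-1]
        # Delayed rates; fall back to the initial state early in the run.
        r_delayed = history[max(0, len(history) - 1 - delay_steps)]
        r_next = []
        for i in range(n):
            recurrent = sum(W[i][j] * r_delayed[j] for j in range(n))
            noise = sigma * rng.gauss(0.0, 1.0)
            # Threshold-linear rectification [.]_+ of the total input.
            total_input = max(0.0, recurrent + D[i] + noise)
            r_next.append(r_now[i] + dt / tau[i] * (-r_now[i] + total_input))
        history.append(r_next)
    return history


if __name__ == "__main__":
    # Two mutually inhibiting units with tonic drive (illustrative values).
    W = [[0.0, -1.0], [-1.0, 0.0]]
    rates = simulate_noisy_rates(W, D=[1.0, 0.5], tau=[0.01, 0.01],
                                 sigma=0.05, delay_steps=5)
    print(rates[-1])
```

      Sweeping `sigma` in such a loop and checking at which value the oscillation loses coherence would mirror the kind of robustness test reported here, although the network in this toy example is far simpler than the models in the manuscript.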

      (4) All figure captions are inadequate. They should have enough information for the reader to understand the figure and the point that was meant to be conveyed. For example, Figure 1 does not explain what the red dot is, what is black, what is white, or what the gradations of gray are. Or even if this is a representative connectivity of one node, or if this shows all the connections? The authors should not leave the reader guessing.

      All figure captions have been updated to enhance clarity and address these concerns.

      Reviewer #2 (Public review):

      (1) Figure 1A, if I interpret Figure 1B correctly, should there not be long descending projections as well that don't seem to be illustrated?

      Thank you for highlighting this potential point of confusion. The diagram in question was only intended to be a rough schematic of the types of connections present in the model. We have added additional descending connections as requested.

      (2) Page 5, It would be good to define what is meant by slow and fast here, as this definition changes with age in zebrafish (what developmental age)?

      We have updated the manuscript to include the sentence: “These values were chosen to coincide with observed ranges from larval zebrafish.” with appropriate citation.

      Reviewer #3 (Public review):

      (1) The authors describe a single unit as a neuron, be it excitatory or inhibitory, and the output of the simulation is the firing rate of these neurons. Experimentally and in other modeling studies, motor neurons are incorporated in the model, and the output of the network is based on motor neuron firing rate, not the interneurons themselves. Why did the authors choose to build the model this way?

      We chose to leave out the motor neurons from our models for a few reasons. While motor neurons read out the rhythmic activity generated by the interneurons and may provide some feedback, they are not required for rhythmogenesis. In fact, interneuron activity (especially in the excitatory V2a neurons (Agha et al., 2024)) is highly correlated with the ventral root bursts within the same segment. This suggests that motor neurons are primarily a local readout of the rhythmic activity of interneurons; therefore, the rhythmic swimming activity can be deduced directly from the interneurons themselves.

      Moreover, there is a lack of experimental observation of the connectivity between all the cell types considered in our model and motor neurons. Hence, it was unclear how we should include them in the model. To address this, we are currently developing a data-driven approach that will determine the proper connectivity between the motor neurons and the interneurons, including intrasegmental connections.

      (2) In the single population model (Figure 1), the authors use ipsilateral inhibitory connections that are long-range in an ascending direction. Experimentally, these connections have been shown to be local, while long-range ipsilateral connections have been shown to be descending. What were the reasons the authors chose this connectivity? Do the authors think local ascending inhibitions contribute to rostrocaudal propagation, and how?

      The long-range ascending ipsilateral inhibitory connections arise from a limitation of our modeling framework. The V1 neurons that provide these connections have been shown experimentally to fire later than other neurons (especially descending V2a neurons) within the same hemisegment (Jay et al., J Neurosci, 2023); however, our model can only produce synchronized local activity. Hence, we replace local phase offsets with spatial offsets to produce correctly structured recurrent phasic inputs. We are currently investigating a data-driven method for determining intrasegmental connectivity, which should be able to produce the local phase offset and address this concern; however, this is beyond the scope of the current paper.

      (3) In the two-population model, the authors show independent control of frequency and rhythm, as has been reported experimentally. However, in these previous experimental studies, frequency and amplitude are regulated by different neurons, suggesting different networks dedicated to frequency and amplitude control. However, in the current model, the same population with the same connections can contribute to frequency or amplitude depending on relative tonic drive. Can the authors please address these differences either by changes in the model or by adding to the Discussion?

      Our prior experimental results that suggested a separation of frequency and amplitude control circuits focused on motor neuron recruitment rather than interneuron activity (Jay et al., J Neurosci 2023; Menelaou and McLean, Nat Commun 2019). To avoid potential confusion about amplitudes of interneurons vs. of motor neurons, we have removed the results from Figure 3 about control of amplitude in the 2-population model, instead focusing this figure on the control of frequency via speed-module recruitment. For the same reason, we have removed the panel showing the effects of targeted ablations on interneuron amplitudes in Figure 7. We have kept the result about amplitude control in our Supplemental Figure S2 for the 8-population model, but we try to make it clear in the text that any relationship between interneuron amplitude and motor neuron amplitude would depend on how motor neurons are modeled, which we do not pursue in this work.

      (4) It would be helpful to add a paragraph in the Discussion on how these results could be applicable to other model systems beyond zebrafish. Cell intrinsic rhythmogenesis is a popular concept in the field, and these results show an interesting and novel alternative. It would help to know if there is any experimental evidence suggesting such network-based propagation in other systems, invertebrates, or vertebrates.

      We have expanded a paragraph in the Discussion to address these questions. In particular, we highlight how a recent study of mouse locomotor circuits produced a model with similar key features (Komi et al., 2024). These authors made direct use of experimentally determined connectivity structure and cell-type distributions, which informed a model that produced purely network-based rhythmogenesis. We also point out that inhibition-dominated connectivity has been used for understanding oscillatory behavior in neural circuits outside the context of motor control (Zhang, 1996; Samsonovich and McNaughton, 1997; Murray and Escola, 2017). Finally, we address a study that used the cell-type-specific connectivity within the C. elegans locomotor circuit as the architecture for an artificial motor control system and found that the resulting system could learn motor control tasks more efficiently than general machine learning architectures (Bhattasali et al., 2022). Like our model, the Komi et al. and Bhattasali et al. models generate rhythm via structured connectivity motifs rather than via intracellular dynamical properties, suggesting that these may be a key mechanism underlying locomotion across species.

      Reviewer #1 (Recommendations for the authors):

      (1) Express this modeling construct in a simple biophysical model.

      See the new Results subsection titled “Robustness in a biophysical model.”

      (2) Please cite the classic models of Kopell, Ermentrout, Williams, Sigvardt etc., especially where you say "classic models".

      We have added relevant citations including the mentioned authors.

      (3) "Rhythmogenesis remain incompletely understood" changed to "Rhythmogenesis remains incompletely understood".

      We chose not to make this change since the ‘remain’ refers to the plural ‘core mechanisms’ not the singular ‘rhythmogenesis’.

      Reviewer #3 (Recommendations for the authors):

      (1) The figures are well made; however, it would help to add more details to the figure legends. For example, what neuron's firing rate is shown in Figure 1C? What is the red dot in 1B? Figures 3E,F,G: what is being plotted? Mean and SD? Blue dot in Figure 5C?

      All figure captions have been updated to enhance clarity and address these concerns.

      (2) A, B text missing in Figure 7.

      We have revised this figure and its caption; please see our response to Comment 3 above.

      (3) It would be nice to see the tonic drive pattern that is fed to the model for each case, along with the different firing rates in the figures. It would help understand how the tonic drive is changed to rhythmic activity.

      The tonic drive in the rate models is implemented as a constant excitatory input that is uniform across all units within the same speed-population. There is no patterning in time or location to this drive.

      References

      (1) Moneeza A Agha, Sandeep Kishore, and David L McLean. Cell-type-specific origins of locomotor rhythmicity at different speeds in larval zebrafish. eLife, July 2024.

      (2) Nikhil Bhattasali, Anthony M Zador, and Tatiana Engel. Neural circuit architectural priors for embodied control. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 12744–12759. Curran Associates, Inc., 2022.

      (3) Salif Komi, August Winther, Grace A. Houser, Roar Jakob Sørensen, Silas Dalum Larsen, Madelaine C. Adamssom Bonfils, Guanghui Li, and Rune W. Berg. Spatial and network principles behind neural generation of locomotion. bioRxiv, 2024.

      (4) James M Murray and G Sean Escola. Learning multiple variable-speed sequences in striatum via cortical tutoring. eLife, 6:e26084, May 2017.

      (5) Alexei Samsonovich and Bruce L McNaughton. Path integration and cognitive mapping in a continuous attractor neural network model. Journal of Neuroscience, 17(15):5900–5920, 1997.

      (6) K Zhang. Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: a theory. Journal of Neuroscience, 16(6):2112–2126, 1996.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      We thank the Reviewers for their thorough attention to our paper and the interesting discussion about the findings. Before responding to more specific comments, here are some general points we would like to clarify:

      (1) Ecological niche models are indeed correlative models, and we used them to highlight environmental factors associated with HPAI outbreaks within two host groups. We will further revise the terminology that could still unintentionally suggest causal inference. The few remaining ambiguities were mainly in the Discussion section, where our intent was to interpret the results in light of the broader scientific literature. Particularly, we will change the following expressions:

      -  “Which factors can explain…” to “Which factors are associated with…” (line 75);

      -  “the environmental and anthropogenic factors influencing” to “the environmental and anthropogenic factors that are correlated with” (line 273);

      -  “underscoring the influence” to “underscoring the strong association” (line 282).

      (2) We respectfully disagree with the suggestion that an ecological niche modelling (ENM) approach is not appropriate for this work and the research question addressed therein. Ecological niche models are specifically designed to estimate the spatial distribution of the environmental suitability of species and pathogens, making them well suited to our research questions. In our study, we have also explicitly detailed the known limitations of ecological niche models in the Discussion section, in line with prior literature, to ensure their appropriate interpretation in the context of HPAI.

      (3) The environmental layers used in our models were restricted to those available at a global scale, as listed in Supplementary Information Resources S1 (https://github.com/sdellicour/h5nx_risk_mapping/blob/master/Scripts_%26_data/SI_Resource_S1.xlsx). Naturally, not all potentially relevant environmental factors could be included, but the selected layers are explicitly documented and only these were assessed for their importance. Despite this limitation, the performance metrics indicate that the models performed well, suggesting that the chosen covariates capture meaningful associations with HPAI occurrence at a global scale.

      Reviewer #1 (Public review):

      The authors aim to predict ecological suitability for transmission of highly pathogenic avian influenza (HPAI) using ecological niche models. This class of models identifies correlations between the locations of species or disease detections and the environment. These correlations are then used to predict habitat suitability (in this work, ecological suitability for disease transmission) in locations where surveillance of the species or disease has not been conducted. The authors fit separate models for HPAI detections in wild birds and farmed birds, for two strains of HPAI (H5N1 and H5Nx) and for two time periods, pre- and post-2020. The authors also validate models fitted to disease occurrence data from pre-2020 using post-2020 occurrence data. I thank the authors for taking the time to respond to my initial review and I provide some follow-up below.

      Detailed comments:

      In my review, I asked the authors to clarify the meaning of "spillover" within the HPAI transmission cycle. This term is still not entirely clear: at lines 409-410, the authors use the term with reference to transmission between wild birds and farmed birds, as distinct from transmission between farmed birds. It is implied but not explicitly stated that "spillover" is relevant to the transmission cycle in farmed birds only. The sentence, "we developed separate ecological niche models for wild and domestic bird HPAI occurrences ..." could have been supported by a clear sentence describing the transmission cycle, to prime the reader for why two separate models were necessary.

      We respectfully disagree that the term “spillover” is unclear in the manuscript. In both the Methods and Discussion sections (lines 387-391 and 409-414), we explicitly define “spillover” as the introduction of HPAI viruses from wild birds into domestic poultry, and we distinguish this from secondary farm-to-farm transmission. Our use of separate ecological niche models for wild and domestic outbreaks reflects not only the distinction between primary spillover and secondary transmission, but also the fundamentally different ecological processes, surveillance systems, and management implications that shape outbreaks in these two groups. We will clarify this choice in the revised manuscript when introducing the separate models. Furthermore, on line 83, we will add “as these two groups are influenced by different ecological processes, surveillance biases, and management contexts”.

      I also queried the importance of (dead-end) mammalian infections to a model of the HPAI transmission risk, to which the authors responded: "While spillover events of HPAI into mammals have been documented, these detections are generally considered dead-end infections and do not currently represent sustained transmission chains. As such, they fall outside the scope of our study, which focuses on avian hosts and models ecological suitability for outbreaks in wild and domestic birds." I would argue that any infections, whether they are in dead-end or competent hosts, represent the presence of environmental conditions to support transmission so are certainly relevant to a niche model and therefore within scope. It is certainly understandable if the authors have not been able to access data of mammalian infections, but it is an oversight to dismiss these infections as irrelevant.

      We understand the Reviewer’s point, but our study was designed to model HPAI occurrence in avian hosts only. We therefore restricted our analysis to wild birds and domestic poultry, which represent the primary hosts for HPAI circulation and the focus of surveillance and control measures. While mammalian detections have been reported, they are outside the scope of this work.

      Correlative ecological niche models, including BRTs, learn relationships between occurrence data and covariate data to make predictions, irrespective of correlations between covariates. I am not convinced that the authors can make any "interpretation" (line 298) that the covariates that are most informative to their models have any "influence" (line 282) on their response variable. Indeed, the observation that "land-use and climatic predictors do not play an important role in the niche ecological models" (line 286), while "intensive chicken population density emerges as a significant predictor" (line 282) begs the question: from an operational perspective, is the best (e.g., most interpretable and quickest to generate) model of HPAI risk a map of poultry farming intensity?

      We agree that poultry density may partly reflect reporting bias, but we also consider it a meaningful predictor of HPAI risk. Its importance in our models is therefore expected. Importantly, our BRT framework does more than reproduce poultry distribution: it captures non-linear relationships and interactions with other covariates, allowing a more nuanced characterisation of risk than a simple poultry density map. Note also that our models distinguish between intensive chicken density, extensive chicken density, and duck density. The result is therefore not simply a “map of poultry farming intensity”.

      At line 282, we used the word “influence” while fully recognising that correlative models cannot establish causality. Indeed, in our analyses, “relative influence” refers to the importance metric produced by the BRT algorithm (Ridgeway, 2020), which measures correlative associations between environmental factors and outbreak occurrences. These scores are interpreted in light of the broader scientific literature; our interpretations therefore build on both our results and existing evidence, rather than on our models alone. However, in the next version of the paper, we will revise the sentence as: “underscoring the strong association of poultry farming practices with HPAI spread (Dhingra et al., 2016)”.

      I have more significant concerns about the authors' treatment of sampling bias: "We agree with the Reviewer's comment that poultry density could have potentially been considered to guide the sampling effort of the pseudo-absences to consider when training domestic bird models. We however prefer to keep using a human population density layer as a proxy for surveillance bias to define the relative probability to sample pseudo-absence points in the different pixels of the background area considered when training our ecological niche models. Indeed, given that poultry density is precisely one of the predictors that we aim to test, considering this environmental layer for defining the relative probability to sample pseudo-absences would introduce a certain level of circularity in our analytical procedure, e.g. by artificially increasing the influence of that particular variable in our models." The authors have elected to ignore a fundamental feature of distribution modelling with occurrence-only data: if we include a source of sampling bias as a covariate and do not include it when we sample background data, then that covariate would appear to be correlated with presence. They acknowledge this later in their response to my review: "...assuming a sampling bias correlated with poultry density would result in reducing its effect as a risk factor." In other words, the apparent predictive capacity of poultry density is a function of how the authors have constructed the sampling bias for their models. A reader of the manuscript can reasonably ask the question: to what degree is the model a model of HPAI transmission risk, and to what degree is the model a model of the observation process?
The sentence at lines 474-477 is a helpful addition, however the preceding sentence, "Another approach to sampling pseudo-absences would have been to distribute them according to the density of domestic poultry," (line 474) is included without acknowledgement of the flow-on consequence to one of the key findings of the manuscript, that "...intensive chicken population density emerges as a significant predictor..." (line 282). The additional context on the EMPRES-i dataset at line 475-476 ("the locations of outbreaks ... are often georeferenced using place name nomenclatures") is in conflict with the description of the dataset at line 407 ("precise location coordinates"). Ultimately, the choices that the authors have made are entirely defensible through a clear, concise description of model features and assumptions, and precise language to guide the reader through interpretation of results. I am not satisfied that this is provided in the revised manuscript.

      We thank the Reviewer for this important point. To address it, we compared model predictive performance and covariate relative influences obtained when pseudo-absences were weighted by poultry density versus human population density (Author response table 1). The results show that differences between the two approaches are marginal, both in predictive performance (ΔAUC ranging from -0.013 to +0.002) and in the ranking of key predictors (see below Author response images 1 and 2). For instance, intensive chicken density consistently emerged as an important predictor regardless of the bias layer used.

      Note: the comparison was conducted using a simplified BRT configuration for computational efficiency (fewer trees, fixed 5-fold random cross-validation, and standardised parameters). Therefore, absolute values of AUC and variable importance may differ slightly from those in the manuscript, but the relative ranking of predictors and the overall conclusions remain consistent.

      Given these small differences, we retained the approach using human population density. We agree that poultry density partly reflects surveillance bias as well as true epidemiological risk, and we will clarify this in the revised manuscript by noting that the predictive role of poultry density reflects both biological processes and surveillance systems. Furthermore, on line 289, we will add “We note, however, that intensive poultry density may reflect both surveillance intensity and epidemiological risk, and its predictive role in our models should be interpreted in light of both processes”.

      Author response table 1.

      Comparison of model predictive performances (AUC) between models in which pseudo-absences were weighted by poultry density and models in which they were weighted by human population density, across host groups, virus types, and time periods. Differences in AUC values are shown as the value for poultry-weighted minus human-weighted pseudo-absences.

      Author response image 1.

      Comparison of variable relative influence (%) between models trained with pseudo-absences weighted by poultry density (red) and human population density (blue) for domestic bird outbreaks. Results are shown for four datasets: H5N1 (<2020), H5N1 (>2020), H5Nx (<2020), and H5Nx (>2020).

      Author response image 2.

      Comparison of variable relative influence (%) between models trained with pseudo-absences weighted by poultry density (red) and human population density (blue) for wild bird outbreaks. Results are shown for three datasets: H5N1 (>2020), H5Nx (<2020), and H5Nx (>2020).

      The authors have slightly misunderstood my comment on "extrapolation": I referred to "environmental extrapolation" in my review without being particularly explicit about my meaning. By "environmental extrapolation", I meant to ask whether the models were predicting to environments that are outside the extent of environments included in the occurrence data used in the manuscript. The authors appear to have understood this to be a comment on geographic extrapolation, or predicting to areas outside the geographic extent included in occurrence data, e.g.: "For H5Nx post-2020, areas of high predicted ecological suitability, such as Brazil, Bolivia, the Caribbean islands, and Jilin province in China, likely result from extrapolations, as these regions reported few or no outbreaks in the training data" (lines 195-197). Is the model extrapolating in environmental space in these regions? This is unclear. I do not suggest that the authors should carry out further analysis, but the multivariate environmental similarity surface (MESS; see Elith et al., 2010) is a useful tool to visualise environmental extrapolation and aid model interpretation.

      On the subject of "extrapolation", I am also concerned by the additions at lines 362-370: "...our models extrapolate environmental suitability for H5Nx in wild birds in areas where few or no outbreaks have been reported. This discrepancy may be explained by limited surveillance or underreporting in those regions." The "discrepancy" cited here is a feature of the input dataset, a function of the observation distribution that should be captured in pseudo-absence data. The authors state that Kazakhstan and Central Asia are areas of interest, and that the environments in this region are outside the extent of environments captured in the occurrence dataset, although it is unclear whether "extrapolation" is informed by a quantitative tool like a MESS or judged by some other qualitative test. The authors then cite Australia as an example of a region with some predicted suitability but no HPAI outbreaks to date, however this discussion point is not linked to the idea that the presence of environmental conditions to support transmission need not imply the occurrence of transmission (as in the addition, "...spatial isolation may imply a lower risk of actual occurrences..." at line 214). Ultimately, the authors have not added any clear comment on model uncertainty (e.g., variation between replicated BRTs) as I suggested might be helpful to support their description of model predictions.

      Many thanks for the clarification. Indeed, we interpreted your previous comments in terms of geographic extrapolations. We thank the Reviewer for these observations. We will adjust the wording to further clarify that predictions of ecological suitability in areas with few or no reported outbreaks (e.g., Central Asia, Australia) are not model errors but expected extrapolations, since ecological suitability does not imply confirmed transmission (for instance, on Line 362: “our models extrapolate environmental suitability” will be changed to “Interestingly, our models extrapolate geographical”). These predictions indicate potential environments favorable to circulation if the virus were introduced.

      In our study, model uncertainty is formally assessed when comparing the predictive performances of our models (Fig. S3, Table S1), the relative influence (Table S3) and response curves (Fig. 2) associated with each environmental factor (Table S2). All of these results confirm good convergence between replicates. Finally, we indeed did not use a quantitative tool such as a MESS to assess extrapolation but relied instead on qualitative interpretation of model outputs.

      All of my criticisms are, of course, applied with the understanding that niche modelling is imperfect for a disease like HPAI, and that data may be biased/incomplete, etc.: these caveats are common across the niche modelling literature. However, if language around the transmission cycle, the niche, and the interpretation of any of the models is imprecise, which I find it to be in the revised manuscript, it undermines all of the science that is presented in this work.

      We respectfully disagree with this comment. The scope of our study and the methods employed are clearly defined in the manuscript, and the limitations of ecological niche modelling in this context are explicitly acknowledged in the Discussion section. While we appreciate the Reviewer’s concern, the comment does not provide specific examples of unclear or imprecise language regarding the transmission cycle, niche, or interpretation of the models. Without such examples, it is difficult to identify further revisions that would improve clarity.

      Reviewer #2 (Public review):

      The geographic range of highly pathogenic avian influenza cases changed substantially around the period 2020, and there is much interest in understanding why. Since 2020 the pathogen irrupted in the Americas and the distribution in Asia changed dramatically. This study aimed to determine which spatial factors (environmental, agronomic and socio-economic) explain the change in numbers and locations of cases reported since 2020 (2020--2023). That's a causal question which they address by applying correlative environmental niche modelling (ENM) approach to the avian influenza case data before (2015--2020) and after 2020 (2020--2023) and separately for confirmed cases in wild and domestic birds. To address their questions they compare the outputs of the respective models, and those of the first global model of the HPAI niche published by Dhingra et al 2016.

      We do not agree with this comment. The manuscript makes clear that we are quantitatively assessing factors that are associated with occurrence data before and after 2020. We do not claim to determine causality. One sentence of the Introduction section (lines 75-76) could be confusing, so we intend to modify it in the final revision of our manuscript.

      ENM is a correlative approach useful for extrapolating understandings based on sparse geographically referenced observational data over un- or under-sampled areas with similar environmental characteristics in the form of a continuous map. In this case, because the selected covariates about land cover, use, population and environment are broadly available over the entire world, modelled associations between the response and those covariates can be projected (predicted) back to space in the form of a continuous map of the HPAI niche for the entire world.

      We fully agree with this assessment of ENM approaches.

      Strengths:

      The authors are clear about expected bias in the detection of cases, such as geographic variation in surveillance effort (testing of symptomatic or dead wildlife, testing domestic flocks) and in general more detections near areas of higher human population density (because if a tree falls in a forest and there is no-one there, etc), and take steps to ameliorate those. The authors use boosted regression trees to implement the ENM, which typically feature among the best performing models for this application (also known as habitat suitability models). They ran replicate sets of the analysis for each of their model targets (wild/domestic x pathogen variant), which can help produce stable predictions. Their code and data are provided, though I did not verify that the work was reproducible.

      The paper can be read as a partial update to the first global model of H5Nx transmission by Dhingra and others published in 2016 and explicitly follows many methodological elements. Because they use the same covariate sets as used by Dhingra et al 2016 (including the comparisons of the performance of the sets in spatial cross-validation) and for both time periods of interest in the current work, comparison of model outputs is possible. The authors further facilitate those comparisons with clear graphics and supplementary analyses and presentation. The models can also be explored interactively at a weblink provided in text, though it would be good to see the model training data there too.

      The authors' comparison of ENM model outputs generated from the distinct HPAI case datasets is interesting and worthwhile, though for me, only as a response to differently framed research questions.

      Weaknesses:

      This well-presented and technically well-executed paper has one major weakness to my mind. I don't believe that ENM models were an appropriate tool to address their stated goal, which was to identify the factors that "explain" changing HPAI epidemiology.

      Here is how I understand and unpack that weakness:

      (1) Because of their fundamentally correlative nature, ENMs are not a strong candidate for exploring or inferring causal relationships.

      (2) Generating ENMs for a species whose distribution is undergoing broad scale range change is complicated and requires particular caution and nuance in interpretation (e.g., Elith et al., 2010). An important general assumption of environmental niche models is that the target species is at some kind of distributional equilibrium (at time scales relevant to the model application). In practice that means the species has had an opportunity to reach all suitable habitats and therefore its absence from some can be interpreted as either unfavourable environment or interactions with other species. Here data sets for the response (H5N1 or H5Nx case data in domestic or wild birds) were divided into two periods; 2015--2020, and 2020--2023 based on the rationale that the geographic locations and host-species profile of cases detected in the latter period was suggestive of changed epidemiology. In comparing outputs from multiple ENMs for the same target from distinct time periods the authors are expertly working in, or even dancing around, what is a known grey area, and they need to make the necessary assumptions and caveats obvious to readers.

      We thank the Reviewer for this observation. First, we constrained pseudo-absence sampling to countries and regions where outbreaks had been reported, reducing the risk of interpreting non-affected areas as environmentally unsuitable. Second, we deliberately split the outbreak data into two periods (2015-2020 and 2020-2023) because we do not assume a single stable equilibrium across the full study timeframe. This division reflects known epidemiological changes around 2020 and allows each period to be modeled independently. Within each period, ENM outputs are interpreted as associations between outbreaks and covariates, not as equilibrium distributions. Finally, by testing prediction across periods, we assessed both niche stability and potential niche shifts. These clarifications will be added to the manuscript to make our assumptions and limitations explicit.

      Line 66, we will add: “Ecological niche model outputs for range-shifting pathogens must therefore be interpreted with caution (Elith et al., 2010). Despite this limitation, correlative ecological niche models remain useful for identifying broad-scale associations and potential shifts in distribution. To account for this, we analysed two distinct time periods (2015-2020 and 2020-2023).”

      Line 123, we will revise “These findings underscore the ability of pre-2020 models in forecasting the recent geographic distribution of ecological suitability for H5Nx and H5N1 occurrences” to “These results suggest that pre-2020 models captured broad patterns of suitability for H5Nx and H5N1 outbreaks, while post-2020 models provided a closer fit to the more recent epidemiological situation”.

      (3) To generate global prediction maps via ENM, only variables that exist at appropriate resolution over the desired area can be supplied as covariates. What processes could influence changing epidemiology of a pathogen and are there covariates that represent them? Introduction to a new geographic area (continent) with naive population, immunity in previously exposed populations, control measures to limit spread such as vaccination or destruction of vulnerable populations or flocks? Might those control measures be more or less likely depending on the country as a function of its resources and governance? There aren't globally available datasets that speak to those factors, so the question is not why were they omitted but rather was the authors' decision to choose ENMs given their question justified? How valuable are insights based on patterns of correlation change when considering different temporal sets of HPAI cases in relation to a common and somewhat anachronistic set of covariates?

      We agree that the ecological niche models trained in our study are limited to environmental and host factors, as described in the Methods section with the selection of predictors. While such models cannot capture causality or represent processes such as immunity, control measures, or governance, they remain a useful tool for identifying broad associations between outbreak occurrence and environmental context. Our study cannot infer the full mechanisms driving changes in HPAI epidemiology, but it does provide a globally consistent framework to examine how associations with available covariates vary across time periods.

      (4) In general the study is somewhat incoherent with respect to time. Though the case data come from different time periods, each response dataset was modelled separately using exactly the same covariate dataset that predated both sets. That decision should be understood as a strong assumption on the part of the authors that conditions the interpretation: the world (as represented by the covariate set) is immutable, so the model has to return different correlative associations between the case data and the covariates to explain the new data. While the world represented by the selected covariates \*may\* be relatively stable (could be statistically confirmed), what about the world not represented by the covariates (see point 3)?

      We used the same covariate layers for both periods, which indeed assumes that these environmental and host factors are relatively stable at the global scale over the short timeframe considered. We believe this assumption is reasonable, as poultry density, land cover, and climate baselines do not change drastically between 2015 and 2023 at the resolution of our analysis. We agree, however, that unmeasured processes such as control measures, immunity, or governance may have changed during this time and are not captured by our covariates.

      Recommendations for the Authors:

      Reviewer #1 (Recommendations for the authors):

      - Line 400-401: "over the 2003-2016 periods" has an extra "s"; "two host species" (with reference to wild and domestic birds) would be more precise as "two host groups".

      - Remove comma line 404

      Many thanks for these comments, we have modified the text accordingly.

      Reviewer #2 (Recommendations for the authors):

      Most of my work this round is encapsulated in the public part of the review.

      The authors responded positively to the review efforts from the previous round, but I was underwhelmed with the changes to the text that resulted. Particularly in regard to limiting assumptions - the way that they augmented the text to refer to limitations raised in review downplayed the importance of the assumptions they've made. So they acknowledge the significance of the limitation in their rejoinder, but in the amended text merely note the limitation without giving any sense of what it means for their interpretation of the findings of this study.

      The abstract and findings are essentially unchanged from the previous draft.

      I still feel the near causal statements of interpretation about the covariates are concerning. These models really are not a good candidate for supporting the inference that they are making and there seem to be very strong arguments in favour of adding covariates that are not globally available.

      We never claimed causal interpretation, and we have consistently framed our analyses in terms of associations rather than mechanisms. We acknowledge that one phrasing in the research questions (“Which factors can explain…”) could be misinterpreted, and we are correcting this in the revised version to read “Which factors are associated with…”. Our approach follows standard ecological niche modelling practice, which identifies statistical associations between occurrence data and covariates. As noted in the Discussion section, these associations should not be interpreted as direct causal mechanisms. Finally, all interpretive points in the manuscript are supported by published literature, and we consider this framing both appropriate and consistent with best practice in ecological niche modelling (ENM) studies.

      We assessed predictor contributions using the “relative influence” metric, the terminology reported by the R package “gbm” (Ridgeway, 2020). This metric quantifies the contribution of each variable to model fit across all trees, rescaled to sum to 100%, and should be interpreted as an association rather than a causal effect.
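
      The rescaling step described above can be illustrated with a minimal Python sketch. It is not the gbm implementation: the function name `relative_influence` and the raw per-variable improvement values are hypothetical, chosen only to show how contributions are normalised so that they sum to 100%.

```python
def relative_influence(raw_improvement):
    """Rescale per-variable contributions to model fit so they sum to 100%.

    raw_improvement: dict mapping a variable name to its summed improvement
    in model fit across all trees (hypothetical values in the example below).
    """
    total = sum(raw_improvement.values())
    if total <= 0:
        raise ValueError("total improvement must be positive")
    return {name: 100.0 * value / total for name, value in raw_improvement.items()}


# Hypothetical raw contributions for three covariates (illustrative only).
example = relative_influence(
    {"intensive_chicken_density": 6.0, "human_population": 3.0, "cropland": 1.0}
)
```

      Because the values are rescaled within each model, a relative influence is only comparable between models as a share of explained fit, which is consistent with reading it as an association rather than an effect size.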

      L65-66 The general difficulty of interpreting ENM output with range-shifting species should be cited here to alert readers that they should not blithely attempt what follows at home.

      I believe that their analysis is interesting and technically very well executed, so it has been a disappointment and hard work to write this assessment. My rough-cut last paragraph of a reframed intro would go something like - there are many reasons in the literature not to do what we are about to do, but here's why we think it can be instructive and informative, within certain guardrails.

      To acknowledge this comment and the previous one, we revised lines 65-66 to: “However, recent outbreaks raise questions about whether earlier ecological niche models still accurately predict the current distribution of areas ecologically suitable for the local circulation of HPAI H5 viruses. Ecological niche model outputs for range-shifting pathogens must therefore be interpreted with caution (Elith et al., 2010). Despite this limitation, correlative ecological niche models remain useful for identifying broad-scale associations and potential shifts in distribution.”

      We respectfully disagree with the Reviewer’s statement that “there are many reasons in the literature not to do what we are about to do”. All modeling approaches, including mechanistic ones, have limitations, and the literature is clear on both the strengths and constraints of ecological niche models. Our manuscript openly acknowledges these limits and frames our findings accordingly. We therefore believe that our use of an ENM approach is justified and contributes valuable insights within these well-defined boundaries.

      Reference: Ridgeway, G. (2007). Generalized Boosted Models: A guide to the gbm package. Update, 1(1), 2007.

    1. Author response:

      We thank the reviewers and editors for their insightful comments on our manuscript. We intend to submit a revised manuscript that addresses all concerns raised by the reviewers. A major limitation identified by the reviewers was our inability to identify one or more specific mechanosensitive GPCRs in lymphatic muscle cells (LMCs). To address this concern, we plan to include several additional figures in the revised manuscript. One figure will list the 136 GPCRs identified in LMCs by our scRNAseq analysis, based on the list of validated GPCRs in https://esbl.nhlbi.nih.gov/Databases/GPCRs/index.html and olfactory GPCRs listed in https://esbl.nhlbi.nih.gov/Databases/GPCRs/MouseHumanRatORs.html. We plan to arrange the data in a hierarchical manner according to their expression level and denote their heterotrimeric GTP-binding protein alpha subunit(s), if known. To reinforce our finding that pressure-induced chronotropy in LMCs is mediated through Gq/11, we will present additional data testing the effects of acute Gq/11 inhibition with YM-254890 (a selective Gq/11 inhibitor) on the frequency-pressure relationship of popliteal vessels, as suggested by one reviewer. We will address concerns regarding the potential regional differences in lymphatic contractile regulation arising from our use of popliteal lymphatic vessels for contraction assays and expression analysis of LMCs obtained from inguinal-axillary lymphatic vessels (IALVs). To account for possible differences between the two, we will test pressure responses of IALVs from double Gq/11 knockout mice and test responses of wild-type IALVs to acute administration of YM-254890.

      Our preliminary analysis of the 136 GPCRs in LMCs revealed a shorter list of 10 GPCRs that are expressed in at least 50% of LMCs (based on the IALV scRNAseq dataset). Since existing evidence from our studies, and those of other investigators, suggests that any LMC is capable of initiating pacemaking, we consider it reasonable to impose this requirement.

      Author response table 1.

      We plan to use pharmacologic inhibitors to test as many of these candidates as possible. Unfortunately, inhibitors are not available for many of the GPCRs listed above, but we will test Npr3, Npy1R, and Ednra; a negative result for Tbxa2r has already been documented in a previous study (Schulz et al. ATVB 2025). Even if this strategy does not lead to identification of one or more specific GPCRs involved in LMC pressure transduction, it will narrow the list of possible candidates that need to be tested in future experiments.

    1. Author response:

      We thank the reviewers and editors for the careful evaluation of our manuscript. Below, we provide a first refutation of some of the concerns expressed by reviewers.

      Both reviewers 1 & 3 underscore the importance of controlling for genetic backgrounds. This is actually an issue only for a limited part of the study, and this criticism should not apply to the major findings of this study, with some exceptions, as detailed below.

      It is important to note that we ourselves identified several of the mutant lines we have been using. For instance, key and MyD88 mutant alleles have been identified in the Exelixis transposon insertion collection that we have screened in collaboration with this firm (e.g., [3, 4, 5]). This resource has been generated in an isogenized w[A5001] strain [6], which we are using as a matched control for these mutants (Figs 1B,D). Of note, while they share a common genetic background, the phenotypes of key and MyD88 are opposite in terms of sensitivity to OMV challenge. The imd<sup>shadok</sup> null allele had been identified during our chemical mutagenesis screen with EMS in a yw cn bw background [5, 7, 8, 9], which was used as a control (Fig. S1A).

      With respect to Hayan (Fig. 2C, Fig. S2C) and eater (Fig. S2A-B) mutants [10, 11, 12], we find a similarly strong phenotype with two independent mutants in distinct genetic backgrounds (actually three for Hayan, as we have not included in our original manuscript the Hayan<sup>SK3</sup> allele generated in the Lemaitre laboratory, in which OMVs also displayed impaired virulence). We have shown that the Hayan mutants do display the expected phenotype in terms of PPO cleavage (Fig. S2D). Please also note that in Fig. S2C the two mutant alleles are tested in the same experiment: even though there is some variation between the w<sup>1118</sup> and the w[A5001] strains, the two mutants behave in a remarkably similar manner. As regards the role of the cellular response, we note that we obtained results similar to those obtained with eater mutants using genetic ablation of hemocytes (Fig. 2A) or by saturating the phagocytosis apparatus (Fig. 2B), a confirmation by two totally independent approaches.

      Of note, the observed eater and Hayan phenotypes are strong and not relatively small and thus unlikely to be due to the genetic background.

      The PPO mutants have been isogenized in the w<sup>1118</sup> by the lab of Bruno Lemaitre[13, 14] and are also validated biochemically in Fig. S2D. These mutants have been extensively tested in the Lemaitre laboratory[13, 14, 15].

      With respect to RNAi silencing driven ubiquitously or in specific tissues using the UAS-Gal4 system, we have mostly used transgenes from the Trip collection and have used as a control the mCherry RNAi provided by this resource[16]. As the RNAi transgenes have been generated in the same genetic background, it follows that independently of the driver used, the genetic background used in mCherry and genes-of-interest (Duox, Nox, Jafrac2) silenced flies is controlled for (Fig. 3D,E).

      For UAS-Gal4-mediated overexpression of fly superoxide dismutase genes, we have used SOD1 and SOD2 transgenes that have both been generated by the same laboratory (Phillips laboratory, University of Guelph) presumably in the same genetic background. Using two distinct drivers we find a strongly enhanced susceptibility phenotype when using UAS-SOD2 but not UAS-SOD1 transgenes (Fig. 3F, Fig. 4E). Importantly, the former is associated with mitochondria whereas the other is expressed in the endoplasmic reticulum: we independently confirm this phenotype using the mitoTempo mitochondrial ROS inhibitor.

      We shall thus address the criticism with NOS mutants, where genetic background control is indeed critical and for the UAS-kay RNAi line using a Trip line and its associated mCherry RNAi control transgene.

      With respect to the Toll pathway mutants, we agree that some of the variability of the phenotypes may be due to the genetic background, especially as regards tube and pelle. The SPE and grass mutants have been retrieved in a screen performed by the group of Jean-Marc Reichhart in our Research Unit. They thus have been generated in the same genetic background, yet grass displays a mildly decreased virulence of injected OMVs whereas SPE mutants display an opposite phenotype (compare Fig. S1E to S1I; the survival experiments have been performed in the same set of experiments and have been separated for clarity). We do not intend to analyze further the mutants of the Toll pathway, as our data suggest that the canonical Toll pathway, likely activated through psh (Fig. S1F), is activated to detectable levels too late by comparison with the time course of OMV pathogenicity. In our opinion, the contribution of the Toll pathway to the host defense against OMV pathogenicity is minor, although we acknowledge that some of the findings, especially with SPE, are puzzling.

      With respect to the IMD pathway, we shall test also PGRP-LC and Relish mutants, as suggested by reviewers 2&3.

      Reviewer 2 query: “It is unclear how many Serratia marcescens cells a 69 nL injection of 0.1 ng/nL OMVs corresponds to.”

      OMVs were purified from 600 mL of SmDb11 cultures grown to an average OD<sub>600</sub> of 2.0. Based on a cell density of 0.8 × 10<sup>8</sup> cells/mL per OD unit, this corresponds to approximately 9.6 × 10<sup>10</sup> total bacterial cells.

      Each OMV preparation was concentrated into a final volume of 400 µL, resulting in a concentration factor of ~1500× relative to the original culture. Therefore, an injection dose of 69 nL of OMVs is equivalent to 0.1 mL of the starting bacterial culture, which corresponds to:

      0.2 OD units

      Approximately 1.6 × 10<sup>7</sup> bacterial cells
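
      The arithmetic above can be checked with a short Python sketch. The input values are those quoted in this response; the function and variable names are illustrative assumptions, and the unrounded results (about 0.10 mL and about 1.66 × 10<sup>7</sup> cells) agree with the rounded figures given above.

```python
# Values quoted in the response; names are illustrative, not from the manuscript.
CULTURE_VOLUME_ML = 600.0        # volume of SmDb11 culture used per preparation
OD600 = 2.0                      # average optical density of the culture
CELLS_PER_ML_PER_OD = 0.8e8      # assumed cell density per OD unit
FINAL_VOLUME_UL = 400.0          # volume of the concentrated OMV preparation


def omv_dose_equivalents(injection_nl):
    """Convert an injected OMV volume (nL) into starting-culture equivalents."""
    total_cells = CULTURE_VOLUME_ML * OD600 * CELLS_PER_ML_PER_OD
    # 600 mL concentrated into 400 uL (0.4 mL) gives the concentration factor.
    concentration_factor = CULTURE_VOLUME_ML * 1000.0 / FINAL_VOLUME_UL
    # 1 nL = 1e-6 mL, so the injected volume maps back to mL of original culture.
    culture_ml = injection_nl * 1e-6 * concentration_factor
    return {
        "total_cells": total_cells,
        "concentration_factor": concentration_factor,
        "culture_ml": culture_ml,
        "od_units": culture_ml * OD600,
        "cells": culture_ml * OD600 * CELLS_PER_ML_PER_OD,
    }


dose = omv_dose_equivalents(69.0)
```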

      It is likely that such high concentrations occur only toward the end of the infection, if OMVs are produced at the same rate in the host and in vitro.

      With respect to other Reviewer 2 queries, we shall attempt to label OMVs with the FM4-64 lipophilic dye and examine whether they are taken up by hemocytes. However, an issue may arise with potentially high background, which has been encountered in cell culture. Of note, OMVs are known to attack cultured human THP1 cells, a monocyte cell line [17]. Determining whether OMVs are taken up by hemocytes may only be a starting point to understand how they promote the pathogenicity of OMVs. This question constitutes the topic of a full study that we are currently unable to undertake.

      We shall also test whether we can document phospho-JNK expression in neural tissues.

      Finally, we shall also confirm the data obtained with two elav-Gal4 drivers (including an inducible one) with the nsyb-Gal4 driver line.

      References

      (1) Xu R, et al. The Toll pathway mediates Drosophila resilience to Aspergillus mycotoxins through specific Bomanins. EMBO Rep 24, e56036 (2023).

      (2) Huang J, et al. A Toll pathway effector protects Drosophila specifically from distinct toxins secreted by a fungus or a bacterium. Proc Natl Acad Sci U S A 120, e2205140120 (2023).

      (3) Gobert V, et al. Dual Activation of the Drosophila Toll Pathway by Two Pattern Recognition Receptors. Science 302, 2126-2130 (2003).

      (4) Gottar M, et al. Dual Detection of Fungal Infections in Drosophila via Recognition of Glucans and Sensing of Virulence Factors. Cell 127, 1425-1437 (2006).

      (5) Gottar M, et al. The Drosophila immune response against Gram-negative bacteria is mediated by a peptidoglycan recognition protein. Nature 416, 640-644 (2002).

      (6) Thibault ST, et al. A complementary transposon tool kit for Drosophila melanogaster using P and piggyBac. Nat Genet 36, 283-287 (2004).

      (7) Rutschmann S, Jung AC, Hetru C, Reichhart J-M, Hoffmann JA, Ferrandon D. The Rel protein DIF mediates the antifungal, but not the antibacterial, response in Drosophila. Immunity 12, 569-580 (2000).

      (8) Rutschmann S, Jung AC, Rui Z, Silverman N, Hoffmann JA, Ferrandon D. Role of Drosophila IKKg in a Toll-independent antibacterial immune response. Nat Immunology 1, 342-347 (2000).

      (9) Jung A, Criqui M-C, Rutschmann S, Hoffmann J-A, Ferrandon D. A microfluorometer assay to measure the expression of ß-galactosidase and GFP reporter genes in single Drosophila flies. Biotechniques 30, 594-601 (2001).

      (10) Nam HJ, Jang IH, You H, Lee KA, Lee WJ. Genetic evidence of a redox-dependent systemic wound response via Hayan protease-phenoloxidase system in Drosophila. Embo J 31, 1253-1265 (2012).

      (11) Kocks C, et al. Eater, a transmembrane protein mediating phagocytosis of bacterial pathogens in Drosophila. Cell 123, 335-346 (2005).

      (12) Bretscher AJ, et al. The Nimrod transmembrane receptor Eater is required for hemocyte attachment to the sessile compartment in Drosophila melanogaster. Biology open 4, 355-363 (2015).

      (13) Binggeli O, Neyen C, Poidevin M, Lemaitre B. Prophenoloxidase activation is required for survival to microbial infections in Drosophila. PLoS Pathog 10, e1004067 (2014).

      (14) Dudzic JP, Kondo S, Ueda R, Bergman CM, Lemaitre B. Drosophila innate immunity: regional and functional specialization of prophenoloxidases. BMC Biol 13, 81 (2015).

      (15) Dudzic JP, Hanson MA, Iatsenko I, Kondo S, Lemaitre B. More Than Black or White: Melanization and Toll Share Regulatory Serine Proteases in Drosophila. Cell reports 27, 1050-1061 e1053 (2019).

      (16) Perkins LA, et al. The Transgenic RNAi Project at Harvard Medical School: Resources and Validation. Genetics 201, 843-852 (2015).

      (17) Goman A, et al. Uncovering a new family of conserved virulence factors that promote the production of host-damaging outer membrane vesicles in gram-negative bacteria. J Extracell Vesicles 14, e270032 (2025).

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #2 (Public review):

      Summary:

      Using a gerbil model, the authors tested the hypothesis that loss of synapses between sensory hair cells and auditory nerve fibers (which may occur due to noise exposure or aging) affects behavioral discrimination of the rapid temporal fluctuations of sounds. In contrast to previous suggestions in the literature, their results do not support this hypothesis; young animals treated with a compound that reduces the number of synapses did not show impaired discrimination compared to controls. Additionally, their results from older animals showing impaired discrimination suggest that age-related changes aside from synaptopathy are responsible for the age-related decline in discrimination.

      Strengths:

      (1) The rationale and hypothesis are well-motivated and clearly presented.

      (2) The study was well conducted with strong methodology for the most part, and good experimental control. The combination of physiological and behavioral techniques is powerful and informative. Reducing synapse counts fairly directly using ouabain is a cleaner design than using noise exposure or age (as in other studies), since these latter modifiers have additional effects on auditory function.

      (3) The study may have a considerable impact on the field. The findings could have important implications for our understanding of cochlear synaptopathy, one of the most highly researched and potentially impactful developments in hearing science in the past fifteen years.

      Weaknesses:

      (1) I have concerns that the gerbils may not have been performing the behavioral task using temporal fine structure information.

      Human studies using the same task employed a filter center frequency that was (at least) 11 times the fundamental frequency (Marmel et al., 2015; Moore and Sek, 2009). Moore and Sek wrote: "the default (recommended) value of the centre frequency is 11F0." Here, the center frequency was only 4 or 8 times the fundamental frequency (4F0 or 8F0). Hence, relative to harmonic frequency, the harmonic spacing was considerably greater in the present study. However, gerbil auditory filters are thought to be broader than those in humans. In the revised version of the manuscript, the authors provide modelling results suggesting that the excitation patterns were discriminable for the 4F0 conditions, but may not have been for the 8F0 conditions. These results provide some reassurance that the 8F0 discriminations were dependent on temporal cues, but the description of the model lacks detail. Also, the authors state that "thus, for these two conditions with harmonic number N of 8 the gerbils cannot rely on differences in the excitation patterns but must solve the task by comparing the temporal fine structure." This is too strong. Pulsed tone intensity difference limens (the reference used for establishing whether or not the excitation pattern cues were usable) may not be directly comparable to profile-analysis-like conditions, and it has been argued that frequency discrimination may be more sensitive to excitation pattern cues than predicted from a simple comparison to intensity difference limens (Micheyl et al. 2013, https://doi.org/10.1371/journal.pcbi.1003336).

      We consider our conclusions based on the excitation patterns adequate when gerbil auditory-filter data, frequency-difference limens and intensity-difference limens are put into perspective together. Kittel et al. (2002) observed an auditory-filter bandwidth in the gerbil that is larger by about a factor of 2 than in humans, which reduces the number of independent frequency channels available for the analysis of excitation patterns. The gerbil frequency-difference limen for pure tones, an indicator of the sensitivity for making use of excitation patterns, is more than an order of magnitude larger than the corresponding human frequency-difference limen (Klinge and Klump 2009, https://doi.org/10.1121/1.3021315). Finally, the gerbil intensity-difference limen of 2.8 dB observed for 1-kHz pure tones is considerably larger than the 0.75 dB observed for humans in the same study (Sinnott et al. 1992). Taken together, these lines of evidence indicate that our conclusions regarding the potential use of excitation patterns are not too strong.
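To put the two intensity-difference limens quoted in this response on a common scale, the following minimal sketch converts a limen in dB into a Weber fraction. It assumes the limen is defined as ΔL = 10·log10((I + ΔI)/I); this is a common convention, but it is not stated in the response, and the function name is purely illustrative.

```python
def weber_fraction(delta_l_db: float) -> float:
    """Convert an intensity-difference limen in dB to a Weber fraction dI/I.

    Assumes delta_L = 10 * log10((I + dI) / I). This definition is an
    assumption made for illustration, not taken from the cited study.
    """
    return 10 ** (delta_l_db / 10) - 1


# Limens quoted in the response (1-kHz pure tones, Sinnott et al. 1992).
gerbil_idl_db = 2.8
human_idl_db = 0.75

gerbil_wf = weber_fraction(gerbil_idl_db)
human_wf = weber_fraction(human_idl_db)

print(f"gerbil dI/I = {gerbil_wf:.2f}")
print(f"human  dI/I = {human_wf:.2f}")
print(f"ratio       = {gerbil_wf / human_wf:.1f}")
```

Under this assumed definition, the gerbil limen corresponds to an intensity increment of roughly 90% and the human limen to roughly 19%, which illustrates the size of the species difference the authors refer to.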

      I'm also somewhat concerned that the masking noise used in the present study was too low in level to mask cochlear distortion products. Based on their excitation pattern modelling, the authors state (without citation) that "since the level of excitation produced by the pink noise is less than 30 dB below that produced by the complex tones, distortion products will be masked." The basis for this claim is not clear. In human, distortion products may be only ~20 dB below the levels of the primaries (referenced to an external sound masker / canceller, which is appropriate, assuming that the modelling reported in the present paper did not include middle-ear effects; see Norman-Haignere and McDermott, 2016, doi: 10.1016/j.neuroimage.2016.01.050). Oxenham et al. (2009, doi: 10.1121/1.3089220) provide further cautionary evidence on the potential use of distortion product cues when the background noise level is too low (in their case the relative level of the noise in the compromised condition was only a little below that used in the present study). The masking level used in the present study may have been sufficient, but it would be useful to have some further reassurance on this point.

      In the Methods section, we provide the citation used for estimating the size of the distortion products, together with the estimated signal-to-noise ratio, making the basis for our estimates clear.

      We consulted Oxenham et al. (2009, doi: 10.1121/1.3089220), who suggested that distortion products may have been used by human subjects. However, in Fig. 1 of their paper, they convincingly demonstrate that even humans, who have narrower auditory filters than gerbils, cannot use spectral cues to evaluate the frequency shift in harmonic complex tones. We are confident that the same limitation applies to gerbils, which have wider auditory filters than humans and a lower ability to use spectral cues, as indicated by their higher frequency-difference limens and intensity-difference limens.

      (2) The synapse reductions in the high ouabain and old groups were relatively small (mean of 19 synapses per hair cell compared to 23 in the young untreated group). In contrast, in some mouse models of the effects of noise exposure or age, a 50% reduction in synapses is observed, and in the human temporal bone study of Wu et al. (2021, https://doi.org/10.1523/JNEUROSCI.3238-20.2021) the age-related reduction in auditory nerve fibres was ~50% or greater for the highest age group across cochlear location. It could be simply that the synapse loss in the present study was too small to produce significant behavioral effects. Hence, although the authors provide evidence that in the gerbil model the age-related behavioral effects are not due to synaptopathy, this may not translate to other species (including human).

      (3) The study was not pre-registered, and there was no a priori power calculation, so there is less confidence in replicability than could have been the case. Only three old animals were used in the behavioral study, which raises concerns about the reliability of comparisons involving this group.

      Reviewer #3 (Public review):

      This study is a part of the ongoing series of rigorous work from this group exploring neural coding deficits in the auditory nerve, and dissociating the effects of cochlear synaptopathy from other age-related deficits. They have previously shown no evidence of phase-locking deficits in the remaining auditory nerve fibers in quiet-aged gerbils. Here, they study the effects of aging on the perception and neural coding of temporal fine structure cues in the same Mongolian gerbil model.

      They measure TFS coding in the auditory nerve using the TFS1 task which uses a combination of harmonic and tone-shifted inharmonic tones which differ primarily in their TFS cues (and not the envelope). They then follow this up with a behavioral paradigm using the TFS1 task in these gerbils. They test young normal hearing gerbils, aged gerbils, and young gerbils with cochlear synaptopathy induced using the neurotoxin ouabain to mimic synapse losses seen with age.
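The property described in this paragraph, that harmonic and frequency-shifted tones differ in fine structure but not in envelope periodicity, can be sketched by listing the component frequencies. This is a minimal illustration: the shift value and the range of component ranks are hypothetical, and the bandpass filtering and masking noise of the actual TFS1 stimuli are ignored.

```python
def complex_tone_components(f0_hz, shift_hz, lowest_rank, n_components):
    """Return the component frequencies (Hz) of a complex tone.

    shift_hz = 0 gives a harmonic complex. A non-zero shift moves every
    component by the same amount, which makes the tone inharmonic.
    """
    return [rank * f0_hz + shift_hz
            for rank in range(lowest_rank, lowest_rank + n_components)]


def spacings(freqs):
    """Differences between adjacent component frequencies (Hz)."""
    return [b - a for a, b in zip(freqs, freqs[1:])]


f0 = 200.0     # fundamental frequency, one of the values used in the study
shift = 50.0   # hypothetical frequency shift, chosen only for illustration

# Ranks 6 to 10 are a hypothetical choice centred on 8 * F0 = 1600 Hz.
harmonic = complex_tone_components(f0, 0.0, lowest_rank=6, n_components=5)
inharmonic = complex_tone_components(f0, shift, lowest_rank=6, n_components=5)

print("harmonic  :", harmonic, "spacing:", spacings(harmonic))
print("inharmonic:", inharmonic, "spacing:", spacings(inharmonic))
```

In both tones the component spacing stays at F0, so the envelope repetition rate is unchanged, while the shifted components are no longer integer multiples of F0.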

      In the behavioral paradigm, they find that aging is associated with decreased performance compared to the young gerbils, whereas young gerbils with similar levels of synapse loss do not show these deficits. When looking at the auditory nerve responses, they find no differences in neural coding of TFS cues across any of the groups. However, aged gerbils show an increase in the representation of periodicity envelope cues (around f0) compared to young gerbils or those with induced synapse loss. The authors hence conclude that synapse loss by itself doesn't seem to be important for distinguishing TFS cues, and rather the behavioral deficits with age are likely having to do with the misrepresented envelope cues instead.

      The manuscript is well written, and the data presented are robust. Some of the points below will need to be considered while interpreting the results of the study, in its current form. These considerations are addressable if deemed necessary, with some additional analysis in future versions of the manuscript.

      Spontaneous rates - Figure S2 shows no differences in median spontaneous rates across groups. But taking the median glosses over some of the nuances there. Ouabain (in the Bourien study) famously affects low spont rates first, and at a higher degree than median or high spont rates. It seems to be the case (qualitatively) in figure S2 as well, with almost no units in the low spont region in the ouabain group, compared to the other groups. Looking at distributions within each spont rate category and comparing differences across the groups might reveal some of the underlying causes for these changes. Given that overall, the study reports that low-SR fibers had a higher ENV/TFS log-z-ratio, the distribution of these fibers across groups may reveal specific effects of TFS coding by group.

      [Update: The revised manuscript has addressed these issues]

      Threshold shifts - It is unclear from the current version if the older gerbils have changes in hearing thresholds, and whether those changes may be affecting behavioral thresholds. The behavioral stimuli appear to have been presented at a fixed sound level for both young and aged gerbils, similar to the single unit recordings. Hence, age-related differences in behavior may have been due to changes in relative sensation level. Approaches such as using hearing thresholds as covariates in the analysis will help explore if older gerbils still show behavioral deficits.

      [Update: The issue of threshold shifts with aging gerbils is still unresolved in my opinion. From the revised manuscript, it appears that aged gerbils have a 36 dB shift in thresholds. While the revised manuscript provides convincing evidence that these threshold shifts do not affect the auditory nerve tuning properties, the behavioral paradigm was still presented at the same sound level for young and aged animals. But a potential 36 dB change in sensation level may affect behavioral results. The authors may consider adding thresholds as covariates in analyses or present any evidence that behavioral thresholds are plateaued along that 30 dB range].

      Since we do not have behavioural detection thresholds from our individual animals, but only CAP thresholds, which represent the auditory-nerve data and cannot be translated directly into behavioural thresholds, we prefer to refrain from using these indirect measures as covariates in the present analysis. In addition, the study by Hamann et al. (2002, https://doi.org/10.1016/S0378-5955(02)00454-9) indicates that age-related behavioural threshold increases are smaller than threshold increases obtained from auditory brainstem response measurements. Finally, statistical analyses on very small samples can be unreliable due to problems of power, generalisability, and susceptibility to outliers.

      Moore and Sek (2009), in their paper on the TFS1 test, pointed out that the effect of signal level on the TFS1 threshold in normal-hearing human subjects was small when the signal-to-noise ratio between the broadband masking noise and the complex tone was kept constant. Furthermore, the masking noise will raise the thresholds of normal-hearing gerbils and of old gerbils with an audibility threshold increase to about the same signal-to-noise ratio. Thus, as long as the signal remains audible to the behaviourally tested gerbil, which can be expected at an overall signal level of 68 dB SPL, we expect little effect of raised audibility thresholds on the TFS1 threshold. The lack of temporal processing deficits in the auditory-nerve fibers of old, mildly hearing-impaired gerbils compared to those in normal-hearing young adult gerbils further strengthens this argument.
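The reviewer's sensation-level concern can be made concrete with simple arithmetic. The sketch below is only illustrative: the 68 dB SPL presentation level and the 36 dB threshold shift are quoted in this exchange, whereas the young-adult threshold is a hypothetical placeholder. The 36 dB figure also derives from CAP rather than behavioural thresholds, which, as the authors note, cannot be translated directly.

```python
def sensation_level_db(presentation_db_spl: float, threshold_db_spl: float) -> float:
    """Sensation level: presentation level relative to the listener's threshold."""
    return presentation_db_spl - threshold_db_spl


presentation = 68.0       # dB SPL, overall signal level stated in the response
threshold_shift = 36.0    # dB, age-related shift quoted by the reviewer
young_threshold = 10.0    # dB SPL, hypothetical placeholder, not from the study

young_sl = sensation_level_db(presentation, young_threshold)
old_sl = sensation_level_db(presentation, young_threshold + threshold_shift)

print(f"young sensation level: {young_sl:.0f} dB")
print(f"old sensation level:   {old_sl:.0f} dB")
print(f"difference:            {young_sl - old_sl:.0f} dB")
```

Whatever the true young-adult threshold, the difference in sensation level equals the threshold shift. The authors' counter-argument is that, in the presence of the masking noise, the signal-to-noise ratio rather than the sensation level is the relevant quantity.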

      Task learning in aged gerbils - It is unclear if the aged gerbils really learn the task well in two of the three TFS1 test conditions. The d' of 1, which is usually used as the criterion for learning, was not reached by the aged gerbils even at the easiest setting in all but one condition (Fig. 5H), and in that condition there do not seem to be any age-related deficits in behavioral performance (Fig. 6B). Hence, dissociating the inability to learn the task from the inability to perceive TFS1 cues in those animals becomes challenging.

      [Update: The revised manuscript sufficiently addresses these issues, with the caveat of hearing threshold changes affecting behavioral thresholds mentioned above].

      As we argued above, an audibility threshold increase in the old gerbils is unlikely to explain the raised TFS1 thresholds in the old gerbils.
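For readers unfamiliar with the d' criterion discussed in this exchange, the following minimal sketch shows the standard signal-detection computation, d' = z(hit rate) − z(false-alarm rate). It assumes a yes/no (go/no-go) design; the exact procedure used in the manuscript is not described here, and the example rates are hypothetical.

```python
from statistics import NormalDist


def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    Both rates must lie strictly between 0 and 1, because the inverse
    normal CDF is undefined at the boundaries.
    """
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)


LEARNING_CRITERION = 1.0  # d' of 1, the criterion mentioned by the reviewer

# Hypothetical response rates, chosen only to illustrate the computation.
example = d_prime(hit_rate=0.80, false_alarm_rate=0.40)
print(f"d' = {example:.2f}, criterion met: {example >= LEARNING_CRITERION}")
```

Rates of exactly 0 or 1 would need a correction before this function can be applied; which correction, if any, the authors used is not stated in this exchange.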

      Increased representation of periodicity envelope in the AN - the mechanisms for increased representation of periodicity envelope cues are unclear. The authors point to some potential central mechanisms, but given that these are recordings from the auditory nerve, what central mechanisms these may be is unclear. If the authors are suggesting some form of efferent modulation only at the f0 frequency, no evidence for this is presented. It appears more likely that the enhancement may be due to outer hair cell dysfunction (widened tuning, distorted tonotopy). Given this increased envelope coding, the potential change in sensation level for the behavior (from the comment above), and no change in neural coding of TFS cues across any of the groups, a simpler interpretation may be: TFS coding is not affected in remaining auditory nerve fibers after age-related or ouabain-induced synapse loss, but behavioral performance is affected by outer hair cell dysfunction with age.

      [Update: The revised manuscript has addressed these issues]

      Emerging evidence seems to suggest that cochlear synaptopathy and/or TFS encoding abilities might be reflected in listening effort rather than behavioral performance. Measuring some proxy of listening effort in these gerbils (like reaction time) to see if that has changed with synapse loss, especially in the young animals with induced synaptopathy, would make an interesting addition to explore perceptual deficits of TFS coding with synapse loss.

      [Update: The revised manuscript has addressed these issues]

      Reviewer #3 (Recommendations for the authors):

      Thank you for your revisions. They largely address most of my initial concerns. The issue of threshold shifts potentially affecting behavioral thresholds still remains unresolved in my opinion. The new data about unaltered tuning curves is convincing that the auditory nerve fiber recordings are unaffected by threshold shifts. But am I correct in my understanding that the threshold shift with age was 36 dB relative to the young (L168)? If so, wouldn't the fact that behavior was performed at 68 dB SPL regardless of group affect the behavioral thresholds with age? Is there any additional evidence that suggests that behavioral performance plateaus along that ~30 dB range that the authors could include to strengthen this claim?

      In our response above to reviewer #3 and to reviewer #2 we provided additional arguments why we think that an audibility threshold increase in old gerbils cannot explain their compromised TFS1 thresholds.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:  

      The authors investigate the effects of aging on auditory system performance in understanding temporal fine structure (TFS), using both behavioral assessments and physiological recordings from the auditory periphery, specifically at the level of the auditory nerve. This dual approach aims to enhance understanding of the mechanisms underlying observed behavioral outcomes. The results indicate that aged animals exhibit deficits in behavioral tasks for distinguishing between harmonic and inharmonic sounds, which is a standard test for TFS coding. However, neural responses at the auditory nerve level do not show significant differences when compared to those in young, normal-hearing animals. The authors suggest that these behavioral deficits in aged animals are likely attributable to dysfunctions in the central auditory system, potentially as a consequence of aging. To further investigate this hypothesis, the study includes an animal group with selective synaptic loss between inner hair cells and auditory nerve fibers, a condition known as cochlear synaptopathy (CS). CS is a pathology associated with aging and is thought to be an early indicator of hearing impairment. Interestingly, animals with selective CS showed physiological and behavioral TFS coding similar to that of the young normal-hearing group, contrasting with the aged group's deficits. Despite histological evidence of significant synaptic loss in the CS group, the study concludes that CS does not appear to affect TFS coding, either behaviorally or physiologically.

      We agree with the reviewer’s summary.

      Strengths:  

      This study addresses a critical health concern, enhancing our understanding of mechanisms underlying age-related difficulties in speech intelligibility, even when audiometric thresholds are within normal limits. A major strength of this work is the comprehensive approach, integrating behavioral assessments, auditory nerve (AN) physiology, and histology within the same animal subjects. This approach enhances understanding of the mechanisms underlying the behavioral outcomes and provides confidence in the actual occurrence of synapse loss and its effects. The study carefully manages controlled conditions by including five distinct groups: young normal-hearing animals, aged animals, animals with CS induced through low and high doses, and a sham surgery group. This careful setup strengthens the study's reliability and allows for meaningful comparisons across conditions. Overall, the manuscript is well-structured, with clear and accessible writing that facilitates comprehension of complex concepts.

      Weaknesses:

      The stimulus and task employed in this study are very helpful for behavioral research, and using the same stimulus setup for physiology is advantageous for mechanistic comparisons. However, I have some concerns about the limitations in auditory nerve (AN) physiology. Due to practical constraints, it is not feasible to record from a large enough population of fibers that covers a full range of best frequencies (BFs) and spontaneous rates (SRs) within each animal. This raises questions about how representative the physiological data are for understanding the mechanism in behavioral data. I am curious about the authors' interpretation of how this stimulus setup might influence results compared to methods used by Kale and Heinz (2010), who adjusted harmonic frequencies based on the characteristic frequency (CF) of recorded units. In this study, by contrast, the harmonic frequencies are fixed across all CFs, meaning that many AN fibers may not be tuned closely to the stimulus frequencies, which adds sampling variability to the results. If units are not responsive to the stimulus, further clarification on detecting mistuning and phase locking to TFS effects within this setup would be valuable.

      We chose the stimuli for the AN recordings to be identical to the stimuli used in the behavioral evaluation of the perceptual sensitivity. Only with this approach can we directly compare the response of the population of AN fibers with perception measured in behavior.

      The stimuli are complex, i.e., they comprise many frequency components, and were presented at 68 dB SPL. Thus, the stimuli excite a given fiber within a large portion of the fiber’s receptive field. Furthermore, during recordings, we verified by audiovisual control that fibers responded to the stimuli. Otherwise, it would have cost valuable recording time to record from a nonresponsive AN fiber.

      Given the limited number of units per condition (sometimes as few as three for certain conditions), I wonder if CF-dependent variability might impact the results of the AN data in this study; discussing this factor could help with better understanding of the results. While the use of the same stimuli for both behavioral and physiological recordings is understandable, a discussion on how this choice affects interpretation would be beneficial. In addition, a 60 dB stimulus could saturate high spontaneous rate (HSR) AN fibers, influencing neural coding and phase-locking to TFS. Separating SR groups could potentially help address these issues and improve interpretive clarity.

      A deeper discussion on the role of fiber spontaneous rate could also enhance the study. How might considering SR groups affect AN results related to TFS coding? While some statistical measures are included in the supplement, a more detailed discussion in the main text could help in interpretation.

      We do not think that it will be necessary to conduct any statistical analysis in addition to that already reported in the supplement.

      We considered moving some supplementary information back into the main manuscript but decided against it. Our single-unit sample was not sufficient, i.e. not all subpopulations of auditory-nerve fibers were sufficiently sampled for all animal treatment groups, to conclusively resolve every aspect that may be interesting to explore. The power of our approach lies in the direct linkage of several levels of investigation – cochlear synaptic morphology, single-unit representation and behavioral performance – and, in the main manuscript, we focus on the core question of synaptopathy and its relation to temporal fine structure perception. This is now spelled out clearly in lines 197 - 203 of the main manuscript.  

      Although Figure S2 indicates no change in median SR, the high-dose treatment group lacks LSR fibers, suggesting a different distribution based on SR for different animal groups, as seen in similar studies on other species. A histogram of these results would be informative, as LSR fiber loss with CS, whether induced by ouabain in gerbils or noise in other animals, is well documented (e.g., Furman et al., 2013).

      Figure S2 was revised to avoid overlap of data points and show the distributions more clearly. Furthermore, the sample sizes for LSR and HSR fibers are now provided separately.

      Although ouabain effects on gerbils have been explored in previous studies, since these data already seem to have been recorded for the animals in this study, a brief description of changes in auditory brainstem response (ABR) thresholds, wave 1 amplitudes, and tuning curves for animals with cochlear synaptopathy (CS) in this study would be beneficial. This would confirm that ouabain selectively affects synapses without impacting outer hair cells (OHCs). For aged animals, since ABR measurements were taken, comparing hearing differences between normal and aged groups could provide insights into the pathologies besides CS in aged animals. Additionally, examining subject variability in treatment effects on hearing and how this correlates with behavior and physiology would yield valuable insights. If space is limited, a brief clarification or inclusion in the supplementary material may suffice.

      We thank the reviewer for this constructive suggestion. The requested data were added in a new section of the Results, entitled “Threshold sensitivity and frequency tuning were not affected by the synapse loss.” (lines 150 – 174). Our young-adult, ouabain-treated gerbils showed no significant elevations of CAP thresholds and their neural tuning was normal. Old gerbils showed the typical threshold losses for individuals of comparable age, and normal neural tuning, confirming previous reports. Thus, there was no evidence for relevant OHC impairments in any of our animal groups.   

      Another suggestion is to discuss the potential role of the MOC efferent system and the effect of anesthesia in reducing efferent effects in AN recordings. This is particularly relevant for aged animals, as CS might affect LSR fibers, potentially disrupting the medial olivocochlear (MOC) efferent pathway. Anesthesia could lessen MOC activity in both young and aged animals, potentially masking efferent effects that might be present in behavioral tasks. Young gerbils with functional efferent systems might perform better behaviorally, while aged gerbils with impaired MOC function due to CS might lack this advantage. A brief discussion on this aspect could potentially enhance mechanistic insights.

      Thank you for this suggestion. The potential role of olivocochlear efferents is now discussed in lines 597 - 613.

      Lastly, although synapse counts did not differ between the low-dose treatment and NH I sham groups, separating these groups rather than combining them with the sham might reveal differences in behavior or AN results, particularly regarding the significance of differences between aged/treatment groups and the young normal-hearing group.  

      To maximize statistical power, we combined those groups in the statistical analysis. These two groups did not differ in synapse number, threshold sensitivity or neural tuning bandwidths.

      Reviewer #2 (Public review):

      Summary:  

      Using a gerbil model, the authors tested the hypothesis that loss of synapses between sensory hair cells and auditory nerve fibers (which may occur due to noise exposure or aging) affects behavioral discrimination of the rapid temporal fluctuations of sounds. In contrast to previous suggestions in the literature, their results do not support this hypothesis; young animals treated with a compound that reduces the number of synapses did not show impaired discrimination compared to controls. Additionally, their results from older animals showing impaired discrimination suggest that age-related changes aside from synaptopathy are responsible for the age-related decline in discrimination.

      We agree with the reviewer’s summary.

      Strengths: 

      (1) The rationale and hypothesis are well-motivated and clearly presented. 

      (2) The study was well conducted with strong methodology for the most part, and good experimental control. The combination of physiological and behavioral techniques is powerful and informative. Reducing synapse counts fairly directly using ouabain is a cleaner design than using noise exposure or age (as in other studies), since these latter modifiers have additional effects on auditory function. 

      (3) The study may have a considerable impact on the field. The findings could have important implications for our understanding of cochlear synaptopathy, one of the most highly researched and potentially impactful developments in hearing science in the past fifteen years.  

      Weaknesses: 

      (1) My main concern is that the stimuli may not have been appropriate for assessing neural temporal coding behaviorally. Human studies using the same task employed a filter center frequency that was (at least) 11 times the fundamental frequency (Marmel et al., 2015; Moore and Sek, 2009). Moore and Sek wrote: "the default (recommended) value of the centre frequency is 11F0." Here, the center frequency was only 4 or 8 times the fundamental frequency (4F0 or 8F0). Hence, relative to harmonic frequency, the harmonic spacing was considerably greater in the present study. By my calculations, the masking noise used in the present study was also considerably lower in level relative to the harmonic complex than that used in the human studies. These factors may have allowed the animals to perform the task using cues based on the pattern of activity across the neural array (excitation pattern cues), rather than cues related to temporal neural coding. The authors show that mean neural driven rate did not change with frequency shift, but I don't understand the relevance of this. It is the change in response of individual fibers with characteristic frequencies near the lowest audible harmonic that is important here.  

      The auditory filter bandwidth of the gerbil is about double that of human subjects. Because of this, the overall level of the masking noise within the filter is larger than in the human studies, which prevents the use of distortion products. The larger auditory filter bandwidth also prevents the gerbils from using excitation patterns, especially in the condition with a center frequency of 1600 Hz and a fundamental of 200 Hz and in the condition with a center frequency of 3200 Hz and a fundamental of 400 Hz. In the condition with a center frequency of 1600 Hz and a fundamental of 400 Hz, it is possible that excitation patterns are exploited. We have now added modeling of the excitation patterns, and a new figure showing their change at the gerbils’ perception threshold, to the discussion of the revised version (lines 440 - 446 and Fig. 8).
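The three stimulus conditions named in this response can be related to the reviewer's 11F0 benchmark by computing the harmonic rank of the center frequency, N = fc / F0. The sketch below uses only the frequencies stated above and the value recommended by Moore and Sek (2009); the function and variable names are illustrative.

```python
RECOMMENDED_RANK = 11  # Moore and Sek (2009): centre frequency of 11 * F0

# (center frequency in Hz, fundamental frequency in Hz), as listed in the response
conditions = [(1600, 200), (3200, 400), (1600, 400)]


def harmonic_rank(center_hz: float, f0_hz: float) -> float:
    """Harmonic rank N of the center frequency, i.e. fc / F0."""
    return center_hz / f0_hz


for fc, f0 in conditions:
    n = harmonic_rank(fc, f0)
    print(f"fc = {fc} Hz, F0 = {f0} Hz -> N = {n:g} "
          f"(recommended for humans: {RECOMMENDED_RANK})")
```

The first two conditions correspond to N = 8 and the third to N = 4, which matches the distinction the authors draw between conditions where excitation-pattern cues are ruled out and the one where they may be exploited.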

      The case against excitation pattern cues needs to be better made in the Discussion. It could be that gerbil frequency selectivity is broad enough for this not to be an issue, but more detail needs to be provided to make this argument. The authors should consider what is the lowest audible harmonic in each case for their stimuli, given the level of each harmonic and the level of the pink noise. Even for the 8F0 center frequency, the lowest audible harmonic may be as low as the 4th (possibly even the 3rd). In human, harmonics are thought to be resolvable by the cochlea up to at least the 8th.  

      This issue is now covered in the discussion, see response to the previous point.

      (2) The synapse reductions in the high ouabain and old groups were relatively small (mean of 19 synapses per hair cell compared to 23 in the young untreated group). In contrast, in some mouse models of the effects of noise exposure or age, a 50% reduction in synapses is observed, and in the human temporal bone study of Wu et al. (2021, https://doi.org/10.1523/JNEUROSCI.3238-20.2021) the age-related reduction in auditory nerve fibres was ~50% or greater for the highest age group across cochlear location. It could be simply that the synapse loss in the present study was too small to produce significant behavioral effects. Hence, although the authors provide evidence that in the gerbil model the age-related behavioral effects are not due to synaptopathy, this may not translate to other species (including human). This should be discussed in the manuscript. 

      We agree that our results apply to moderate synaptopathy, which predominantly characterizes early stages of hearing loss or aged individuals without confounding noise-induced cochlear damage. This is now discussed in lines 486 – 498.

      It would be informative to provide synapse counts separately for the animals who were tested behaviorally, to confirm that the pattern of loss across the group was the same as for the larger sample.  

      Yes, the pattern was the same for the subgroup of behaviorally tested animals. We have added this information to the revised version of the manuscript (lines 137 – 141).

      (3) The study was not pre-registered, and there was no a priori power calculation, so there is less confidence in replicability than could have been the case. Only three old animals were used in the behavioral study, which raises concerns about the reliability of comparisons involving this group.  

      The results for the three old subjects differed significantly from those of young subjects and young ouabain-treated subjects. This indicates a sufficient statistical power, since otherwise no significant differences would be observed.

      Reviewer #3 (Public review):

      This study is a part of the ongoing series of rigorous work from this group exploring neural coding deficits in the auditory nerve, and dissociating the effects of cochlear synaptopathy from other age-related deficits. They have previously shown no evidence of phase-locking deficits in the remaining auditory nerve fibers in quiet-aged gerbils. Here, they study the effects of aging on the perception and neural coding of temporal fine structure cues in the same Mongolian gerbil model. 

      They measure TFS coding in the auditory nerve using the TFS1 task which uses a combination of harmonic and tone-shifted inharmonic tones which differ primarily in their TFS cues (and not the envelope). They then follow this up with a behavioral paradigm using the TFS1 task in these gerbils. They test young normal hearing gerbils, aged gerbils, and young gerbils with cochlear synaptopathy induced using the neurotoxin ouabain to mimic synapse losses seen with age. 

      In the behavioral paradigm, they find that aging is associated with decreased performance compared to the young gerbils, whereas young gerbils with similar levels of synapse loss do not show these deficits. When looking at the auditory nerve responses, they find no differences in neural coding of TFS cues across any of the groups. However, aged gerbils show an increase in the representation of periodicity envelope cues (around f0) compared to young gerbils or those with induced synapse loss. The authors hence conclude that synapse loss by itself doesn't seem to be important for distinguishing TFS cues, and rather the behavioral deficits with age are likely having to do with the misrepresented envelope cues instead.  

      We agree with the reviewer’s summary.

      The manuscript is well written, and the data presented are robust. Some of the points below will need to be considered while interpreting the results of the study, in its current form. These considerations are addressable if deemed necessary, with some additional analysis in future versions of the manuscript. 

      Spontaneous rates - Figure S2 shows no differences in median spontaneous rates across groups. But taking the median glosses over some of the nuances there. Ouabain (in the Bourien study) famously affects low spont rates first, and at a higher degree than median or high spont rates. It seems to be the case (qualitatively) in Figure S2 as well, with almost no units in the low spont region in the ouabain group, compared to the other groups. Looking at distributions within each spont rate category and comparing differences across the groups might reveal some of the underlying causes for these changes. Given that overall, the study reports that low-SR fibers had a higher ENV/TFS log-zratio, the distribution of these fibers across groups may reveal specific effects of TFS coding by group.  

      As the reviewer points out, our sample from the group treated with a high concentration of ouabain showed very few low-spontaneous-rate auditory-nerve fibers, as expected from previous work. However, this was also true, e.g., for our sample from sham-operated animals, and may thus well reflect a sampling bias. We are therefore reluctant to attach much significance to these data distributions. We now point out more clearly the limitations of our auditory-nerve sample for the exploration of  interesting questions beyond our core research aim (see also response to Reviewer 1 above).  

      Threshold shifts - It is unclear from the current version if the older gerbils have changes in hearing thresholds, and whether those changes may be affecting behavioral thresholds. The behavioral stimuli appear to have been presented at a fixed sound level for both young and aged gerbils, similar to the single unit recordings. Hence, age-related differences in behavior may have been due to changes in relative sensation level. Approaches such as using hearing thresholds as covariates in the analysis will help explore if older gerbils still show behavioral deficits.  

      Unfortunately, we did not obtain behavioral thresholds that could be used here. We want to point out that the TFS1 stimuli had an overall level of 68 dB SPL, and the pink noise masker would have increased the threshold more than expected from the moderate, age-related hearing loss in quiet. Thus, the masked thresholds for all gerbil groups are likely similar and should have no effect on the behavioral results.

      Task learning in aged gerbils - It is unclear if the aged gerbils really learn the task well in two of the three TFS1 test conditions. The d' of 1, which is usually used as the criterion for learning, was not reached even in the easiest condition in all but one of the conditions for the aged gerbils (Fig. 5H), and in that condition there do not seem to be any age-related deficits in behavioral performance (Fig. 6B). Hence dissociating the inability to learn the task from the inability to perceive TFS1 cues in those animals becomes challenging.  

      Even in the group of gerbils with the lowest sensitivity, for the condition 400/1600 the animals achieved a d’ of on average above 1. Furthermore, stimuli were well above threshold and audible, even when no discrimination could be observed. Finally, as explained in the methods, different stimulus conditions were interleaved in each session, providing stimuli that were easy to discriminate together with those being difficult to discriminate. This approach ensures that the gerbils were under stimulus control, meaning properly trained to perform the task. Thus, an inability to discriminate does not indicate a lack of proper training.  
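
      The d' criterion discussed in this exchange can be illustrated with a minimal sketch. It assumes the standard signal-detection definition, d' = z(hit rate) - z(false-alarm rate); the function name and the example rates are hypothetical and are not taken from the manuscript.

```python
from statistics import NormalDist


def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    Both rates must lie strictly between 0 and 1, because the inverse
    normal CDF is undefined at the endpoints.
    """
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal distribution
    return z(hit_rate) - z(false_alarm_rate)


# Hypothetical response rates, chosen only for illustration.
example = d_prime(hit_rate=0.80, false_alarm_rate=0.30)
print(f"d' = {example:.2f}; meets the d' >= 1 criterion: {example >= 1.0}")
```

      Under this definition, equal hit and false-alarm rates give d' = 0, so a d' of 1 corresponds to a clear separation between responses to target and reference stimuli.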

      Increased representation of periodicity envelope in the AN - the mechanisms for increased representation of periodicity envelope cues are unclear. The authors point to some potential central mechanisms, but given that these are recordings from the auditory nerve, what central mechanisms these may be is unclear. If the authors are suggesting some form of efferent modulation only at the f0 frequency, no evidence for this is presented. It appears more likely that the enhancement may be due to outer hair cell dysfunction (widened tuning, distorted tonotopy). Given this increased envelope coding, the potential change in sensation level for the behavior (from the comment above), and no change in neural coding of TFS cues across any of the groups, a simpler interpretation may be: TFS coding is not affected in remaining auditory nerve fibers after age-related or ouabain-induced synapse loss, but behavioral performance is affected by outer hair cell dysfunction with age. 

      A similar point was made by Reviewer #1. As indicated above, new data on threshold sensitivity and neural tuning were added in a new section of the Results which indirectly suggest that significant OHC pathologies were not a concern, neither in our young-adult, synaptopathic gerbils nor in the old gerbils.  

      Emerging evidence seems to suggest that cochlear synaptopathy and/or TFS encoding abilities might be reflected in listening effort rather than behavioral performance. Measuring some proxy of listening effort in these gerbils (like reaction time) to see if that has changed with synapse loss, especially in the young animals with induced synaptopathy, would make an interesting addition to explore perceptual deficits of TFS coding with synapse loss.  

      This is an interesting suggestion that we now explore in the revision of the manuscript. Reaction times can be used as a proxy for listening effort and were recorded for all responses. The new analysis, now reported in lines 378 - 396, compared young-adult control gerbils with young-adult gerbils that had been treated with the high concentration of ouabain. No differences in response latencies were found, indicating that listening effort did not change with synapse loss.  

      Reviewer #1 (Recommendations for the authors): 

      Figure 2: The y-axis labeled as "Frequency" is potentially misleading since there are additional frequency values on the right side of the panels. It would be helpful to clarify more in the caption what these right-side frequency values represent. Additionally, the legend could be positioned more effectively for clarity.

      Thank you for your suggestion. The axis label was rephrased.

      Figure 7: This figure is a bit unclear, as it appears to show two sets of gerbil data at 1500 Hz, yet the difference between them is not explained.  

      We added the following text to the figure legend: “The higher and lower thresholds shown for the gerbil data reflect thresholds at fc of 1600 Hz for fundamentals f0 of 200 Hz and 400 Hz, respectively.”

      Maybe a short description of fmax that is used in Figure 4 could help or at least point to supplementary for finding the definition.  

      We thank the reviewer for pointing out this typo/inaccuracy. The correct terminology in line with the remainder of the manuscript is “fmaxpeak”. We corrected the caption of figure 5 (previously figure 4) and added the reference pointing to figure 11 (previously figure 9), which explains the terms.

      I couldn't find information about the possible availability of data. 

      The auditory-nerve recordings reported in this paper are part of a larger study of single-unit auditory-nerve responses in gerbils, formally described and published by Heeringa (2024) Single-unit data for sensory neuroscience: Responses from the auditory nerve of young-adult and aging gerbils. Scientific Data 11:411, https://doi.org/10.1038/s41597-024-03259-3. As soon as the Version of Record is submitted, the raw single-unit data can be accessed directly through the following link: https://doi.org/10.5061/dryad.qv9s4mwn4. The data that are presented in the figures of the present manuscript and were statistically analyzed are uploaded to the Zenodo repository (https://doi.org/10.5281/zenodo.15546625).  

      Reviewer #2 (Recommendations for the authors): 

      L22. The term "hidden hearing loss" is used in many different ways in the literature, from being synonymous with cochlear synaptopathy, to being a description of any listening difficulties that are not accounted for by the audiogram (for which there are many other / older terms). The original usage was much more narrow than your definition here. It is not correct that Schaette and McAlpine defined HHL in the broad sense, as you imply. I suggest you avoid the term to prevent further confusion.  

      We eliminated the term hidden hearing loss.

      L43. SNHL is undefined.

      Thank you for catching that. The term is now spelled out.

      L64. "whether" -> "that"  

      We corrected this issue.

      L102. It would be informative to see the synapse counts (across groups) for the animals tested in the behavioral part of the study. Did these vary between groups in the same way?  

      Yes, the pattern was the same for the subgroup of behaviorally tested animals. We have added this information to the revised version of the manuscript (lines 137 – 141).

      L108. How many tests were considered in the Bonferroni correction? Did this cover all reported tests in the paper?  

      The comparisons of synapse numbers between treatment groups were done with full Bonferroni correction, as in the other tests involving posthoc pair-wise comparisons after an ANOVA.

      Figure 1 and 6 captions. Explain meaning of * and ** (criteria values).  

      The information was added to the figure legends of now Figs. 1 and 7. 

      L139. I don't follow the argument - the mean driven rate is not important. It is the rate at individual CFs and how that changes with frequency shift that provides the cue.

      L142. I don't follow - individual driven rates might have been a cue (some going up, some down, as frequency was shifted).  

      Yes, theoretically it is possible that the spectral pattern of driven rates (i.e., excitation pattern) can be specifically used for profile analysis and subsequently as a strong cue for discriminating the TFS1 stimuli. In order to shed some light on this question with regard to the actual stimuli used in this study, we added a comprehensive figure showing simulated excitation patterns (figure 8). The excitation patterns were generated with a gammatone filter bank and auditory filter bandwidths appropriate for gerbils (Kittel et al. 2002). The simulated excitation patterns allow at least semi-quantitative conclusions to be drawn about the possibility of profile analysis: 1. In the 200/1600 Hz and 400/3200 Hz conditions (i.e., harmonic number of fc is 8), the difference between all inharmonic excitation patterns and the harmonic reference excitation pattern is far below the threshold for intensity discrimination (Sinnott et al. 1992). 2. In the same conditions, the statistics of the pink noise make excitation pattern differences at or beyond the filter slopes (on both high and low frequency limits) useless for frequency shift discrimination. 3. In the 400/1600 Hz condition (i.e., harmonic number of fc is 4), there is a non-negligible possibility that excitation pattern differences were a main cue for discrimination. All of these conclusions are compatible with the results of our study.

      L193. Is this p-value Bonferroni corrected across the whole study? If not, the finding could well be spurious given the number of tests reported.  

      Yes, it is Bonferroni corrected.

      L330. TFS is already defined.  

      L346. AN is already defined.  

      L408. "temporal fine structure" -> "TFS"  

      It was a deliberate decision to define these terms again in the Discussion, for readers who prefer to skip most of the detailed Results. 

      L364-366. This argument is somewhat misleading. Cochlear resolvability largely depends on the harmonic spacing (i.e., F0) relative to harmonic frequency (in other words, on harmonic rank). Marmel et al. (2015) and Moore and Sek (2009) used a center frequency (at least) 11 times F0. Here, the center frequency was only 4 or 8 times F0. In human, this would not be sufficient to eliminate excitation pattern cues.  

      We have now included results from modeling the excitation patterns in the discussion, with a new figure demonstrating that at a center frequency of 8 times F0, excitation patterns provide no useful cue, while this is a possibility at a center frequency of 4 times F0 (Fig. 8, lines 440 - 446).

      L541. Was that a spectrum level of 20 dB SPL (level per 1-Hz wide band) at 1 kHz? Need to clarify.  

      The power spectral density of the pink noise at 1 kHz (i.e., the level in a 1 Hz wide band centered at 1 kHz) was 13.3 dB SPL. The total level of the pink noise (including edge filters at 100 Hz and 11 kHz) was 50 dB SPL.
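
      The two levels quoted in this response can be cross-checked with a short calculation. The sketch below assumes an ideal 1/f (pink) power spectral density and brick-wall band edges at 100 Hz and 11 kHz, which simplifies the actual edge filters; the function and parameter names are illustrative only.

```python
import math


def pink_noise_total_level(spectrum_level_db: float, f_ref: float,
                           f_low: float, f_high: float) -> float:
    """Total level (dB) of ideal pink noise between f_low and f_high.

    With a power spectral density S(f) = S(f_ref) * f_ref / f, integrating
    from f_low to f_high gives a total power of
    S(f_ref) * f_ref * ln(f_high / f_low).
    """
    bandwidth_gain_db = 10.0 * math.log10(f_ref * math.log(f_high / f_low))
    return spectrum_level_db + bandwidth_gain_db


# Spectrum level of 13.3 dB SPL in a 1-Hz band at 1 kHz, band edges as stated.
total = pink_noise_total_level(13.3, f_ref=1000.0, f_low=100.0, f_high=11000.0)
print(f"total level = {total:.1f} dB SPL")
```

      Under these idealizations the result is very close to the stated total level of 50 dB SPL, so the two numbers are mutually consistent.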

      L919. So was the correction applied across only the tests within each ANOVA? Don't you need to control the study-wise error rate (across all primary tests) to avoid spurious findings?  

      We added information about the family-wise error rate (line 1077 - 1078). Since the ANOVAs tested different specific research questions, we do not think that we need to control the study-wise error rate.
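
      The distinction between family-wise and study-wise correction raised in this exchange can be made concrete with a minimal sketch of the Bonferroni adjustment. The p-values and the study-wide test count below are hypothetical and serve only to show how the size of the correction family changes the adjusted values.

```python
def bonferroni(p_values, n_tests=None):
    """Bonferroni-adjusted p-values: each p is multiplied by the number
    of tests in the correction family and capped at 1."""
    m = len(p_values) if n_tests is None else n_tests
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values from post hoc pairwise comparisons after one ANOVA.
raw = [0.004, 0.020, 0.300]

family_wise = bonferroni(raw)             # family = the 3 tests of this ANOVA
study_wise = bonferroni(raw, n_tests=30)  # family = 30 tests (hypothetical study-wide count)

for p, fw, sw in zip(raw, family_wise, study_wise):
    print(f"raw={p:.3f}  family-wise={fw:.3f}  study-wise={sw:.3f}")
```

      The same raw p-value can therefore pass a family-wise threshold and fail a study-wise one, which is the trade-off at issue between the reviewer and the authors.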

      Reviewer #3 (Recommendations for the authors): 

      There was no difference in TFS sensitivity in the AN fiber activity across all the groups. Potential deficits with age were only found in the behavioral paradigm. Given that, it might make it clearer to specify that the deficits or lack thereof are in behavior, in multiple instances in the manuscript where it says synaptopathy showed no decline in TFS sensitivity (For example Line 342-344).  

      We carefully went through the entire text and clarified a couple more instances.

      L353 - this statement is a bit too strong. It implies causality when there is only a co-occurrence of increased f0 representation and age-related behavioral deficits in TFS1 task.  

      The statement was rephrased as “Thus, cue representation may be associated with the perceptual deficits, but not reduced synapse numbers, as originally proposed.”

      L465-467 - while this may be true, I think it is hard to say this with the current dataset where only AN fibers are being recorded from. I don't think we can say anything about afferent central mechanisms with this data set.  

      We agree. However, we refer here to published data on central inhibition to provide a possible explanation. 

      Hearing thresholds with ABRs are mentioned in the methods, but that data is not presented anywhere. Would be nice to see hearing thresholds across the various groups to account or discount outer hair cell dysfunction. 

      This important point was made repeatedly and we thank the Reviewers for it. As indicated above, new data on threshold sensitivity and neural tuning were added in a new section of the Results which indirectly suggest that significant OHC pathologies were not a concern, neither in our young-adult, synaptopathic gerbils nor in the old gerbils.

    1. Author response:

      The following is the authors’ response to the original reviews

      eLife Assessment

      This valuable study presents a theoretical model of how punctuated mutations influence multistep adaptation, supported by empirical evidence from some TCGA cancer cohorts. This solid model is noteworthy for cancer researchers as it points to the case for possible punctuated evolution rather than gradual genomic change. However, the parametrization and systematic evaluation of the theoretical framework in the context of tumor evolution remain incomplete, and alternative explanations for the empirical observations are still plausible.

      We thank the editor and the reviewers for their thorough engagement with our work. The reviewers’ comments have drawn our attention to several important points that we have addressed in the updated version. We believe that these modifications have substantially improved our paper.

      There were two major themes in the reviewers’ suggestions for improvement. The first was that we should demonstrate more concretely how the results in the theoretical/stylized modelling parts of our paper quantitatively relate to dynamics in cancer.

      To this end, we have now included a comprehensive quantification of the effect sizes of our results across large and biologically-relevant parameter ranges. Specifically, following reviewer 1’s suggestion to give more prominence to the branching process, we have added two figures (Fig S3-S4) quantifying the likelihood of multi-step adaptation in a branching process for a large range of mutation rates and birth-death ratios. Formulating our results in terms of birth-death ratios also allowed us to provide better intuition regarding how our results manifest in models with constant population size vs models of growing populations. In particular, the added figure (Fig S3) highlights that the effect size of temporal clustering on the probability of successful 2-step adaptation is very sensitive to the probability that the lineage of the first mutant would go extinct if it did not acquire a second mutation. As a result, the phenomenon we describe is biologically likely to be most effective in those phases during tumor evolution in which tumor growth is constrained. This important pattern had not been described sufficiently clearly in the initial version of our manuscript, and we thank both reviewers for their suggestions to make these improvements.
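
      The role of the birth-death ratio in this argument can be illustrated with a textbook result. The sketch below assumes a simple linear birth-death branching process started from a single cell, which is not necessarily the exact model used in the manuscript; the function name is hypothetical.

```python
def extinction_probability(birth_rate: float, death_rate: float) -> float:
    """Probability that a lineage founded by a single cell eventually dies out
    in a linear birth-death process.

    Standard result: extinction is certain when births do not outpace deaths,
    and otherwise occurs with probability death_rate / birth_rate.
    """
    if birth_rate <= death_rate:
        return 1.0
    return death_rate / birth_rate


# Illustrative birth/death ratios; a ratio of 1 approximates constant population size.
for ratio in (1.0, 1.1, 2.0):
    q = extinction_probability(birth_rate=ratio, death_rate=1.0)
    print(f"birth/death = {ratio:.1f} -> extinction probability = {q:.3f}")
```

      Under this assumption, a first-mutant lineage in a constrained population (birth/death ratio near 1) is almost certain to go extinct unless a second mutation arrives in time, which matches the regime in which the response expects temporal clustering to matter most.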

      The second major theme in the reviewers’ suggestions was focused on how we relate our theoretical findings to readouts in genomic data, with both reviewers pointing to potential alternative explanations for the empirical patterns we describe.

      We have now extended our empirical analyses following some of the reviewers’ suggestions. Specifically, we have included analyses investigating how the contribution of reactive oxygen species (ROS)-related mutation signatures correlates with our proxies for multi-step adaptation; and we have included robustness checks in which we use Spearman instead of Pearson correlations. Moreover, we have included more discussion on potential confounds and the assumptions going into our empirical analyses as well as the challenges in empirically identifying the phenomena we describe.
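
      The Spearman-versus-Pearson robustness check mentioned here can be sketched in a few lines. The example uses only the Python standard library and a synthetic toy data set containing one extreme value; none of the numbers come from the TCGA analyses, and the helper names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Synthetic toy data: a monotone relationship with one extreme value.
x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 3, 4, 5, 100]
print(f"Pearson = {pearson(x, y):.3f}, Spearman = {spearman(x, y):.3f}")
```

      Because the rank-based coefficient depends only on ordering, it is unaffected by the extreme value, which is why it serves as a check that a Pearson correlation is not driven by a few outlying samples.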

      Below, we respond in detail to the individual comments made by each reviewer.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Grasper et al. present a combined analysis of the role of temporal mutagenesis in cancer, which includes both theoretical investigation and empirical analysis of point mutations in TCGA cancer patient cohorts. They find that temporally elevated mutation rates contribute to cancer fitness by allowing fast adaptation when the fitness drops (due to previous deleterious mutations). This may be relevant in the case of tumor suppressor genes (TSG), which follow the 2-hit hypothesis (i.e., two biallelic mutations are necessary to deactivate a TSG), and in cases where temporal mutagenesis occurs (e.g., high APOBEC, ROS). They provide evidence that this scenario is likely to occur in patients with some cancer types. This is an interesting and potentially important result that merits the attention of the target audience. Nonetheless, I have some questions (detailed below) regarding the design of the study, the tools and parametrization of the theoretical analysis, and the empirical analysis, which I think, if addressed, would make the paper more solid and the conclusion more substantiated.

      Strengths:

      Combined theoretical investigation with empirical analysis of cancer patients.

      Weaknesses:

      Parametrization and systematic investigation of theoretical tools and their relevance to tumor evolution.

      We sincerely thank Reviewer 1 for their comments. As communicated in more detail in the point-by-point replies to the “Recommendations for the authors”, we have revised the paper to address these comments in various ways. To summarize, Reviewer 1 asked for (1) more comprehensive analyses of the parameter space, especially in ranges of small fitness effects and low mutation rates; (2) additional clarifications on details of mechanisms described in the manuscript; and (3) suggested further robustness checks to our empirical analyses. We have addressed these points as follows: we have added detailed analyses of dynamics and effect sizes for branching processes (see Sections SI2 and SI3 in the Supplementary Information, as well as Figures S3 and S4). As suggested, these additions provide characterizations of effect sizes in biologically relevant parameter ranges (low mutation rates and smaller fitness effect sizes), and extend our descriptions to processes with dynamically changing population sizes. Moreover, we have added further clarifications at suggested points in the manuscript, e.g. to elaborate on the non-monotonicities in Fig 3. Lastly, we have undertaken robustness checks using Spearman rather than Pearson correlation coefficients to quantify relations between TSG deactivation and APOBEC signature contribution, and have performed analyses investigating dynamics of reactive oxygen species-associated mutagenesis instead of APOBEC.

      Reviewer #2 (Public review):

      This work presents theoretical results concerning the effect of punctuated mutation on multistep adaptation and empirical evidence for that effect in cancer. The empirical results seem to agree with the theoretical predictions. However, it is not clear how strong the effect should be on theoretical grounds, and there are other plausible explanations for the empirical observations.

      Thank you very much for these comments. We have now substantially expanded our investigations of the parameter space as outlined in the response to the “eLife Assessment” above and in the detailed comments below (A(1)-A(3)) to convey more quantitative intuition for the magnitude of the effects we describe for different phases of tumor evolution. We agree that there could be potential additional confounders to our empirical investigations besides the challenges regarding quantification that we already described in our initial version of the manuscript. We have thus included further discussion of these in our manuscript (see replies to B(1)-B(3)), and we have expanded our empirical analyses as outlined in the response to the “eLife Assessment”.

      For various reasons, the effect of punctuated mutation may be weaker than suggested by the theoretical and empirical analyses:

      (A1) The effect of punctuated mutation is much stronger when the first mutation of a two-step adaptation is deleterious (Figure 2). For double inactivation of a TSG, the first mutation--inactivation of one copy--would be expected to be neutral or slightly advantageous. The simulations depicted in Figure 4, which are supposed to demonstrate the expected effect for TSGs, assume that the first mutation is quite deleterious. This assumption seems inappropriate for TSGs, and perhaps the other synergistic pairs considered, and exaggerates the expected effects.

      Thank you for highlighting this discrepancy between Figure 2 and Figure 4. For computational efficiency and for illustration purposes, we had opted for high mutation rates and large fitness effects in Figure 2; however, our results are valid even in the setting of lower mutation rates and fitness effects. To improve the connection to Figure 4, and to address other related comments regarding parameter dependencies, we have now added more detailed quantification of the effects we describe (Figures SF3 and SF4) to the revised manuscript. These additions show that the effects illustrated in Figure 2 retain large effect sizes when going to much lower mutation rates and much smaller fitness effects. Indeed, while under high mutation rates we only see the large relative effects if the first mutation is highly deleterious, these large effects become more universal when going to low mutation rates.

      In general, it is correct that the selective disadvantage (or advantage) conveyed by the first mutation affects the likelihood of successful 2-step adaptations. It is also correct that the magnitude of the ‘relative effect’ of temporal clustering on valley-crossing is highest if the lineage with only the first of the two mutations is vanishingly unlikely to produce a second mutant before going extinct. If the first mutation is strongly deleterious, the lineage of such a first mutant is likely to quickly go extinct – and therefore also more likely to do so before producing a second mutant.

      However, this likelihood of producing the second mutant is also low if the mutation rate is low. As our added figure (Figure SF3) illustrates, at low mutation rates appropriate for cancer cells, the relative effect is insensitive to the magnitude of the fitness disadvantage for large parts of the parameter space. Especially in populations of constant size (approximated by a birth/death ratio of 1), the relative effects for first mutations that reduce the birth rate by 0.5 or by 0.05 are indistinguishable (Figure SF3f).

      Moreover, the absolute effect, as we discuss in the paper (Figures SF2 and SF3), is largest not in regions of the parameter space in which the first mutant is infinitesimally unlikely to produce a second mutant (where 𝑓<sub>𝑘</sub> and 𝑓<sub>1</sub> would be infinitesimally small), but rather in parameter regions in which this first mutant has a non-negligible chance to produce a second mutant. The absolute effect therefore peaks around fitness-neutral first mutations. While the next comment (below) says that our empirical investigations more closely resemble comparisons of relative effects and not absolute effects, we would expect that the observations in our data come preferentially from multi-step adaptations with large absolute effect, since the absolute effect is maximal when both 𝑓<sub>𝑘</sub> and 𝑓<sub>1</sub> are relatively high.

      In summary, we believe Figure 2, while having exaggerated parameters for very defendable reasons, is not a misleading illustration of the general phenomenon or of its applicability in biological settings, as effect sizes remain large when moving to biologically realistic parameter ranges. To clarify this issue, we have largely rewritten the relevant paragraphs in the results section and have added two additional figures (Figures SF3 and SF4) as well as a section in the SI with detailed discussion (SI2).

      (A2) More generally, parameter values affect the magnitude of the effect. The authors note, for example, that the relative effect decreases with mutation rate. They suggest that the absolute effect, which increases, is more important, but the relative effect seems more relevant and is what is assessed empirically.

      Thank you for this comment. As noted in the replies to the above comments, we have now included extensive investigations of how sensitive effect sizes are to different parameter choices. We also apologize for insufficiently clearly communicating how the quantities in Figure 4 relate to the findings of our theoretical models.

      The challenge in relating our results to single-timepoint sequencing data is that we only observe the mutations that a tumor has acquired, but we do not directly observe the mutation rate histories that brought about these mutations. As an alternative readout, we therefore consider (through rough proxies: TSGs and APOBEC signatures) the amount of 2-step adaptations per acquired/retained mutation. While we unfortunately cannot control for the average mutation rate in a sample, we motivate using this “TSG-deactivation score” by the hypothesis that for any given mutation rate, we expect a positive relationship between the amount of temporal clustering and the amount of 2-step adaptations per acquired/retained mutation. This hypothesis follows directly from our theoretical model, where it formally translates to the statement that, for a fixed mutation rate, the expected amount of 2-step adaptation is increasing in the degree of temporal clustering 𝑘.

      However, while both quantities from our theoretical model, the ratio 𝑓<sub>𝑘</sub>/𝑓<sub>1</sub> and the absolute effect, relate to this hypothesis (both are increasing in 𝑘), neither of them maps directly onto the formulation of our empirical hypothesis.

      We have now rewritten the relevant passages of the manuscript to more clearly convey our motivation for constructing our TSG deactivation score in this form (P. 4-6).

      (A3) Routes to inactivation of both copies of a TSG that are not accelerated by punctuation will dilute any effects of punctuation. An example is a single somatic mutation followed by loss of heterozygosity. Such mechanisms are not included in the theoretical analysis nor assessed empirically. If, for example, 90% of double inactivations were the result of such mechanisms with a constant mutation rate, a factor of two effect of punctuated mutagenesis would increase the overall rate by only 10%. Consideration of the rate of apparent inactivation of just one TSG copy and of deletion of both copies would shed some light on the importance of this consideration.

      This is a very good point, thank you. In our empirical analyses, the main motivation was to investigate whether we would observe patterns that are qualitatively consistent with our theoretical predictions, i.e. whether we would find positive associations between valley-crossing and temporal clustering. Our aim in the empirical analyses was not to provide a quantitative estimate of how strongly temporally clustered mutation processes affect mutation accumulation in human cancers. We hence restricted attention to only one mutation process which is well characterized to be temporally clustered (APOBEC mutagenesis) and to only one category of (epi)genomic changes (SNPs, in which APOBEC signatures are well characterized). Of course, such an analysis ignores that other mutation processes (e.g. LOH, copy number changes, methylation in promoter regions, etc.) may interact with the mechanisms that we consider in deactivating tumor suppressor genes.

      We have now updated the text to include further discussion of this limitation and further elaboration to convey that our empirical analyses are not intended as a complete quantification of the effect of temporal clustering on mutagenesis in-vivo (P. 10,11).
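
      The dilution argument in comment (A3) can be checked with a few lines of arithmetic. The sketch below is only a worked example of the reviewer's numbers; the function name and its parameters are hypothetical.

```python
def overall_fold_change(fraction_accelerated, fold_effect):
    """Fold change in the total rate when only a fraction of routes is sped up."""
    unaffected = 1.0 - fraction_accelerated
    return unaffected + fraction_accelerated * fold_effect

# Reviewer's example: 90% of double inactivations are unaffected by punctuation,
# and the remaining 10% proceed twice as fast.
change = overall_fold_change(fraction_accelerated=0.10, fold_effect=2.0)
print(f"overall rate changes by {100 * (change - 1):.0f}%")
```

      The smaller the share of routes that punctuation can accelerate, the closer the overall fold change stays to one, whatever the size of the effect on those routes.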

      Several factors besides the effects of punctuated mutation might explain or contribute to the empirical observations:

      (B1) High APOBEC3 activity can select for inactivation of TSGs (references in Butler and Banday 2023, PMID 36978147). This selective force is another plausible explanation for the empirical observations.

      Thank you for making this point. We agree that increased APOBEC3 activity, or any other similar perturbation, can change the fitness effect that any further changes/perturbations to the cell would bring about. Our empirical analyses therefore rely on the assumption that there are no major confounding structural differences in selection pressures between tumors with different levels of APOBEC signature contributions. We have expanded our discussion section to elaborate on this potential limitation (P. 10-11).

      While the hypothesis that APOBEC3 activity selects for inactivation of TSGs has been suggested, there remain other explanations. Either way, the ways in which selective pressures have been suggested to change would not interfere relevantly with the effects we describe. The paper cited in the comment argues that “high APOBEC3 activity may generate a selective pressure favoring” TSG mutations as “APOBEC creates a high [mutation] burden, so cells with impaired DNA damage response (DDR) due to tumor suppressor mutations are more likely to avert apoptosis and continue proliferating”. To motivate this reasoning, in the same passage, the authors cite a high prevalence of TP53 mutations across several cancer types with “high burden of APOBEC3-induced mutations”, but also note that “this trend could arise from higher APOBEC3 expression in p53-mutated tumors since p53 may suppress APOBEC3B transcription via p21 and DREAM proteins”.

      Translated to our theoretical framework, this reasoning builds on the idea that APOBEC3 activity increases the selective advantage of mutants with inactivation of both copies of a TSG. In contrast, the mechanism we describe acts by altering the chances of mutants with only one TSG allele inactivated to inactivate the second allele before going extinct. If homozygous inactivation of TSGs generally conveys relatively strong fitness advantages, lineages with homozygous inactivation would already be unlikely to go extinct. Further increasing the fitness advantage of such lineages would thus manifest mostly in a quicker spread of these lineages, rather than in changes in the chance that these lineages survive. In turn, such a change would have limited effect on the “rate” at which such 2-step adaptations occur, but would mostly affect the speed at which they fixate. It would be interesting to investigate these effects empirically by quantifying the speed of proliferation and chance of going extinct for lineages that newly acquired inactivating mutations in TSGs.

      Beyond this explicit mention of selection pressures, the cited paper also discusses high occurrences of mutations in TSGs in relation to APOBEC. These enrichments, however, are not uniquely explained by an APOBEC-driven change in selection pressures. Indeed, our analyses would also predict such enrichments.

      (B2) Without punctuation, the rate of multistep adaptation is expected to rise more than linearly with mutation rate. Thus, if APOBEC signatures are correlated with a high mutation rate due to the action of APOBEC, this alone could explain the correlation with TSG inactivation.

      Thank you for making this point. Indeed, an identifying assumption that we make is that average mutation rates are balanced between samples with a higher vs lower APOBEC signature contribution. We cannot cleanly test this assumption, as we only observe aggregate mutation counts but not mutation rates. However, the fact that we observe an enrichment for APOBEC-associated mutations among the set of TSG-inactivating mutations (see Figure 4F) would be consistent with APOBEC-mutations driving the correlations in Fig 4D, rather than just average mutation rates. We have now added a paragraph to our manuscript to discuss these points (P. 10-11).

      (B3) The nature of mutations caused by APOBEC might explain the results. Notably, one of the two APOBEC mutation signatures, SBS13, is particularly likely to produce nonsense mutations. The authors count both nonsense and missense mutations, but nonsense mutations are more likely to inactivate the gene, and hence to be selected.

      Thank you for making this point.  We have included it in our discussion of potential confounders/limitations in the revised manuscript (P. 10-11).  

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Specific questions/comments/suggestions:

      (1) For the theoretical investigation, the authors use the Wright-Fisher model with specific parameters for the decrease/increase in the fitness (0.5,1.5). This model is not so relevant to cancer, because it assumes a constant population size, while in cancer, the population is dynamic (increasing, if the tumor grows). Although I see they mention relevance to the branching process (in SI), I think the branching process should be bold in the main text and the Wright-Fisher in SI (or even dropped).

      Thank you for this comment. We agree that too little attention had been given to the branching process in the original version of our manuscript. While the Wright-Fisher process is computationally efficient to simulate and thus lends itself to clean simulations for illustrative examples, it did lead us to put undue emphasis on populations of constant size.

      The added Figures SF2 and SF3 now focus on branching processes, and we have substantially expanded our discussion of how dynamics differ as a function of the population-size trajectory (constant vs growing; SI2, P. 4,9,10). Generally, we do believe that it is appropriate to consider both regimes. If tumors evolve from being confined within their site of origin to progressively invading adjacent tissues and organ compartments, they traverse different regions of the birth-death ratio parameter space. Moreover, the timing of transitions between phases of more or less constrained growth is likely closely tied to adaptation dynamics, since breaching barriers to expansion requires adapting to novel environments and selection pressures.

      We hope that the revised version of the manuscript conveys these points more clearly, and thank you for alerting us to this imbalance in the original version of our manuscript.

      (2) The parameters 0.5 (decrease in fitness) and 1.5 (increase in fitness) seem exaggerated; the typical values for the selective advantage are usually much lower (by an order of magnitude). The same goes for the mutation rate. The authors chose values of the order 0.001, while in cancer (and generally) it is much lower than that (10<sup>-5</sup> to 10<sup>-6</sup>). I think that generally, the authors should present a more systematic analysis of the sensitivity of the results to these parameters.

      Thank you very much for this very important comment. We have made this a major focus in our revisions (see our reply to the editor’s comments). As suggested, we have now added further analyses to explore more biologically relevant parameter regimes. Reviewer 2 has made a similar remark, and to avoid redundancy, we refer to our response to that comment (A1) for more detail.

      (3) In Figure 3, the authors explore the sensitivity to mu (mutation rate) and k (temporal clustering) and find a non-monotonic behavior (Figure 3C). However, this behavior is not well explained. I think some more explanations are required here.

      Thank you for pointing this out. We had initially relegated the more detailed explanations to the SI2 (which in the revised manuscript became SI4), but are happy to provide more elaboration in the main text, and have done so now (P. 5).

      For μ, the non-monotonicity reflects the exploration-exploitation tradeoff that this section is dedicated to: very small μ values (little exploration) prevent the population from finding fitness peaks. In contrast, once a fitness peak is reached, excessively large μ values (little exploitation) scatter the population away from this peak to points of lower fitness.

      For 𝑘, the most relevant dynamic is that at high 𝑘, the population becomes unable to find close-by fitness improvements (1-step adaptations) if it is not in a burst. As 𝑘 increases, this delay in adaptation (until a burst occurs) eventually comes to outweigh the benefits of high 𝑘 (better ability to undergo multi-step adaptations). Additionally, if 𝑘 ∙ μ becomes very large, clonal interference eventually leads to diminishing exploration-returns when 𝑘 is increased further (Fig 5C), as the per-cell likelihood of finding a specific fitness peak eventually saturates and increasing 𝑘 only causes multiple cells to find the same peak, rather than one cell finding this peak and its lineage fixating in the population.

      (4) In Figure 5, where the authors show the accumulation of the first (red; deleterious mutation) and second (blue; advantageous mutation), it seems that the fraction of deleterious mutations is much lower than that of advantageous mutations. This is opposite to the case of cancer, where most of the mutations are 'passengers', (slightly) deleterious or neutral mutations. Can the author explain this discrepancy and generally the relation of their parametrization to deleterious vs. advantageous mutations?

      Thank you for this comment. In general, we have focused attention in our paper on sequences of mutations that bring about a fitness increase. We call those sequences ‘adaptations’ and categorize these as one-step or multi-step, depending on whether or not they contain intermediate states with a fitness disadvantage.

      In our modelling, we do not consider mutations that are simply deleterious and are not a necessary part of a multi-step adaptation sequence. The motivation for this abstraction is, firstly, to focus on adaptation dynamics, and secondly, that in certain limits (small mu and large constant population sizes), lineages with only deleterious mutations have a probability close to one of going extinct, so that any emerging deleterious mutant would likely be ‘washed out’ of the population before a new mutation emerges.

      However, whether the dynamics of how neutral or deleterious passenger mutations are acquired also vary relevantly with the extent of temporal clustering is a valid and interesting question that would warrant its own study. The types of theoretical arguments for such an investigation would be very similar to the ones we use in our paper.

      (5) The theoretical investigation assumes a multi/2-step adaptation scenario where the first mutation is deleterious and the second is advantageous. I think this should be generalized and further explored. For example, what happens when there are multiple mutations that are slightly deleterious (as probably is the case in cancer) and only much later mutations confer a selective advantage? How stable is the "valley crossing" if more deleterious mutations occur after the 2 steps?

      This is also an important point and relates in part to the previous comment (4).  For discussion of interactions with deleterious mutations, please see the reply to comment (4).  

      Regarding generalizations of this valley-crossing scenario, note that any sequence of mutations that increases fitness can be decomposed into sequences of either one-step or multi-step adaptations, as defined  in the paper. Therefore, if all intermediate states before the final selectively advantageous state have a selective disadvantage making the lineages of such cells likely to go extinct, then our derivations in S1 apply, and the relative effect of temporal clustering becomes where n is the number of intermediate states. If, conversely, any of the intermediate states already had a selective advantage, then our model would consider the subsequence until this first mutation with a selective advantage as its individual (one-step or multi-step) “adaptation”.

      The second question, “How stable is the "valley crossing" if more deleterious mutations occur after the 2 steps?”, touches on a different property of the population dynamics, namely on how the fate of a mutant lineage depends on how this lineage emerged. In our paper, we compare different levels of temporal clustering for a fixed average mutation rate. This choice implies that, if we assume that the mutant that emerges from a valley-crossing does not go extinct, then the number of deleterious mutations expected to occur in this lineage, once emerged, will not depend on the extent of temporal clustering. However, if in-burst mutation rates increased the expected burden of early acquired deleterious mutations sufficiently much to affect the probability that the lineage with a multi-step adaptation goes extinct before the burst ends, then there may indeed be an interaction between effects of deleterious passengers and temporal clustering. We would, however, expect effects on this probability of early extinction to be relatively minor, since such a lineage with a selective advantage would quickly grow to large cell-numbers implying that it would require a large number of co-occurring and sufficiently deleterious mutations across these cells for the lineage to go extinct.

      (6) For the empirical analysis of TCGA cohorts, the authors focus on the contribution of APOBEC mutations (via signature analysis) to temporal mutagenesis. They find only a few cancer types (Figure 4D) that follow their prediction (in Figure 4C) of a correlation between TSG deactivation and temporal mutations in bursts. I think two main points should be addressed:

      Thank you for this comment. We will respond in detail to the corresponding points below, but would like to note here that while we find this correlation “in only a few cancer types”, we also show that only few cancer types have relevant proportions of mutations caused by APOBEC, and it is precisely in these cancer types that we find a correlation.  We have clarified this aspect in the revised version of the manuscript (P.7).

      (i) APOBEC is not the only cause for temporal mutagenesis. For example, elevated ROS and hypoxia are also potential contributors - it might therefore be important to extend the signature analysis (to include more possible sources for temporal mutagenesis). Potentially, such an extension may show that more cancer types follow the author's prediction.

      Thank you for this interesting suggestion. We have now included analogous analyses for contributions of signature SBS18 which is associated with ROS mutagenesis, and for the joint contribution of signatures SBS17a, SBS17b, SBS18 and SBS36, which all have been shown (some in a more context-dependent manner) to be associated with ROS mutagenesis. When doing so, we do not find a clear trend. However, we also do not find these signatures to account for substantial proportions of the acquired mutations, meaning that ROS mutagenesis likely also does not account for much of the variation in how temporally clustered the mutation rate trajectories of different tumors are. We have incorporated these results and their discussion in the manuscript (SI5 and Fig S8).

      (ii) The TSG deactivation score used by the authors only counts the number of mutations and does not consider if the 2 mutations are biallelic, which is highly important in this case. There are ways to investigate the specific allele of mutations in TCGA data (for example, see Ciani et al. Cell Sys 2022 PMID: 34731645). Given the focus on TSG of this study, I think it is important to account for this in the analysis.

      Thank you for making this point. We did initially consider inferring allele-specific mutation status, but decided against it as this would have shrunk our dataset substantially, thus potentially introducing unwanted biases. Determining whether two mutations lie on the same or on different alleles requires either (1) observing sequencing reads that cover the loci of both mutations, or (2) tracing whether (sets of) other SNPs on the same gene co-occur exclusively with one of the two considered mutations. These requirements lead to a substantial filtering of the observed mutations. Moreover, this filtering would be especially strong for tumors with a small overall mutation burden, as these would have fewer co-occurring SNPs to leverage in this inference. We would have hence preferentially filtered out TSG-deactivating mutations in tumors with low mutation burden. We have modified the text to address this point (P.14).

      (7) To continue point 4. I wonder why some known cancer types with high APOBEC signatures (e.g., lung, mentioned in the introduction) do not appear in the results of Figure 4. Can the author explain why it is missed?

      We do provide complete results for all categories in Supplementary Figure 3. To not overwhelm the figure in the main text, we only show the four categories with the highest average APOBEC signature contribution; beyond those four, average APOBEC signature contributions quickly drop. Lung-related categories do not feature in these top four (lung squamous cell carcinoma is fifth and lung adenocarcinoma is eighth in this ordering).

      Minors:

      (1) It is worth mentioning the relevance to resistance to treatment (see https://www.nature.com/articles/s41588-025-02187-1).

      Thank you for this suggestion. We have included a mention of the relation to this paper in the discussion section (P. 11).

      (2) Some of the figures' resolution should be improved - specifically, Figures 4, S1, and S5, which are not clear/readable.

      Thank you for pointing this out. This was the result of conversion to a Word document. We will provide TIF files in the revisions to have better resolution.

      (3) Regarding Figure 3e,f. How come that moving from K=1 to K=I doesn't show any changes in fitness - it looks as if in both cases the value fluctuates around comparable mean fitness? Is that the case?

      While fitness differences between simulations with different k manifest robustly over long time-horizons (see Fig 3C with results over  generations), there are various sources of substantial stochasticity that make the fitness values in these short-term plots (Fig3D-F) imperfect illustrations of how long-term average fitness behaves. For instance, fitness landscapes are drawn randomly which introduces variability in how high and how close-by different fitness peaks are. Similarly, there is substantial randomness since both the type (direction on the 2-D fitness landscape) and the timing of mutation are stochastic.

      The short-term plots in Fig3D-F are intended to showcase representative dynamics of transitions between points on the genotype space with different fitness values following a redrawing of the landscape – but not necessarily to provide a comparison between the height of the attained (local) fitness-maxima.  

      (4) Figures 4c,d - correlation should be Spearman, not Pearson (it's not a linear relationship).

      Thank you for this comment. As a robustness check, we have generated the same figures using Spearman and not Pearson correlations and find results that are qualitatively consistent with the initially shown results. Indeed, using Spearman correlations, all four cancer types from Fig 4D have significant correlations.
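
      The difference between the two correlation measures can be seen in a short, dependency-free sketch. The data below are synthetic values chosen for illustration, not the TCGA results, and the helper names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation: measures the strength of a linear relationship."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))

# Synthetic, monotone but strongly non-linear relationship.
x = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
y = [math.exp(10 * v) for v in x]
print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

      Because Spearman's coefficient depends only on ranks, it captures any monotone association, whereas Pearson's coefficient is attenuated when the relationship is monotone but not linear.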

      (5) Typo for E) "...in samples of the cancer types in (C) were caused by APOBEC" - it should be D (not C) I guess.

      Thank you for catching this. We fixed the typo.

      (6) Figure 5 - the mutation rate is too high (0.001), sensitivity to that? Also the fitness change is exaggerated (0.5, 1.5), and the division of mutations to 100 and 100 (200 in total) loci is not clear.

      Thank you for making this point. In this simulation setting it is unfortunately computationally prohibitively expensive to perform simulations at biologically realistic mutation rates. Therefore, we have scaled up the mutation rate while scaling down the population size. Moreover, the choice of model here is not meant to resemble a biologically realistic dynamic, but rather to create a stylized setting to be able to consider the interplay between clonal interference and facilitated valley-crossing in isolation. The key result from this figure is the separation of time scales at which low or high temporal clustering maximizes adaptability.

      However, known parameter dependencies in these models allow us to reason about how tuning individual parameters of this stylized model would affect the relative importance of effects of clonal interference. This relative importance is largest when mutants are likely to co-occur on different competing clones in a population. The likelihood of such co-occurrences decreases substantially if decreasing the mutation rate to biologically realistic values. However, this likelihood also sensitively depends on the time that it takes a clone with a one-step adaptation to spread through the population. Smaller fitness advantages, as well as larger population sizes, slow down this process of taking over the population, which increases the likelihood of clonal interference. We now discuss these points in our revised manuscript (P. 8).

      (7) In the results text (last section) "Performing simulations for 2-step adaptations, we found that fixation rates are non-monotone in k. While at low k increasing k leads to a steep increase in the fixation rate, this trend eventually levels off and becomes negative, with further increases in k leading to a decrease in the fixation rate". Where are the results of this? It should be bold and apparent.

      Thank you for alerting us that this is unclear. The relevant figure reference is indeed Fig 5C as in the preceding passage in the manuscript. However, we noticed that due to the presence of the steadily decreasing black line for 1-step adaptations, it is not easy to see that also the blue line is downward sloping. We have added a further reference to Fig 5C, and have adapted the grid spacing in the background of that figure-panel to make this trend more easily visible.

      (8) Although not inconceivable, conclusions regarding resistance in the discussion are overstated. If you want to make this statement, you need to show that in resistant tumors, the temporal mutagenesis is responsible for progression vs. non-resistant/sensitive cases (is that the case), otherwise this should be toned down.

      Thank you for pointing this out. We have tempered these conclusions in the revised version of the manuscript (P. 11).

      Reviewer #2 (Recommendations for the authors):

      (1) It might be useful to look specifically at X-linked TSGs. On the authors' interpretation, their relative inactivation rates should not be correlated with APOBEC signatures in males (but should be in females), though the size of the dataset may preclude any definite conclusions.

      Thank you for this suggestion. Indeed, the size of the dataset unfortunately makes such analyses infeasible. Moreover, it is not clear whether X-linked TSGs might have structurally different fitness dynamics than TSGs on other chromosomes. However, this is an interesting suggestion worth following up on as more synergistic pairs confined to the X-chromosome are getting identified.

      (2) Might there be value in distinguishing tumors that carry mutations expected to increase APOBEC expression from those that do not? Among several reasons, an APOBEC signature due to such a mutation and an APOBEC signature due to abortive viral infection may differ with respect to the degree of punctuation.

      This is also an interesting suggestion for future investigations, but for which we unfortunately do not have sufficient information to build a meaningful analysis. In particular, it is unclear to what extent the degree and manifestation of episodicity/punctuation varies between these different mechanisms. Burst duration and intensity, as well as out-of-burst baseline rates of APOBEC mutagenesis likely differ in ways that are yet insufficiently characterized, which would make any result of analyses like these in Fig 4 hard to interpret.

      (3) Also, in that paragraph, is "proportional to" used loosely to mean "an increasing function of"?

      Thank you for this comment. We are not quite sure which paragraph is meant, but we use the term “proportional” in a literal sense at every point it is mentioned in the paper.

      For the occurrences of the term on pages 3, 10 and 11, the word is used in reference to probabilities of reproduction (division in the branching process, or ‘being drawn to populate a spot in the next generation’ in the WF process) being “proportional” to fitness. These probabilities are constructed by dividing each individual cell’s fitness by the total fitness summed across all cells in the population. As the population acquires fitness-enhancing mutations, the resulting proportionality constant (1/total_fitness) changes, so that the mapping from ‘fitness’ to probability of reproduction in the next reproduction event changes over time. Nevertheless, this mapping always remains fitness-proportional.
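
      A minimal sketch of this construction is given below. The fitness values are illustrative, and the function names and the use of Python's `random.Random.choices` for the Wright-Fisher draw are choices made for this example rather than a description of the simulation code used in the paper.

```python
import random

def reproduction_probabilities(fitness):
    """Divide each cell's fitness by the total fitness of the population."""
    total = sum(fitness)
    return [f / total for f in fitness]

def wright_fisher_step(fitness, rng):
    """Draw the next generation (same size) with fitness-proportional weights."""
    return rng.choices(fitness, weights=fitness, k=len(fitness))

population = [1.0, 1.0, 0.5, 1.5]  # illustrative fitness values, one per cell
print(reproduction_probabilities(population))

rng = random.Random(0)  # seeded only to make the example reproducible
print(wright_fisher_step(population, rng))
```

      If a cell's fitness changes, the total in the denominator changes with it, so every cell's probability shifts even though the mapping stays proportional to fitness.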

      On page 4, the term is used as follows: “the absolute rates 𝑓<sub>𝑘</sub> and 𝑓<sub>1</sub> are proportional to µ<sup>n+1</sup>”. Here, proportionality in the literal sense follows from the equations on page 20, when setting , so that the second factor becomes µ<sup>n+1</sup>. We have included a clarifying sentence to address this in the derivations (SI1).
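
      What literal proportionality to µ<sup>n+1</sup> implies can be shown with a short sketch. The function name is hypothetical; only the scaling itself is taken from the text above, with n read as the number of intermediate states.

```python
def fold_change_in_rate(mu_fold, n):
    """Fold change of a rate proportional to mu**(n + 1) when mu is scaled by mu_fold."""
    return mu_fold ** (n + 1)

# n = 1 intermediate state corresponds to a 2-step adaptation.
for n in (1, 2, 3):
    print(f"n={n}: doubling mu multiplies the rate by {fold_change_in_rate(2, n)}")
```

      The proportionality constant cancels in such ratios, which is why the fold change depends only on the scaling of µ and on n.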

      (4) It could be mentioned in the main text that the time between bursts (d) must not be too short in order for the effect to be substantial. I would think that the relevant timescale depends on how deleterious the initial mutation is.

      Thank you for making this interesting and very relevant point. We have included a section (SI3) and Figure (Fig S4) in the supplement to investigate the dependence on d. In short, we find that effects are weaker for small inter-burst intervals. The sensitivity to the burst size is highest for inter-burst intervals that are sufficiently small so that the lineage of the first mutant has relevant probability of surviving long enough to experience multiple burst phases.

      (5) Why not report that relative rate for Figure 2E as for 2D, as the former would seem to be more relevant to TSGs? And why was it assumed that the first inactivation is deleterious in the simulations in Figure 4 if the goal is to model TSGs?

      Thank you for noting this. For how we revised the paper to better connect Figures 2 and 4, please see our comment (A1) above. In general, neither 2E nor 2D should serve as quantitative predictions for what effect size we should expect in real world data, but are rather curated illustrations of the general phenomenon that we describe: we chose high mutation rates and exaggerated fitness effects so that dynamics become visually tractable in small simulation examples.

      For figure 4, assuming that the first inactivation is deleterious achieves that the branching process for the mutant lineage becomes subcritical, which keeps the simulation example simple and illustrative. For more comprehensive motivation of the approach in 4D, and especially the discussion of how fitness effects of different magnitudes may or may not be subject to the effects we describe depending on whether the population is in a phase of constant or growing population size, we refer the reader to our added section SI2, and the added discussion on pages 6 and 10.
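
      What "subcritical" means here can be illustrated with a simple binary branching process. The sketch below is an assumption-laden toy, not the simulation from the manuscript: each cell either divides or dies, and the mapping from fitness to division probability (fitness divided by fitness plus a death rate of 1) is introduced only for this example.

```python
import random

def division_probability(fitness, death_rate=1.0):
    """Assumed mapping for this sketch: division competes with a fixed death rate."""
    return fitness / (fitness + death_rate)

def extinction_probability(p_divide):
    """Smallest root of q = (1 - p) + p * q**2 for a binary branching process."""
    if p_divide <= 0.5:
        return 1.0  # subcritical or critical: the lineage dies out with certainty
    return (1.0 - p_divide) / p_divide

def simulate_extinction(p_divide, rng, trials=2000, cap=200):
    """Monte Carlo check: fraction of single-cell lineages that reach zero cells."""
    extinct = 0
    for _ in range(trials):
        cells = 1
        # Follow one division or death event at a time until the lineage
        # dies out or is large enough to be counted as established.
        while 0 < cells < cap:
            cells += 1 if rng.random() < p_divide else -1
        extinct += cells == 0
    return extinct / trials

rng = random.Random(1)
for fitness in (0.5, 1.5):
    p = division_probability(fitness)
    print(f"fitness={fitness}: p_divide={p:.3f}, "
          f"analytic={extinction_probability(p):.3f}, "
          f"simulated={simulate_extinction(p, rng):.3f}")
```

      With a deleterious first mutation the lineage is certain to die out eventually, so a second mutation can only rescue it while it is still present, which is what keeps the example simple.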

      (6) Figure 2, D and E. I'm not sure why heatmaps with height one were provided rather than simple plots over time. It is difficult, for example, to determine from a heatmap whether the increase is linear or the relative rates with and without punctuation.

      Thank you for this comment. These are not heatmaps with height one, but rather for every column of pixels, different segments of that column correspond to different clones within that population. This approach is intended to convey the difference in dynamics between the results in Fig 2 and the analogous results for a branching process in Fig S1. In Fig 2, valley-crossings happen sequentially, with subsequent fixations of adapted mutants. In Fig S1, with a growing population size, multiple clones with different numbers of adaptations coexist. We have now adapted the caption of Fig 2 to clarify this point.

      (7) Page 3: "High mutation rates are known to limit the rate of 1-step adaptations due to clonal interference." This is a bit misleading, as it makes it sound like increasing the mutation rate decreases the rate of one-step adaptations.

      Thank you for alerting us to this poor phrasing. We have changed it in the revised version of the manuscript (P. 3).

      (8) Page 4: "proportional to \mu^{n+1}" Is "proportional" being used loosely for "an increasing function of"?

      It is meant in the literal mathematical sense (see response to comment (3)).

      (9) Page 5, near bottom: "at least two mutations across the population". In the same genome?

      We counted mutations irrespective of whether they emerged in the same genome, to remain analogous to the TCGA analyses for which we also do not have single cell-resolved information.

      (10) Page 6: "missense or nonsense mutation". What about indels? If these are not affected by APOBEC, omitting them will exaggerate the effect of punctuation.

      Thank you for pointing out that this focus on single nucleotide substitutions conveys an exaggerated image of the importance of this effect of APOBEC-driven mutagenesis. There are of course several other classes of (epi)genomic alterations (e.g. chromatin modifications, methylation changes, copy number changes) that we do not consider in this part of our analysis. APOBEC mutagenesis serves as an example of a temporally clustered mutation process, which we investigate in its domain of action.

      We have added further discussion (P. 10-11) to convey that our empirical results merely constitute an investigation of whether empirical patterns are consistent with our hypothesis, but that the narrow focus on only SNVs, only TSGs, and only APOBEC mutagenesis does not allow for a general quantitative statement about the in-vivo relevance of the phenomena we describe.

      (11) Page 6: "normalized by the total number of single nucleotide substitutions." It is difficult to know how to normalize correctly, but I might think that the square of the number of substitutions would be more appropriate. Perhaps the total numbers are close enough that it matters little.

      Thank you for noting this. In the revised manuscript we have now expanded this passage in the text to more clearly convey our motivations for why we normalize by the total number of single nucleotide substitutions. While the likelihood for crossing a fitness valley with 2 mutations is indeed proportional to the square of the mutation rate, we do not directly observe mutation rates from our data. Rather, we observe the number of acquired single nucleotide substitutions for every tumor sample, but since tumors in our data differ in the time since initiation and therefore differ in the numbers of divisions their cells have undergone before being sequenced, we cannot directly infer mutation rates. One way to phrase our main result about valley-crossing is that temporally clustered mutation processes have an increased rate of successful valley-crossings per attempted valley-crossing. Our TSG deactivation score is constructed to reflect this idea. The number of TSGs serves as a proxy for successful valley-crossings and the total mutation burden serves as a proxy for attempted valley-crossings.

      To convey these points more clearly, we have rewritten the first paragraph in the Section “Proxies for valley crossing and for temporal clustering found in patient data” (P.6)
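
      The normalization argument above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the TSG deactivation score is a plain ratio of deactivated TSGs to total single nucleotide substitutions; the function name and the example counts are hypothetical and are not taken from the manuscript.

```python
def tsg_deactivation_score(n_deactivated_tsgs: int, n_total_snvs: int) -> float:
    """Hypothetical score: proxy for successful valley-crossings (deactivated
    TSGs) divided by proxy for attempted valley-crossings (total SNV burden)."""
    if n_total_snvs <= 0:
        raise ValueError("total SNV count must be positive")
    return n_deactivated_tsgs / n_total_snvs


# Made-up counts for two tumors that differ by a common factor of three.
score_small = tsg_deactivation_score(n_deactivated_tsgs=2, n_total_snvs=1000)
score_large = tsg_deactivation_score(n_deactivated_tsgs=6, n_total_snvs=3000)
print(score_small, score_large)
```

      In this toy case both hypothetical tumors receive the same score, which is the sense in which the ratio can be read as successes per attempt rather than as a raw count.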

      (12) Perhaps embed links to the COSMIC web pages for SBS2 and SBS13 in the text.

      Thank you for this suggestion. We have embedded the links at the first mention of SBS2 and SBS13 in the text.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this study, Jeong and Choi examine neural correlates of behavior during a naturalistic foraging task in which rats must dynamically balance resource acquisition (foraging) with the risk of threat. Rats first learn to forage for sucrose reward from a spout, and when a threat is introduced (an attack-like movement from a "LobsterBot"), they adjust their behavior to continue foraging while balancing exposure to the threat, adopting anticipatory withdrawal behaviors to avoid encounters with the LobsterBot. Using electrode recordings targeting the medial prefrontal cortex (PFC), they identify heterogeneous encoding of task variables across prelimbic and infralimbic cortex neurons, including correlates of distance to the reward/threat zone and correlates of both anticipatory and reactionary avoidance behavior. Based on analysis of population responses, they show that prefrontal cortex switches between different regimes of population activity to process spatial information or behavioral responses to threat in a context-dependent manner. Characterization of the heterogeneous coding scheme by which frontal cortex represents information in different goal states is an important contribution to our understanding of brain mechanisms underlying flexible behavior in ecological settings.

      Strengths:

      As many behavioral neuroscience studies employ highly controlled task designs, relatively less is generally known about how the brain organizes navigation and behavioral selection in naturalistic settings, where environment states and goals are more fluid. Here, the authors take advantage of a natural challenge faced by many animals - how to forage for resources in an unpredictable environment - to investigate neural correlates of behavior when goal states are dynamic. Related to this, they also investigate how prefrontal cortex (PFC) activity is structured to support different functional "modes" (here, between a navigational mode and a threat-sensitive foraging mode) for flexible behavior. Overall, an important strength and real value of this study is the design of the behavioral experiment, which is trial-structured, permitting strong statistical methods for neural data analysis, yet still rich enough to encourage natural behavior structured by the animal's volitional goals. The experiment is also phased to measure behavioral changes as animals first encounter a threat, and then learn to adapt their foraging strategy to its presence. Characterization of this adaptation process is itself quite interesting and sets a foundation for further study of threat learning and risk management in the foraging context. Finally, the characterization of single-neuron and population dynamics in PFC in this naturalistic setting with fluid goal states is an important contribution to the field. Previous studies have identified neural correlates of spatial and behavioral variables in frontal cortex, but how these representations are structured, or how they are dynamically adjusted when animals shift their goals, has been less clear. The authors synthesize their main conclusions into a conceptual model for how PFC activity can support mode switching, which can be tested in future studies with other task designs and functional manipulations.

      Weaknesses:

      While the task design in this study is intentionally stimulus-rich and places minimal constraint on the animal to preserve naturalistic behavior, this also introduces confounds that limit interpretability of the neural analysis. For example, some variables which are the target of neural correlation analysis, such as spatial/proximity coding and coding of threat and threat-related behaviors, are naturally entwined. To their credit, the authors have included careful analyses and control conditions to disambiguate these variables and significantly improve clarity.

      The authors also claim that the heterogeneous coding of spatial and behavioral variables in PFC is structured in a particular way that depends on the animal's goals or context. As the authors themselves discuss, the different "zones" contain distinct behaviors and stimuli, and since some neurons are modulated by these events (e.g., licking sucrose water, withdrawing from the LobsterBot, etc.), differences in population activity may to some extent reflect behavior/event coding. The authors have included a control analysis, removing timepoints corresponding to salient events, to substantiate the claim that PFC neurons switch between different coding "modes." While this significantly strengthens evidence for their conclusion, this analysis still depends on relatively coarse labeling of only very salient events. Future experiment designs, which intentionally separate task contexts (e.g. navigation vs. foraging), could serve to further clarify the structure of coding across contexts and/or goal states.

      Finally, while the study includes many careful, in-depth neural and behavioral analyses to support the notion that modal coding of task variables in PFC may play a role in organizing flexible, dynamic behavior, the study still lacks functional manipulations to establish any form of causality. This limitation is acknowledged in the text, and the report is careful not to overinterpret suggestions of causal contribution, instead setting a foundation for future investigations.

      Thank you for the positive comment. We also acknowledge the inherent drawbacks of studying naturalistic behavior. As you also mentioned in the second round of review, separating navigation and foraging tasks in a larger apparatus, such as the one illustrated below, could better distinguish neural activity patterns associated with these different task types. To address the limitations of the current study, we have revised the report to avoid overinterpretation or unwarranted assumptions, and we appreciate that you have recognized this effort.

      Author response image 1.

      Reviewer #2 (Public review):

      Summary:

      Jeong & Choi (2023) use a semi-naturalistic paradigm to tackle the question of how the activity of neurons in the mPFC might continuously encode different functions. They offer two possibilities: either there are separate dedicated populations encoding each function, or cells alter their activity dependent on the current goal of the animal. In a threat-avoidance task, rats procured sucrose in an area of a chamber where, after remaining there for some amount of time, a 'Lobsterbot' robot attacked. In order to initiate the next trial, rats had to move through the arena to another area before returning to the robot encounter zone. Therefore the task has two key components: threat avoidance and navigating through space. Recordings in the IL and PL of the mPFC revealed encoding that depended on what stage of the task the animal was currently engaged in. When animals were navigating, neuronal ensembles in these regions encoded distance from the threat. However, whilst animals were directly engaged with the threat and simultaneously consuming reward, it was possible to decode from a subset of the population whether animals would evade the threat. Therefore the authors claim that neurons in the mPFC switched between two functional modes: representing allocentric spatial information, and representing egocentric information pertaining to the reward and threat. Finally, the authors propose a conceptual model based on these data whereby this switching of population encoding is driven by either bottom-up sensory information or top-down arbitration.

      Strengths:

      Whilst these multiple functions of activity in the mPFC have generally been observed in tasks dedicated to the study of a singular function, less work has been done in contexts where animals continuously switch between different modes of behaviour in a more natural way. Being able to assess whether previous findings of mPFC function apply in natural contexts is very valuable to the field, even outside of those interested in the mPFC directly. This also speaks to the novelty of the work; although mixed selectivity encoding of threat assessment and action selection has been demonstrated in some contexts (e.g. Grunfeld & Likhtik, 2018) understanding the way in which encoding changes on-the-fly in a self-paced task is valuable both for verifying whether current understanding holds true and for extending our models of functional coding in the mPFC.

      The authors are also generally thoughtful in their analyses and use a variety of approaches to probe the information encoded in the recorded activity. In particular, they use relatively close analysis of behaviour as well as manipulating the task itself by removing the threat to verify their own results. The use of such a rich task also allows them to draw comparisons, e.g. in different zones of the arena or different types of responses to threat, that a more reduced task would not otherwise allow. Additional in-depth analyses in the updated version of the manuscript, particularly the feature importance analysis, as well as complementary null findings (a lack of cohesive place cell encoding, and no difference in location coding dependent on direction of trajectory) further support the authors' conclusion that populations of cells in the mPFC are switching their functional coding based on task context rather than behaviour per se. Finally, the authors' updated model schematic proposes an intriguing and testable implementation of how this encoding switch may be manifested by looking at differentiable inputs to these populations.

      Weaknesses:

      The main existing weakness of this study is that its findings are correlational (as the authors highlight in the discussion). Future work might aim to verify and expand the authors' findings - for example, whether the elevated response of Type 2 neurons directly contributes to the decision-making process or just represents fear/anxiety motivation/threat level - through direct physiological manipulation. However, I appreciate the challenges of interpreting data even in the presence of such manipulations and some of the additional analyses of behaviour, for example the stability of animals' inter-lick intervals in the E-zone, go some way towards ruling out alternative behavioural explanations. Yet the most ideal version of this analysis is to use a pose estimation method such as DeepLabCut to more fully measure behavioural changes. This, in combination with direct physiological manipulation, would allow the authors to fully validate that the switching of encoding by this population of neurons in the mPFC has the functional attributes as claimed here.

      I wanted to add a minor comment about interpreting the two possible accounts presented in fig. 8 to suggest a third possibility: that both bottom-up sensory and top-down arbitration mechanisms can occur simultaneously to influence whether the activity of the population switches. Indeed, a model where these inputs are balanced or pitted against each other, so to speak, to continuously modulate encoding in the mPFC seems both adaptive and likely. Further, some speculation on the source of the 'arbitrator' in the top-down account would make this model more tractable for future testing of its validity.

      We thank the reviewer for highlighting this important perspective. We fully agree that an intricate and recurrent interaction between bottom-up and top-down modulations is a highly plausible account of how the mPFC changes its encoding mode. In line with this suggestion, we have incorporated this idea as a third possibility in the revised Discussion, alongside an updated version of Figure 8 that explicitly illustrates this competitive model.

      Although we were unable to identify a definitive study directly measuring how the mPFC switches encoding modes across tasks, we did find relevant human EEG and fMRI studies addressing this issue. Based on these findings, we now propose the anterior cingulate cortex (ACC) as a potential hub for top-down arbitration. We have added a paragraph in the Discussion describing this possibility and its implications for future testing.

      “Which brain region might act as this arbitrator? Evidence from human neuroimaging studies implicates the anterior cingulate cortex (ACC) as a central hub for switching cognitive modes. During task switching, the ACC shows increased activation (Hyafil et al., 2009), enhances connectivity with task-specific regions (Aben et al., 2020), correlates with multitask performance (Kondo et al., 2004), and monitors the reliability of competing decision systems (Lee et al., 2014). Collectively, these findings point to a pivotal role for the ACC in coordinating task assignment. Rodent studies also link the ACC to strategic mode switching (Tervo et al., 2014), suggesting that the rodent ACC could similarly arbitrate between strategies, determining which task-relevant variables are represented in the ventral mPFC, including the PL and IL. Future studies combining multi-context tasks with causal manipulations will be essential to determine whether these functional shifts are driven primarily by top-down arbitration or by bottom-up sensory inputs.”

      Reviewer #3 (Public review):

      Summary:

      This study investigates how various behavioral features are represented in the medial prefrontal cortex (mPFC) of rats engaged in a naturalistic foraging task. The authors recorded electrophysiological responses of individual neurons as animals transitioned between navigation, reward consumption, avoidance, and escape behaviors. Employing a range of computational and statistical methods, including artificial neural networks, dimensionality reduction, hierarchical clustering, and Bayesian classifiers, the authors sought to predict from neural activity distinct task variables (such as distance from the reward zone and the success or failure of avoidance behavior). The findings suggest that mPFC neurons alternate between at least two distinct functional modes, namely spatial encoding and threat evaluation, contingent on the specific location.

      Strengths:

      This study attempts to address an important question: understanding the role of mPFC across multiple dynamic behaviors. The authors highlight the diverse roles attributed to mPFC in previous literature and seek to explain this apparent heterogeneity. They designed an ethologically relevant foraging task that facilitated the examination of complex dynamic behavior, collecting comprehensive behavioral and neural data. The analyses conducted are both sound and rigorous.

      Weaknesses:

      Because the study still lacks experimental manipulation, the findings remain correlational. The authors have appropriately tempered their claims regarding the functional role of the mPFC in the task. The nature of the switch between functional modes encoding distinct task variables (i.e., distance to reward, and threat-avoidance behavior type) is not established. Moreover, the evidence presented to dissociate movement from these task variables is not fully convincing, particularly without single-session video analysis of movement. Specifically, while the new analyses in Figure 7 are informative, they may not fully account for all potential confounding variables arising from changes in context or behavior.

      Regarding the claim of highly stereotyped behavior, there are some inconsistencies. While the authors assert this, Figure 1F shows inter-animal variability, and the PETHs, representing averaged activity, may not fully capture the variability of the behavior across sessions and animals. To strengthen this aspect, a more detailed analysis that examines the relationship between behavior and neural activity on a trial-by-trial basis, or at minimum, per session, could help.

      We thank the reviewer for this thoughtful recommendation and the opportunity to clarify our use of the term “stereotyped behavior.” By this, we were specifically referring to the animals’ consistent licking behavior in the E-zone, rather than to the latency of head withdrawal, which indeed varied across trials and animals. Because licking tempo and body posture during sucrose consumption were highly consistent, the decision to avoid or stay (AW vs. EW) could not be predicted from overt behavior alone. This consistency strengthens our conclusion that the significant predictive power of the Bayesian decoding analysis reflects intrinsic firing patterns of the mPFC neural network, rather than simple behavioral correlates of avoidance.

      We also note that the Bayesian decoding was performed on a trial-by-trial basis, and the reported prediction accuracy of 73% represents the average across all individual trials (Figure 6B, C). Thus, the analysis inherently captures variability across trials and animals, directly addressing the reviewer’s concern.

      The reviewer is correct that the PETHs shown in Figure 5 are based on session-averaged activity aligned to head-entry and head-withdrawal events. The purpose of this analysis was to illustrate that certain modulation patterns could be grouped into 2–3 distinct categories. While averaged activity can provide insight into collective responses to external events, we agree that trial-based analyses provide a more rigorous demonstration of the link between neural ensemble activity and behavioral decisions. This is precisely why we complemented the PETH analysis with Bayesian decoding, which provides stronger evidence that mPFC ensemble activity is predictive of the animal’s choice to avoid or stay.

      Similarly, the claim regarding the limited scope of extraneous behavior (beyond licking) requires further substantiation. It would be more convincing to quantify potential variations in licking vigor and to provide evidence for the absence of significant postural changes.

      To address this concern, we quantified licking vigor using the inter-lick interval (ILI) as an indirect index. A lick was defined as the period from tongue contact with the IR beam (Lick-On) to withdrawal (Lick-Off), and the ILI was calculated as the time between a Lick-Off and the subsequent Lick-On. Across all animals, ILIs were clustered within a narrow range with a median of 0.155 s (see Author response image 4, left panel).

      We analyzed licking vigor at two levels: within trials and within sessions. Because reduced vigor or satiation would lengthen ILIs, comparing the first half and the last half of ILIs within a trial or within a session provides a sensitive proxy for licking consistency.

      Within trials: For each of 2,820 trials, we compared the mean ILI of the first half of licks to that of the second half. The average difference was only ~17 ms (middle panel).

      Across sessions: Trial-averaged ILIs were compared between the first and last halves of each session, yielding a mean difference of ~1.7 ms per session (right panel).

      These analyses demonstrate that rats maintained stable licking vigor whenever they entered the E-zone, regardless of avoidance outcome.
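
      The ILI definition and the first-half versus second-half comparison described above can be sketched as follows. This is a minimal illustration rather than the analysis code used for the study: the function names are hypothetical, timestamps are assumed to be in seconds, and Lick-On and Lick-Off events are assumed to be paired and sorted in time.

```python
from statistics import mean


def inter_lick_intervals(lick_on, lick_off):
    """ILI = time from each Lick-Off to the subsequent Lick-On (seconds)."""
    return [next_on - off for off, next_on in zip(lick_off[:-1], lick_on[1:])]


def half_difference(ilis):
    """Mean ILI of the second half minus mean ILI of the first half.

    With an odd number of ILIs, the extra interval falls in the second half.
    """
    mid = len(ilis) // 2
    if mid == 0:
        raise ValueError("need at least two ILIs")
    return mean(ilis[mid:]) - mean(ilis[:mid])


# Made-up trial with perfectly regular licking: every ILI is about 0.15 s.
lick_on = [0.00, 0.20, 0.40, 0.60, 0.80]
lick_off = [0.05, 0.25, 0.45, 0.65, 0.85]
ilis = inter_lick_intervals(lick_on, lick_off)
print(ilis, half_difference(ilis))
```

      A positive difference would indicate lengthening ILIs (slower licking) later in the trial. The same helper could be applied to trial-averaged ILIs ordered by trial number to obtain a per-session comparison.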

      Author response image 2.

      Concerning the ANN model, while I understand the choice of a 4-layer network for its performance, the study could have benefited from exploring simpler models. A model where weight corresponds directly to individual neurons could improve interpretability and facilitate the investigation of dynamic changes in neuronal 'modes' (i.e., weight adjustments) over time.

      We fully agree with the reviewer on the importance of biologically interpretable models. While artificial neural networks (ANNs) share certain similarities with neural computation, they are not intended to capture biological realism. For example, the error-correction mechanism used in ANNs, such as backpropagation, has no direct counterpart in mammalian neural circuits. Although we considered approaches that would link each computational node more directly to the activity of individual neurons, building such a model would require temporally sensitive, mechanistic frameworks (e.g., leaky integrate-and-fire networks) and an extensive behavioral alignment effort, which is beyond the scope of the current study.

      Our use of an ANN was intended solely as an analytical tool to uncover hidden patterns in multi-unit activity that may not be detectable with traditional methods. Among various machine-learning algorithms, we selected a four-layer ANN regressor because it achieved significantly lower decoding errors (Supplementary Figure S3) and showed robustness to hyperparameter variation (Glaser et al., 2020). To acknowledge the limitations of this approach and suggest future directions, we have revised the Results section to explicitly discuss these points.

      “Among various machine learning algorithms, we selected a robust tool for decoding underlying patterns in the data, rather than to model the architecture of the mPFC. We implemented a four-layer artificial neural network regressor (ANN; see Materials and Methods for a detailed structure), as the ANN achieves significantly lower decoding errors (Supplementary Figure S3) and has robustness to hyperparameter changes (Glaser et al., 2020).”

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review): 

      In their revised manuscript, Chen et al. have added additional data that establishes GPR30 spinal neurons as a population of excitatory neurons, half of which express CCK. These data help to position GPR30 neurons in the existing framework of spinal neuron populations that contribute to neuropathic pain, strengthening the authors' findings.

      Thank you very much for your positive feedback and for recognizing the value of our additional data.

      Reviewer #3 (Public review):

      The authors did an excellent job addressing many of the critiques raised. Despite acknowledging that a direct functional corticospinal projection to CCK/GPR30+ neurons is not supported by the data and revising the title, these claims still persist throughout the manuscript. Manipulating gene expression or the activity of postsynaptic neurons through a trans-synaptic labeling strategy does not directly support any claim that those upstream neurons are directly modulating spinal neurons through the proposed pathway. Indeed, they might, but that is not demonstrated here.

      We sincerely thank the reviewer for this critical insight. We fully agree that our trans-synaptic approach does not demonstrate a direct functional connection. In response, we have revised the manuscript to remove any overstated claims of "direct" modulation and instead emphasize the critical role of spinal GPR30+ neurons. Moreover, we have added a statement in the Discussion to acknowledge this limitation and to highlight that the precise functional role of this connection requires investigation in future studies.

      Reviewer #1 (Recommendations for the authors): 

      I recommend 2 minor corrections to the text and figures.

      (1) Line 131: "What's more, near-universal CCK+ neurons were co-localized with GPR30 (Fig 2F and G)."

      The additional quantification of the overlap between GPR30 and tdTomato provided by the authors is useful, but there are inconsistencies with how the data are reported in the figures and text, making them difficult to interpret. Fig 2F supports the authors' conclusion that approximately 90% of CCK+ neurons express GPR30, and about 50% of GPR30+ neurons co-express CCK. However, the x-axis labels in Fig 2G appear to have been switched, and suggest that the opposite is true (i.e., most GPR30 neurons are CCK+, while only 50% of CCK neurons are GPR30+). Please clarify which is correct throughout the results and discussion sections.

      Thank you for identifying this important error. We apologize for the confusion caused by the mislabeled x-axis in Fig. 2G. The x-axis labels were indeed inadvertently switched. The correct reading is that approximately 90% of CCK+ neurons express GPR30. We have corrected the figure and have carefully reviewed the entire manuscript to ensure all related descriptions and discussions are consistent with the accurate quantification.
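
      The axis-label confusion comes down to which population serves as the denominator. The sketch below illustrates this with made-up cell IDs chosen so that the two fractions match the approximate values discussed above (about 90% and about 50%); the function name and the counts are hypothetical and do not reproduce the actual cell counts.

```python
def coexpression_fractions(marker_a: set, marker_b: set):
    """Return (fraction of A cells that express B, fraction of B cells that express A)."""
    both = marker_a & marker_b
    return len(both) / len(marker_a), len(both) / len(marker_b)


# Hypothetical cell IDs: 50 CCK+ cells, 90 GPR30+ cells, 45 double-positive cells.
cck_cells = set(range(0, 50))
gpr30_cells = set(range(5, 95))

frac_cck_with_gpr30, frac_gpr30_with_cck = coexpression_fractions(cck_cells, gpr30_cells)
print(frac_cck_with_gpr30, frac_gpr30_with_cck)
```

      The same 45 double-positive cells give 0.9 when divided by the CCK+ population and 0.5 when divided by the GPR30+ population, so swapping the axis labels reverses the apparent conclusion.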

      (2) The following sentence describing Figure 5 was hard to follow: Lines 190-192, "Consistent with prior observations, we found that these SDH downstream neurons exhibited colocalization with CCK+ neurons, with 28.1% of mCherry+ neurons expressing CCK (Fig 5I and J)." Since the authors are describing a common population of neurons, a statement describing the coexpression (rather than the "colocalization") would more simply summarize their data.

      We thank the reviewer for this helpful suggestion. We fully agree that "coexpression" is a more precise term for the description. We have revised the sentence on Lines 189-190 to read: "Consistent with prior observations, we found that 28.1% of mCherry+ S1-SDH downstream neurons coexpressed CCK (Fig 5I and J)."

      Reviewer #3 (Recommendations for the authors): 

      Additional Recommendations

      The authors did a commendable job revising the manuscript text to improve readability; however, several informal phrases from the original version still persist, or were added (e.g. "by the way").

      We thank the reviewer for this valuable feedback regarding the language. We have conducted a line-by-line review of the entire manuscript to identify all remaining informal phrases, and replaced them with more appropriate phrasing.

      It should be clearly mentioned that spontaneous E/IPSCs were recorded in Figure 4 and Fig S5.

      We thank the reviewer for this helpful suggestion. We have now clearly indicated that spontaneous E/IPSCs were recorded in Fig. 4, Fig. S5, and the manuscript text.

      The rationale for recording EPSCs from GFP-labeled CCK+ neurons because "a significant proportion of spinal CCK+ neurons form excitatory synapses with upstream neurons" does not make any sense. Do the authors instead mean that CCK neurons receive excitatory inputs from other spinal neurons and intend to test if those synaptic connections are modulated by GPR30?

      We thank the reviewer for this critical correction. Our intended meaning was indeed that CCK+ neurons receive excitatory inputs from other neurons, and we aimed to test whether those synaptic connections are modulated by GPR30. To avoid confusion, we have revised the manuscript, replacing the erroneous statement with: “Since CCK+ neurons mainly receive excitatory synaptic inputs from upstream neurons, we then intended to test whether GPR30 modulated these synaptic connections.”

      I am confused by the statement on Page 8 "to examine whether GPCR30-mediated EPSCs depend on AMPA mediated currents." Given that sEPSCs were recorded at -70 mV in low Cl internal I'm not sure what other glutamate receptor would be involved. Perhaps the intention was to more directly test whether GPR30 activation acutely modulates AMPAR-mediated EPSCs? However, as the authors acknowledged, this experiment does not necessarily support a solely post-synaptic AMPAR-dependent mechanism.

      We thank the reviewer for this insightful comment and apologize for the lack of clarity. Our intention was indeed to test whether GPR30 activation modulates AMPAR-mediated currents. We have revised the text accordingly. In addition, we emphasize in the Discussion that our data do not rule out potential presynaptic contributions to this effect.

      An elevation in EPSCs within a cell does not necessarily mean that the cell is more excitable, only that it is receiving more excitatory inputs or has an increase in synaptic receptors. The cell may scale down its activity to compensate for this increase. I recommend only drawing conclusions from what the experiments actually tested.

      We thank the reviewer for this crucial clarification. We have revised the manuscript to remove any claims that the cells were "more excitable". Our conclusions now strictly focus on the specific finding that GPR30 activation enhanced excitatory transmission onto CCK+ neurons.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      In the Late Triassic and Early Jurassic (around 230 to 180 Ma ago), southern Wales and adjacent parts of England were a karst landscape. The caves and crevices accumulated remains of small vertebrates. These fossil-rich fissure fills are being exposed in limestone quarrying. In 2022 (reference 13 of the article), a partial articulated skeleton and numerous isolated bones from one fissure fill of end-Triassic age (just over 200 Ma) were named Cryptovaranoides microlanius and described as the oldest known squamate - the oldest known animal, by some 20 to 30 Ma, that is more closely related to snakes and some extant lizards than to other extant lizards. This would have considerable consequences for our understanding of the evolution of squamates and their closest relatives, especially for their speed and absolute timing, and was supported in the same paper by phylogenetic analyses based on different datasets.

      In 2023, the present authors published a rebuttal (reference 18) to the 2022 paper, challenging anatomical interpretations and the irreproducible referral of some of the isolated bones to Cryptovaranoides. Modifying the datasets accordingly, they found Cryptovaranoides outside Squamata and presented evidence that it is far outside. In 2024 (reference 19), the original authors defended most of their original interpretation and presented some new data, some of it from newly referred isolated bones. The present article discusses anatomical features and the referral of isolated bones in more detail, documents some clear misinterpretations, argues against the widespread but not justifiable practice of referring isolated bones to the same species as long as there is merely no known evidence to the contrary, further argues against comparing newly recognized fossils to lists of diagnostic characters from the literature as opposed to performing phylogenetic analyses and interpreting the results, and finds Cryptovaranoides outside Squamata again.

      Although a few of the character discussions and the discussion of at least one of the isolated bones can probably still be improved (and two characters are addressed twice), I see no sign that the discussion is going in circles or otherwise becoming unproductive. I can even imagine that the present contribution will end it.

      We appreciate the positive response from reviewer 1!

      Reviewer #2 (Public review):

      Congratulations on this thorough manuscript on the phylogenetic affinities of Cryptovaranoides.

      Thank you.

      Recent interpretations of this taxon, and perhaps some others, have greatly changed the field's understanding of reptile origins- for better and (likely) for worse.

      We agree, and note that while it is possible for challenges to be worse than the original interpretations, both the original and subsequent challenges are essential aspects of what makes science, science.

      This manuscript offers a careful review of the features used to place Cryptovaranoides within Squamata and adequately demonstrates that this interpretation is misguided, and therefore reconciles morphological and molecular data, which is an important contribution to the field of paleontology. The presence of any crown squamate in the Permian or Triassic should be met with skepticism, the same sort of skepticism provided in this manuscript.

      We agree and add that every testable hypothesis requires skepticism and testing.

      I have outlined some comments addressing some weaknesses that I believe will further elevate the scientific quality of the work. A brief, fresh read‑through to refine a few phrases, particularly where the discussion references Whiteside et al., could also give the paper an even more collegial tone.

      We have followed Reviewer 2’s recommendations closely (see below) and have justified in our responses if we do not fully follow a particular recommendation.

      This manuscript can be largely improved by additional discussion and figures, where applicable. When I first read this manuscript, I was a bit surprised at how little discussion there was concerning both non-lepidosauromorph lepidosaurs as well as stem-reptiles more broadly. This paper makes it extremely clear that Cryptovaranoides is not a squamate, but would greatly benefit in explaining why many of the characters either suggested by former studies to be squamate in nature or were optimized as such in phylogenetic analyses are rather widespread plesiomorphies present in crownward sauropsids such as millerettids, younginids, or tangasaurids. I suggest citing this work where applicable and building some of the discussion for a greatly improved manuscript. In sum:

      (1) The discussion of stem-reptiles should be improved. Nearly all of the supposed squamate features in Cryptovaranoides are present in various stem-reptile groups. I've noted a few, but this would be a fairly quick addition to this work. If this manuscript incorporates this advice, I believe arguments regarding the affinities of Cryptovaranoides (at least within Squamata) will be finished, and this manuscript will be better off for it.

      (2) I was also surprised at how little discussion there was here of putative stem-squamates or lepidosauromorphs more broadly. A few targeted comparisons could really benefit the manuscript. It is currently unclear as to why Cryptovaranoides could not be a stem-lepidosaur, although I know that the lepidosaur total-group in these manuscripts lacks character sampling due to their scarcity.

      We are responding to (1) and (2) together. We agree with the Reviewer that a thorough comparison of Cryptovaranoides to non-lepidosaurian reptiles is critical. This is precisely what we did in our previous study, Brownstein et al. (2023); see the main text and supplementary information therein. As addressed therein, there is substantial convergence between early lepidosaurs and some groups of archosauromorphs (our inferred position for Cryptovaranoides). Many of those points are not addressed in detail here in order to avoid redundancy and are simply referenced back to Brownstein et al. (2023). Secondly, stem reptiles (i.e., non-lepidosauromorphs and non-archosauromorphs), such as those suggested above (millerettids, younginids, or tangasaurids), are substantially more distantly related to Cryptovaranoides (following any of the published hypotheses). As such, they share fewer traits (either symplesiomorphies or homoplasies), and so, in our opinion, we would risk losing the squamate focus of our study.

      We thus respectfully decline to engage the full scope of the problem in this contribution, but do note that this level of detailed work would make for an excellent student dissertation research program.

      (3) This manuscript can be improved by additional figures, such as the slice data of the humerus. The poor quality of the scan data for Cryptovaranoides is stated during this paper several times, yet the scan data is often used as evidence for the presence or absence of often minute features without discussion, leaving doubts as to what condition is true. Otherwise, several sections can be rephrased to acknowledge uncertainty, and probably change some character scorings to '?' in other studies.

      We strongly agree with the reviewer. Unfortunately, the original publication (Whiteside et al., 2022) did not make the raw CT scan data available, which would be needed to make this possible. As noted below in the Responses to Recommendations section, we only have access to the mesh files for each segmented element. While one of us has observed the specimens personally, we have not had the opportunity to CT scan the specimens ourselves.

      Reviewer #3 (Public review):

      Summary:

      The study provides an interesting contribution to our understanding of Cryptovaranoides relationships, which is a matter of intensive debate among researchers. My main concerns are in regard to the wording of some statements, but generally, the discussion and data are well prepared. I would recommend moderate revisions.

      Strengths:

      (1) Detailed analysis of the discussed characters.

      (2) Illustrations of some comparative materials.

      Thank you for noting the strengths inherent to our study.

      Weaknesses:

      Some parts of the manuscript require clarification and rewording.

      One of the main points of criticism of Whiteside et al. is using characters for phylogenetic considerations that are not included in the phylogenetic analyses therein. The authors call it a "non-trivial substantive methodological flaw" (page 19, line 531). I would step down from such a statement for the reasons listed below:

      (1) Comparative anatomy is not about making phylogenetic analyses. Comparative anatomy is about comparing different taxa in search of characters that are unique and characters that are shared between taxa. This creates an opportunity to assess the level of similarity between the taxa and create preliminary hypotheses about homology. Therefore, comparative anatomy can provide some phylogenetic inferences.

      That does not mean that tests of congruence are not needed. Such comparisons are the first step that allows creating phylogenetic matrices for analysis, which is the next step of phylogenetic inference. That does not mean that all the papers with new morphological comparisons should end with a new or expanded phylogenetic matrix. Instead, such papers serve as a rationale for future papers that focus on building phylogenetic matrices.

      We agree completely. We would also add that not every study presenting comparative anatomical work need be concluded with a phylogenetic analysis.

      Our criticism of Whiteside et al. (2022) and (2024) is that these studies provided many unsubstantiated claims of having recovered synapomorphies between Cryptovaranoides and crown squamates without actually having done so through the standard empirical means (i.e., phylogenetic analysis and ancestral state reconstruction). Both Whiteside et al. (2022) and (2024) list characters presented as ‘shared with squamates’ along with 10 characters presented as synapomorphies. However, their actual phylogenetically recovered synapomorphies were few in number (only 3) and these were not discussed.

      Furthermore, the comparative anatomy in Whiteside et al. (2022) and (2024) was restricted to comparing †Cryptovaranoides to crown squamates, based on the assumption that †Cryptovaranoides was a crown squamate and thus only needed to be compared to crown squamates.

      In conclusion, we respectfully maintain that such efforts are “non-trivial substantive methodological flaw(s)”.

      (2) Phylogenetic matrices are never complete, both in terms of morphological disparity and taxonomic diversity. I don't know if it is even possible to have a complete one, but at least we can say that we are far from that. Criticising a work that did not include all the possibly relevant characters in the phylogenetic analysis is simply unfair. The authors should know that creating/expanding a phylogenetic matrix is a never-ending work, beyond the scope of any paper presenting a new fossil.

      Respectfully, we did not criticize previous studies for including an incomplete phylogeny. Instead, we criticized the methodology behind the homology statements made in Whiteside et al. (2022) and Whiteside et al. (2024).

      (3) Each additional taxon has the possibility of inducing a rethinking of characters. That includes new characters, new character states, character state reordering, etc. As I said above, it is usually beyond the scope of a paper with a new fossil to accommodate that into the phylogenetic matrix, as it requires not only scoring the newly described taxon but also many that are already scored. Since the digitalization of fossils is still rare, it requires a lot of collection visits that are costly in terms of time.

      We agree on all points, but we are unsure of what the Reviewer is asking us to do relative to this study.

      (4) If I were to search for a true flaw in the Whiteside et al. paper, I would check if there is a confirmation bias. The mentioned paper should not only search for characters that support Cryptovaranoides affinities with Anguimorpha but also characters that deny that. I am not sure if Whiteside et al. did such an exercise. Anyway, the test of congruence would not solve this issue because by adding only characters that support one hypothesis, we are biasing the results of such a test.

      We would refer the Reviewer to their section (1) on comparative anatomy. As we and the Reviewer have pointed out, Whiteside et al. did not make comparative anatomical statements outside of crown Squamata in their original study. More specifically, Whiteside et al. (2022, Fig. 8) presented a phylogeny where Cryptovaranoides formed a clade with Xenosaurus within the crown of Anguimorpha, or what they termed “Anguiformes”, and made comparisons to the anatomies of the legless anguids Pseudopus and Ophisaurus. Whiteside et al. (2024) abandoned “Anguiformes”, maintained comparisons to Pseudopus, and emphasized affinities with Anguimorpha (but almost all of their phylogenies, as published, do not recover a monophyletic Anguimorpha unless amphisbaenians and snakes are considered to be anguimorphans). Thus, we agree that confirmation bias was inherent in their studies.

      To sum up, there is nothing wrong with proposing some hypotheses about character homology between different taxa that can be tested in future papers that will include a test of congruence. Lack of such a test makes the whole argumentation weaker in Whiteside et al., but not unacceptable, as the manuscript might suggest. My advice is to step down from such strong statements like "methodological flaw" and "empirical problems" and replace them with "limitations", which I think better describes the situation.

      We agree with the first sentence in this paragraph – there is nothing wrong with proposing character homologies between different taxa based on comparative anatomical studies. However, that is not what Whiteside et al. (2022) and (2024) did. Instead, they claimed that an ad hoc comparison of Cryptovaranoides to crown Squamata confirmed that Cryptovaranoides is in fact a crown squamate and likely a member of Anguimorpha. Their study did not recognize limitations, but rather, concluded that their new taxon pushed the age of crown Squamata into the Triassic.

      As noted by Reviewer 2, such a claim, and the ‘data’ upon which it is based, should be treated with skepticism. We have elected to apply strong skepticism and stringent tests of falsification to our critique.

      Reviewer #1 (Recommendations for the authors):

      (1) Lines 596-598 promise the following: "we provide a long[-]form review of these and other features in Cryptovaranoides that compare favorably with non-squamate reptiles in Supplementary Material." You have kindly informed me that all this material has been moved into the main text; please amend this passage.

      This has been deleted.

      (2) Comments on science

      41: I would rather say "an additional role".

      This has been edited accordingly.

      43: Reconstructing the tree entirely from extant organisms and adding fossils later is how Hennig imagined it, because he was an entomologist, and fossil insects are, on average, extremely rare and usually very incomplete (showing a body outline and/or wing venation and little or nothing else). He was wrong, indeed wrong-headed. As a historical matter, phylogenetic hypotheses were routinely built on fossils by the mid-1860s, pretty much as soon as the paleontologists had finished reading On the Origin of Species, and this practice has never declined, let alone been interrupted. As a theoretical matter, including as many extinct taxa as possible in a phylogenetic analysis is desirable because it breaks up long branches (as most recently and dramatically shown by Mongiardino Koch & Parry 2020), and while some methods and some kinds of data are less susceptible to long-branch attraction and long-branch repulsion than others, none are immune; and while missing data (on average more common in fossils) can actively mislead parametric methods, this is not the case with parsimony, and even in Bayesian inference the problem is characters with missing data, not taxa with missing data. Some of you have, moreover, published tip-dated phylogenetic analyses. As a practical matter, molecular data are almost never available from fossils, so it is, of course, true that analyses which only use molecular data can almost never include fossils; but in the very rare exceptions, there is no reason to treat fossil evidence as an afterthought.

      We agree and have changed “have become” to “is.”

      49-50, 59: The ages of individual fissure fills can be determined by biostratigraphy; as far as I understand, all specimens ever referred to Cryptovaranoides [13, 19] come from a single fill that is "Rhaetian, probably late Rhaetian (equivalent of Cotham Member, Lilstock Formation)" [13: pp. 2, 15].

      We appreciate this comment; the recent literature, however, suggests that variable ages are implied by the biostratigraphy at the English Fissure Fills, so we have chosen to keep this as is. Also note that several isolated bones were not recovered with the holotype but were discussed by Whiteside et al. (2024). The provenance of these bones was not clearly discussed in that paper.

      59-60: Why "putative"? Just to express your disagreement? I would do that in a less misleading way, for example: "and found this taxon as a crown-group squamate (squamate hereafter) in their phylogenetic analyses." - plural because [19] presented four different analyses of two matrices just in the main paper.

      We have removed this word.

      121-124: The entepicondylar foramen is homologous all the way down the tree to Eusthenopteron and beyond. It has been lost a quite small number of times. The ectepicondylar foramen - i.e., the "supinator" (brachioradialis) process growing distally to meet the ectepicondyle, fusing with it and thereby enclosing the foramen - goes a bit beyond Neodiapsida and also occurs in a few other amniote clades (...as well as, funnily enough, Eusthenopteron in later ontogeny, but that's independent).

      We agree. However, the important point here is that the features on the humerus of Cryptovaranoides are not comparable (they differ in location and morphology) to the ent- and ectepicondylar foramina in other reptiles, as we discuss at length. As such, we have kept this sentence as is.

      153: Yes, but you [18] mistakenly wrote "strong anterior emargination of the maxillary nasal process, which is [...] a hallmark feature of archosauromorphs" in the main text (p. 14) - and you make the same mistake again here in lines 200-206! Also, the fact [19: Figure 2a-c] remains that Cryptovaranoides did not have an antorbital fenestra, let alone an antorbital fossa surrounding it (a fossa without a fenestra only occurs in some cases of secondary loss of the fenestra, e.g., in certain ornithischian dinosaurs). Unsurprisingly, therefore, Cryptovaranoides also does not have an orbital-as-opposed-to-nasal process on its maxilla [19: Figure 2a-c].

      Lines 243-249 (in the original manuscript) deal with the emargination of the maxillary nasal process (but this does not imply a full antorbital fenestra). We explicitly state that this feature alone "has limited utility" for supporting archosauromorph affinity.

      158-173: The problem here is not that the capitellum is not preserved; from amniotes and "microsaurs" to lissamphibians and temnospondyls, capitella ossify late, and larger capitella attach to proportionately larger concave surfaces, so there is nothing wrong with "the cavity in which it sat clearly indicates a substantial condyle in life". Instead, the problem is a lack of quantification (...as has also been the case in the use of the exact same character in the debate on the origin of lissamphibians); your following sentence (lines 173-175) stands. The rest of the paragraph should be drastically shortened.

      We appreciate this comment. We note that the ontogenetic variation of this feature is in part the issue with the interpretation provided by Whiteside et al. (2024). The issue is the lack of consistency on the morphology of the capitellum in that study. We are unclear on what the reviewer means by ‘quantification,’ as the character in question is binary. 

      250-252: It's not going to matter here, but in any different phylogenetic context, "sphenoid" would be confusing given the sphenethmoid, orbitosphenoid, pleurosphenoid, and laterosphenoid. I actually recommend "parabasisphenoid" as used in the literature on early amniotes (fusion of the dermal parasphenoid and the endochondral basisphenoid is standard for amniotes).

      We have added "(=parabasisphenoid)" on first use but retain use of sphenoid because in the squamate and archosauromorph literature, sphenoid (or basisphenoid) is used more frequently.

      314-315: Vomerine teeth are, of course, standard for sarcopterygians. Practically all extant amphibians have a vomerine toothrow, for example. A shagreen of denticles on the vomer is not as widespread but still reaches into the Devonian (Tulerpeton).

      We agree, but vomerine teeth are rare in lepidosaurs and archosaurs and occur only in very recent clades, e.g., anguids and one stem scincoid. Their presence in amphibians is not directly relevant to the phylogenetic placement of Cryptovaranoides among reptiles.

      372: Fusion was not scored as present in [13], but as unknown (as "partial" uncertainty between states 0 and 1 [19:8]), and seemingly all three options were explored in [19].

      We politely disagree with the reviewer; state 1 is scored in Whiteside et al. (2024).

      377-383: Together with the partially fused NHMUK PV R37378 [13: Figure 4B, C; 19: 8], this is actually an argument that Cryptovaranoides is outside but close to Unidentata. The components of the astragalus fuse so early in extant amniotes that there is just a single ossification center in the already fused cartilage, but there are Carboniferous and Permian examples of astragali with sutures in the expected places; all of the animals in question (Diadectes, Hylonomus, captorhinids) seem to be close to but outside Amniota. (And yet, the astragalus has come undone in chamaeleons, indicating the components have not been lost.) - Also, if NHMUK PV R37378 doesn't belong to a squamate close to Unidentata, what does it belong to? Except in toothless beaks, premaxillary fusion is really rare; only molgin newts come to mind (and age, tooth size, and tooth number of NHMUK PV R37378 are wholly incompatible with a salamandrid).

      The relevance of the astragalus to the current discussion is unclear, as we do not mention this element in our manuscript. We discuss the fusion of the premaxillae in our response to the previous comment.

      471-474: That thing is concave. (The photo is good enough that you can enlarge it to 800% before it becomes too pixelated.) It could be a foramen filled with matrix; it does not look like a grain sticking to the outside of the bone. Also, spell out that you're talking about "suc.fo" in Figure 3j.

      We are also a bit confused about this comment, as we state:

      “Finally, we note here that Whiteside et al. [19] appear to have labeled a small piece of matrix attached to a coracoid that they refer to †C. microlanius as the supracoroacoid [sic] foramen in their figure 3, although this labeling is inferred because only “suc, supracoroacoid [sic]” is present in their figure 3 caption.” (L. 519-522, P. 17). We cannot verify that this structure is concave, and so we keep this text as is.

      476-489: [19] conceded in their section 4.1 (pp. 11-12) that the atlas pleurocentrum, though fused to the dorsal surface of the axis intercentrum as usual for amniotes and diadectomorphs, was not fused to the axis pleurocentrum.

      This is correct, as we note in the MS. The issue is whether these elements are clearly identifiable.

      506-510: [19:12] did identify what they considered a possible ulnar patella, illustrated it (Figure 4d), scored it as unknown, and devoted the entire section 4.4 to it.

      512-523: What I find most striking is that Whiteside et al., having just discovered a new taxon, feel so certain that this is the last one and any further material from that fissure must be referable to one of the species now known from there.

      We agree with these points and believe we have devoted adequate text to addressing them. Note that the reviewer does not recommend any revisions to these sections.

      553: Not that it matters, but I'm surprised you didn't use TNT 1.6; it came out in 2023 and is free like all earlier versions.

      We have kept this as is following the reviewer comment, and because we were interested in replicating the analyses in the previous publications that have contributed to the debate about the identity of this taxon.  For the present simple analyses both versions should perform identically, as the search algorithms for discrete characters are identical across these versions.

      562: Is "01" a typo, or do you mean "0 or 1"? In that case, rather write "0/1" or "{01}".

      This has been corrected to {01}

      (3) Comments on nomenclature and terminology

      55, 56: Delete both "...".

      This has been corrected.

      100: "ent- and ectepicondylar"

      For clarity, we have kept the full words.

      107-108: I understand that "high" is proximal and "low" is distal, but what is "the distal surface" if it is not the articular surface in the elbow joint?

      This has been corrected.

      120: "stem pan-lepidosaurs, and stem pan-squamates"; Lepidosauria and Squamata are crown groups that don't contain their stems

      This has been corrected.

      122, 123: Italics for Claudiosaurus and Delorhynchus.

      This has been corrected.

      130: Insert a space before "Tianyusaurus" (it's there in the original), and I recommend de-italicizing the two genus names to keep the contrast (as you did in line 162).

      This has been corrected.

      130, 131: Replace both "..." by "[...]", though you can just delete the second one.

      This has been corrected.

      174: Not a capitulum, but a grammatically even smaller (double diminutive) capitellum.

      This has been corrected.

      209, 224, Table 1: Both teams have consistently been doing this wrong. It's "recessus scalae tympani". The scala tympani ("ladder/staircase of the [ear]drum") isn't the recess, it's what the recess is for; therefore, the recess is named "recess of the scala tympani", and because there was no word for "of" in Classical Latin ("de" meant "off" and "about"), the genitive case was the only option. (For the same reason, the term contains "tympani", the genitive of "tympanum".)

      This has been corrected.

      415-425: This is a terminological nightmare. Ribs can have (and I'm not sure this is exhaustive): a) two separate processes (capitulum, tuberculum) that each bear an articulating facet, and a notch in between; b) the same, but with a non-articulating web of bone connecting the processes; c) a single uninterrupted elongate (even angled) articulating facet that articulates with the sutured or fused dia- and parapophysis; d) a single round articulating facet. Certainly, a) is bicapitate and d) is unicapitate, but for b) and c) all bets are off as to how any particular researcher is going to call them. This is a known source of chaos in phylogenetic analyses. I recommend writing a sentence or three on how the terms "unicapitate" & "bicapitate" lack fixed meanings and have caused confusion throughout tetrapod phylogenetics, and that the condition seen in Cryptovaranoides is nonetheless identical to that in archosauromorphs.

      This has been added: “This confusion in part stems from the lack of a fixed meaning for uni- and bicapitate rib heads; in any case, †C. microlanius possesses a condition identical to archosauromorphs as we have shown.”  (L.475-477, P.16).

      439-440: Other than in archosaurs, some squamates and Mesosaurus, in which sauropsids are dorsal intercentra absent?

      We are unclear about the relevance of the question to this section. The issue at hand is that some squamate lineages possess dorsal intercentra, so the absence of dorsal intercentra cannot be considered a squamate synapomorphy without the optimization of this feature along a phylogeny (which was not accomplished by Whiteside et al.).
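
      To illustrate what optimizing a feature along a phylogeny involves, the following is a minimal sketch of Fitch parsimony for a single discrete character. The tree topology, the taxon labels (taxon_A to taxon_D) and the state coding (0 = dorsal intercentra present, 1 = absent) are hypothetical placeholders and are not scorings from any of the matrices discussed here; the sketch is also not the TNT analysis itself.

```python
def fitch(tree, tip_states):
    """Fitch parsimony (bottom-up pass) for one unordered character.

    tree: nested 2-tuples of tip names (a strictly binary tree).
    tip_states: dict mapping tip name -> set of observed states;
                a polymorphic or uncertain tip such as {01} is {0, 1}.
    Returns (state set at the root, minimum number of state changes).
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):          # a tip: return its scored states
            return set(tip_states[node])
        left, right = (visit(child) for child in node)
        shared = left & right
        if shared:                         # children agree: no change needed
            return shared
        changes += 1                       # children disagree: one change
        return left | right

    root_states = visit(tree)
    return root_states, changes


# Hypothetical example: state 1 (absent) is scattered across the tree.
tree = ((("taxon_A", "taxon_B"), "taxon_C"), "taxon_D")
tip_states = {
    "taxon_A": {1},
    "taxon_B": {0},
    "taxon_C": {1},
    "taxon_D": {0},
}
root_states, n_changes = fitch(tree, tip_states)
print(root_states, n_changes)
```

      In this toy case the root state is ambiguous and two changes are required, so state 1 could not be read as an unambiguous synapomorphy of any clade on the tree without the optimization step.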

      458: prezygapophyses.

      This has been corrected.

      516: "[...]".

      This has been corrected.

      566: synapomorphies.

      This has been corrected.

      587: Macrocnemus.

      This has been corrected.

      585: I strongly recommend either taking off and nuking the name Reptilia from orbit (like Pisces) or using it the way it is defined in Phylonyms, namely as the crown group (a subset of Neodiapsida). Either would mean replacing "neodiapsid reptiles" with "neodiapsids".

      This has been corrected to “neodiapsids.”

      625: Replace "inclusive clades" by "included clades", "component clades", "subclades", or "parts," for example.

      This has been kept as is because “inclusive clades” is common terminology and is used extensively in, for example, the PhyloCode. 

      659: Please update.

      References are updated.

      Fig. 8: Typo in Puercosuchus.

      This has been corrected.

      (4) Comments on style and spelling

      You inconsistently use the past and the present tense to describe [13, 19], sometimes both in the same sentence (e.g., lines 323 vs. 325). I recommend speaking of published papers in the past tense to avoid ascribing past views and acts to people in their present state.

      This has been corrected to be more consistent throughout the manuscript.

      48: Remove the second comma.

      This has been corrected.

      91: Replace "[13] and WEA24" by "[13, 19]".

      This has been corrected.

      100: Commas on both sides of "in fact" or on neither

      This has been corrected.

      117: I recommend "the interpretation in [19]". I have nothing against the abbreviation "WEA24", but you haven't defined it, and it seems like a remnant of incomplete editing. - That said, eLife does not impose a format on such things. If you prefer, you can just bring citation by author & year back; in that case, this kind of abbreviation would make perfect sense (though it should still be explicitly defined).

      129, 145: Likewise.

      We have modified this to [13] and [19] where necessary.

      192-198: Surely this should be made part of the paragraph in lines 158-175, which has the exact same headline?

      This has been corrected.

      200-206: Surely this should be made part of the paragraph in lines 148-156, which has the exact same headline?

      These sections deal with different issues pertaining to the analyses of Whiteside et al. (2024), and so we have kept the organization as is.

      214: Delete "that".

      This has been deleted.

      312: "Vomer" isn't an adjective; I'd write "main vomer body" or "vomer's main body" or "main body of the vomer".

      This has been corrected.

      350: "figured"

      This has been corrected.

      400: Rather, "rearticulated" or "worked to rearticulate"? - And why "several"? Just write "two". "Several" implies larger numbers.

      These issues have been corrected.

      448, 500: As which? As what kind of feature? I'm aware that "as such" is fairly widely used for "therefore", but it still confuses me every time, and I have to suspect I'm not the only one. I recommend "therefore" or "for this reason" if that is what you mean.

      “As such” has been deleted.

      452: Adobe Reader doesn't let me check, but I think you have two spaces after "of".

      This has been corrected.

      514, 539, 546, 552, 588, Fig. 3, 5, 6, Table 1: "WEA24" strikes again.

      This has been corrected.

      515: Remove the parentheses.

      This has been corrected.

      531: Insert a space after the period.

      This has been corrected.

      532: Remove both commas and the second "that".

      This has been corrected.

      538: Remove the comma.

      This has been kept as is because changing it would render the sentence grammatically incorrect.

      545: "[...]" or, better, nothing.

      This has been corrected.

      547: Spaces on both sides of the dash or on neither (as in line 553).

      This has been corrected.

      552: Rather, "conducted a parsimony analysis".

      This has been corrected.

      556: Space after "[19]".

      This has been corrected.

      560: Comma after "narrow".

      This has been corrected.

      600: Comma after "above" to match the one in the preceding line - there's an insertion in the sentence that must be flanked by commas on both sides.

      This has been corrected.

      603: Compound adjectives like "alpha-taxonomic" need a hyphen to avoid tripping readers up.

      This has been corrected.

      612: Similarly, "ancestral-state reconstruction" needs one to make immediately clear it isn't a state reconstruction that is ancestral but a reconstruction of ancestral states.

      This has been corrected.

      613: If you want to keep this comma, you need to match it with another after "Cryptovaranoides" in line 611.

      We have kept this as is, because removing this comma would render the sentence grammatically incorrect.

      615: Likewise, you need a comma after "and" because "except for a few features" is an insertion. The other comma is actually optional; it depends on how much emphasis you want to place on what comes after it.

      This has been added.

      622: Comma after "[48, 49]".

      This has been added.

      672: Missing italics and two missing spaces.

      This has been corrected.

      678, 680-681, 693, 700-701, 734, 742, 747, 788, 797, 799, 803, 808, 810-811, 814, 817, 820, 823, 828, 841, 843: Missing italics.

      This has been corrected.

      683, 689: These are book chapters. Cite them accordingly.

      This has been corrected.

      737: Missing DOI.

      No DOI is available.

      793: Missing Bolosaurus major; and I'd rather cite it as "2024" than "in press", and "online early" instead of "n/a".

      This has been corrected.

      835: Hoffstetter, RJ?

      This has been corrected.

      836: Is there something missing?

      This has been corrected.

      839: This is the same reference as number 20 (lines 683-684), and it is miscited in a different way...!

      This has been corrected.

      Reviewer #2 (Recommendations for the authors):

      (1) There is a brief mention of a phylogenetic analysis being re-run, but it is unclear if any modifications (changes in scoring) based on the very observations were made. Please state this explicitly.

      This is explained from lines 600-622, P.20-21, in the section “Apomorphic characters not empirically obtained.”  "In order to check the characters listed by Whiteside et al. [19] (p.19) as “two diagnostic characters” and “eight synapomorphies” in support of a squamate identity for †Cryptovaranoides, we conducted a parsimony analysis of the revised version of the dataset [32] provided by Whiteside et al. [19] in TNT v 1.5 [91]. We used Whiteside et al.’s [19] own data version"

      (2) Line 20: There is almost no discussion of non‑lepidosaur lepidosauromorphs. I suggest including this, as the archosauromorph‑like features reported in Cryptovaranoides appear rather plastic. Furthermore, diagnostic features of Archosauromorpha in other datasets (e.g., Ezcurra 2016 or the works of Spiekman) are notably absent (and unsampled) in Cryptovaranoides. Expanding this comparison would greatly strengthen the manuscript.

      The brief discussion (although not absent) of non-lepidosaur lepidosauromorphs is largely a function of the poor fossil record of this grade. But where necessary, we do discuss these taxa. Also see our previous study (Brownstein et al. 2023) for an extensive discussion of characters relevant to archosauromorphs.

      (3) Line 38: I suggest removing "Archosauromorpha" from the keywords. The authors make a compelling case that Cryptovaranoides is not a squamate, yet they do not fully test its placement within Archosauromorpha (as they acknowledge). Perhaps use "Reptilia" instead?

      We have removed this keyword.

      (4) Line 99: The authors' points here are well made and largely valid. The presence of the ent‑ and ectepicondylar foramina is indeed an amniote plesiomorphy and cannot confirm a squamate identity. Their absence, however, can be informative - although it is unclear whether the CT scans of the humerus are of sufficient resolution, and Figure 4 of Brownstein et al. looks hastily reconstructed (perhaps owing to limited resolution). Moreover, the foramina illustrated by Whiteside do resemble those of other reptiles, albeit possibly over‑prepared and exaggerated.

      The issue with the noted figure is indeed due to poor resolution from the scans. Although we agree with the reviewer, we hesitate to talk about absence in this taxon being phylogenetically informative given the confounding influence of ontogeny.

      (5) I encourage the authors to provide slice data to support the claim that the foramina are absent (which could certainly be correct!); otherwise, the assertion remains unsubstantiated.

      We only have access to the mesh files of segmented bones, not the raw (reconstructed slice) data.

      (6) PLEASE NOTE - because the specimen is juvenile, the apparent absence of the ectepicondylar foramen is equivocal: the supinator process develops through ontogeny and encloses this foramen (see Buffa et al. 2025 on Thadeosaurus, for example).

      See above.

      (7) Line 122: Italicize 'Delorhynchus'

      This has been corrected.

      (8) Lines 131‑132: I'd suggest deleting the final sentence; it feels a little condescending, and your argument is already persuasive.

      This has been corrected.

      (9) Line 129: Please note that owenettid "parareptiles" also lack this process, as do several other stem‑saurians. Its absence is therefore not diagnostic of Squamata. Also: Such plasticity is common outside the crown. Milleropsis and Younginidae develop this process during ontogeny, even though a lower temporal bar never fully forms.

      We appreciate this point. See discussion later in the manuscript.

      (11) Line 172: Consider adding ontogeny alongside taphonomy and preservation. A juvenile would likely have a poorly developed radial condyle, if any. Acknowledging this possibility will add some needed nuance.

      This sentence has been modified, but we have not added in discussion of ontogeny here because it is not immediately relevant to refuting the argument about inference of the presence of this feature when it is not preserved.

      (12) Line 177: The "septomaxilla" in Whiteside et al. (2024, Figure 1C) resembles the contralateral premaxilla in dorsal view, with the maxillary process on the left and the palatal (or vomerine) process on the right (the dorsal process appears eroded). The foramen looks like a prepalatal foramen, common to many stem and crown reptiles. Consequently, scoring the septomaxilla as absent may be premature; this bone often ossifies late. In my experience with stem‑reptile aggregations, only one of several articulated individuals may ossify this element.

      We agree that presence of a late-ossifying septomaxilla cannot be ruled out, but our point remains (and in agreement with Referee) that scoring the septomaxilla as present based on the amorphous fragments is premature.

      (13) Line 200: Tomography data should be shown before citing it. The posterior margin of the maxilla appears rather straight, and the maxilla itself is tall for an archosauromorph. It would be more convincing to score this feature as present only after illustrating the relevant slices - and, as you note, the trait is widespread among non‑archosauromorphs.

      See above and Brownstein et al. (2023).

      (14) Line 208: Well argued: how could Whiteside et al. confidently assign a disarticulated element? Their "vagus" foramen actually resembles a standard hypoglossal foramen - identical to that seen in many stem reptiles, which often have one large and one small opening.

      Thank you!

      (15) Line 248: Again, please illustrate this region. One cannot argue for absence without showing the slice data. Note that millerettids and procolophonians - contemporaneous with Cryptovaranoides - possess an enclosed vidian canal, so the feature is broadly distributed.

      See above.

      (16) Line 258: The choanal fossa is intriguing: originally created for squamate matrices, yet present (to varying degrees) in nearly every reptile I have examined. It is strongly developed in millerettids (see Jenkins et al. 2025 on Milleropsis and Milleretta) and younginids, much like in squamates - Tiago appropriately scores it as present. Thus, it may be more of a "Neodiapsida + millerettids" character. In any case, the feature likely forms an ordered cline rather than a simple binary state.

      We agree and look forward to future study of this feature.

      (17) Line 283: Bolosaurids are not diapsids and, per Simões, myself, and others, "Diapsida" is probably invalid, at least how it is used here. Better to say "neodiapsids" for choristoderes and "stem‑reptiles" or "sauropsids" for bolosaurids. Jenkins et al.'s placement is largely a function of misidentifying the bolosaurid stapes as the opisthotic.

      We are not entirely clear on this point since bolosaurids are not mentioned in this section.

      (18) Line 298: Here, you note that the CT scans are rather coarse, which makes some earlier statements about absence/presence less certain (e.g., humeral foramina). It may strengthen the paper to make fewer definitive claims where resolution limits interpretation.

      We appreciate this point. However, in the case of the humeral foramina, the coarseness of the scans is one reason why we question Whiteside et al.'s scoring of the presence of these features.

      (19) Line 314: Multiple rows of vomerine teeth are standard for amniotes; lepidosauromorphs such as Paliguana and Megachirella also exhibit them (though they may not have been segmented in the latter's description). Only a few groups (e.g., varanopids, some millerettids) have a single medial row.

      We appreciate this point and have included those citations in the following new sentence: “Multiple rows of vomerine teeth are common in reptiles outside of Squamata [76]; the presence of only one row is restricted to a handful of clades, including millerettids [77,78], †Tanystropheus [49], and some [79], but not all [71,80] choristoderes.” (L. 360-363, P. 12).

      (20) Line 317: This is likely a reptile plesiomorphy - present in all millerettids (e.g., Milleropsis and Milleretta per Jenkins et al.). Citing these examples would clarify that it is not uniquely squamate. Could it be secondarily lost in archosauromorphs?

      We appreciate this point and have cited Jenkins et al. here. It is out of the scope of this discussion to discuss the polarity of this feature relative to Archosauromorpha.

      (21) Line 336: Unfortunately, a distinct quadratojugal facet is usually absent in Neodiapsids and millerettids; where present, the quadratojugal is reduced and simply overlaps the quadrate.

      We appreciate this point but feel that reviewing the distribution of this feature across all reptiles is not relevant to the text noted.

      (22) Line 357: Pterygoid‑quadrate overlap is likely a tetrapod plesiomorphy. Whiteside et al. do not define its functional or phylogenetic significance, and the overlap length is highly variable even among sister taxa.

      We agree, but in any case this feature is impossible to assess in Cryptovaranoides.

      (23) Line 365: Another well‑written section - clear and persuasive.

      Thank you!

      (24) Line 385: The cephalic condyle is widespread among neodiapsids, so it is not uniquely squamate.

      We agree.

      (25) Character 391: Note that the frontal underlapping the parietal is widespread, appearing in both millerettids and neodiapsids such as Youngina.

      We appreciate this point, but the point here deals with the fact that this feature is not observable in the holotype of Cryptovaranoides.

      (26) Line 415: The "anterior process" is actually common among crown reptiles, including sauropterygians, so it cannot by itself place Cryptovaranoides within Archosauromorpha.

      We agree but also note that we do not claim this feature unambiguously unites Cryptovaranoides with Archosauromorpha.

      (28) Line 460: Yes - Whiteside et al. appear to have relabeled the standard amniote coracoid foramen. Excellent discussion.

      Thank you!

      (29) Line 496: While mirroring Whiteside's structure, discussing this mandibular character earlier, before the postcrania, might aid readability.

      We have chosen to keep this structure as is.

      (30) Lines 486-588: This section oversimplifies the quadrate articulation.

      We are unclear how this is an oversimplification.

      (31) Both Prolacerta and Macrocnemus possess a cephalic condyle and some mobility (though less than many squamates). In Prolacerta (Miedema et al. 2020, Figure 4), the squamosal posteroventral process loosely overlaps the quadrate head.

      We assume this comment refers to the section "Peg-in-notch articulation of quadrate head"; we appreciate the clarification that this feature occurs to a variable extent outside squamates, but this does not affect our statement that the material of Cryptovaranoides is too poorly preserved to confirm its presence.

      (32) Where is this process in Cryptovaranoides? It is not evident in Whiteside's segmentation of the slender squamosal - please illustrate.

      We are unclear as to which section this comment refers.

      (33) Additionally, the quadrate "conch" of Cryptovaranoides is well developed, bearing lateral and medial tympanic crests; the lateral crest is absent in the cited archosauromorphs.

      We note that no vertebrate has a medial tympanic crest (it is always laterally placed for the tympanic membrane, when present). If this is what the reviewer refers to, this is a feature commonly found across all tetrapods bearing a tympanum attached to the quadrate (e.g., most reptiles), and so it is not very relevant phylogenetically. Regarding its presence in Cryptovaranoides, the lateral margin of the quadrate is broken (Brownstein et al., 2023), so it cannot be determined. This incomplete preservation also makes an interpretation of a quadrate conch very hard to determine. But as currently preserved, there is no evidence whatsoever for this feature.

      (34) Line 591: The cervical vertebrae of Cryptovaranoides are not archosauromorph‑like. Archosauromorph cervicals are elongate, parallelogram‑shaped, and carry long cervical ribs-none of which apply here. As the manuscript lacks a phylogenetic analysis, including these features seems unnecessary. Should they be added to other datasets, I suspect Cryptovaranoides would align along the lepidosaur stem (though that remains to be tested).

      We politely disagree. The reviewer here mentions that the cervical vertebrae of archosauromorphs are generally shaped differently from those in Cryptovaranoides. The description provided (“elongate, parallelogram‑shaped, and carry long cervical ribs-none”) is basically limited to protorosaurians (e.g., tanystropheids, Macrocnemus) and early archosauriforms. We note that archosauromorph cervicals are notoriously variable in shape, especially in the crown, but also among early archosauromorphs. Further, the cervical ribs are notoriously similar among early archosauromorphs (including protorosaurians) and Cryptovaranoides, as discussed and illustrated in Brownstein et al., 2023 (Figs. 2 and 3), especially concerning the presence of the anterior process.

      Further, we do include a phylogenetic analysis of the matrix provided in Whiteside et al. (2024) as noted in our results section. In any case, we direct the reviewer to our previous study (Brownstein et al., 2023), in which we conduct phylogenetic analyses that included characters relevant to this note.

      Reviewer #3 (Recommendations for the authors):

      (1) The authors should use specimen numbers all over the text because we are talking about multiple individuals, and the authors contest the previous affinity of some of them. For example, on page 16, line 447, they mention an isolated vertebra but without any number. The specimen can be identified in the referenced article, but it would be much easier for the reader if the number were also provided here.

      Agreed and added.

      (2) Abstract: "Our team questioned this identification and instead suggested Cryptovaranoides had unclear affinities to living reptiles."

      That is very imprecise. The team suggested that it could be an archosauromorph or an indeterminate neodiapsid. Please change accordingly.

      We politely disagree. We stated in our 2023 study that whereas our phylogenetic analyses place this taxon in Archosauromorpha, it remains unclear where it would belong within the latter. This is compatible with “unclear affinities to living reptiles”.

      (3) Page 7, line 172: "Taphonomy and poor preservation cannot be used to infer the presence of an anatomical feature that is absent." Unfortunate wording. Taphonomy always has to be used to infer the presence or absence of anatomical features. Sometimes the feature is not preserved, but it leaves imprints/chemical traces or other taphonomic indicators that it was present in the organism. Please remove or rewrite the sentence.

      We agree and have modified the sentence to read: “Taphonomy and poor preservation cannot be used alone to justify the inference that an anatomical feature was present when it is not preserved and there is no evidence of postmortem damage. In a situation when the absence of a feature is potentially ascribable to preservation, its presence should be considered ambiguous.” (L. 141-145, P.5).

      (4) Page 4, line 91, please explain "WEA24" here, though it is unclear why this abbreviation is used instead of citation in the manuscript.

      This has been corrected to Whiteside et al. [19].

      (5) Page 6, line 144: "Together, these observations suggest that the presence of a jugal posterior process was incorrectly scored in the datasets used by WEA24 (type (ii) error)." That sentence is unclear. Why did the authors use "suggest"? Does it mean that they did not have access to the original data matrix to check it? If so, it should be clearly stated at the beginning of the manuscript.

      See earlier; this has been modified and “suggest” has been removed.

      (6) Page 7, line 174: "Finally, even in the case of the isolated humerus with a preserved capitulum, the condyle illustrated by Whiteside et al. [19] is fairly small compared to even the earliest known pan-squamates, such as Megachirella wachtleri (Figure 4)." Figure 4 does not show any humeri. Please correct.

      The reference to figure 4 has been removed.

      (7) Page 8, line 195-198: "This is not the condition specified in either of the morphological character sets that they cite [18,38], the presence of a distinct condyle that is expanded and is by their own description not homologous to the condition in other squamates." This is a bit unclear. Could the authors explain it a little bit further? How is the condition that is specified in the referred papers different compared to the Whiteside et al. description?

      We appreciate this comment and have broken this sentence up into three sentences to clarify what we mean:

      “The projection of the radial condyle above the adjacent region of the distal anterior extremity is not the condition specified in either of the morphological character sets that Whiteside et al. [19] cite [18,32]. The condition specified in those studies is the presence of a distinct condyle that is expanded. The feature described in Whiteside et al. [19] does not correspond to the character scored in the phylogenetic datasets.” (L.220-225, P.8).

      (8) Page 16, line 446: "they observed in isolated vertebrae that they again refer to C. microlanius without justification". That is not true. The referred paper explains the attribution of these vertebrae to Cryptovaranoides (see section 5.3 therein). The authors do not have to agree with that justification, but they cannot claim that no justification was made. Please correct it here and throughout the text.

      We have modified this sentence but note that the justification in Whiteside et al. (2024) lacked rigor. Whiteside et al. (2024) state: “Brownstein et al. [5] contested the affinities of three vertebrae, cervical vertebra NHMUK PV R37276, dorsal vertebra NHMUK PV R37277 and sacral vertebra NHMUK PV R37275. While all three are amphicoelous and not notochordal, the first two can be directly compared to the holotype. Cervical vertebra NHMUK PV R37276 is of the same form as the holotype CV3 with matching neural spine, ventral keel (=crest) and the posterior lateral ridges or lamina (figure 3c,d) shown by Brownstein et al. [5, fig. 1a]. The difference is that NHMUK PV R37276 has a fused neural arch to the pleurocentrum and a synapophysis rather than separate diapophysis and parapophysis of the juvenile holotype (figure 3c). Neurocentral fusion of the neural arch and centrum can occur late in modern squamates, ‘up to 82% of the species maximum size’ [28].

      The dorsal surface of dorsal vertebra NHMUK PV R37277 (figure 3e) can be matched to the mid-dorsal vertebra in the †Cryptovaranoides holotype (figure 4d, dor.ve) and has the same morphology of wide, dorsally and outwardly directed, prezygapophyses, downwardly directed postzygapophyses and similar neural spine. It is also of similar proportions to the holotype when viewed dorsally (figures 3e and 4d), both being about 1.2 times longer anteroposteriorly than they are wide, measured across the posterior margin. The image in figure 4d demonstrates that the posterior vertebrae are part of the same spinal column as the truncated proximal region but the spinal column between the two parts is missing, probably lost in quarrying or fossil collection.”

      This justification is based on pointing out the presence of supposed shared features between these isolated vertebrae and those in the holotype of Cryptovaranoides, even though none of these features are diagnostic for that taxon. We have changed the sentence in our manuscript to read:

      “Whiteside et al. [19] concur with Brownstein et al. [18] that the diapophyses and parapophyses are unfused in the anterior dorsals of the holotype of †Cryptovaranoides microlanius, and restate that fusion of these structures is based on the condition they observed in isolated vertebrae that they refer to †C. microlanius based on general morphological similarity and without reference to diagnostic characters of †C. microlanius” (L. 502-507, P. 17).

      (9) Figure 2. The figure caption lacks some explanations. Please provide information about affinity (e.g., squamate/gekkotan), age, and locality of the taxa presented. Are these left or right palatines? The second one seems to be incomplete, and maybe it is worth replacing it with something else?

      The figure caption has been modified:

      “Figure 2. Comparison of palatine morphologies. Blue shading indicates choanal fossa. Top image of †Cryptovaranoides referred left palatine is from Whiteside et al. [19]. Middle is the left palatine of †Helioscopos dickersonae (Squamata: Pan-Gekkota) from the Late Jurassic Morrison Formation [62]. Bottom is the right palatine of †Eoscincus ornatus (Squamata: Pan-Scincoidea) from the Late Jurassic Morrison Formation [31].”

      (10) Figure 8. The abbreviations are not explained in the figure caption.

      These have been added.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1:

      The manuscript is significantly improved, as also indicated by Reviewer 2, with the 100% formation of the PHF and the additional experiments to elucidate on the potential mechanism by the PTMs. This is a great work.

      Reviewer #2:

      One (minor) issue I do still have is how confusingly the NMR data are presented. Although the authors revised Figure 6 and added labels to the HSQCs etc., this figure and its supplements are still very hard to understand. I think this can be easily fixed by highlighting in the figures and also figure captions which changes/differences the reader is supposed to appreciate and why. 

      We have added labelling to Figure 6 and extended the legends to its Supplements.

      After our first revision, the level of evidence in the eLife assessment was described as convincing. In our opinion, the results in this paper, which include 11 cryo-EM data sets and NMR experiments on 6 tau constructs among other data, provide a level of evidence that extends beyond the state-of-the-art in the field.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      Activation of thermogenesis by cold exposure and dietary protein restriction are two lifestyle changes that impact health in humans and lead to weight loss in model organisms - here, in mice. How these affect liver and adipose tissues has not been thoroughly investigated side by side. In mice, the authors show that the responses to methionine restriction and cold exposure are tissue-specific, while the effects on beige adipose are somewhat similar.

      Strengths: 

      The strength of the work is the comparative approach, using transcriptomics and bioinformatic analyses to investigate the tissue-specific impact. The work was performed in mouse models and is state-of-the-art. This represents an important resource for researchers in the field of protein restriction and thermogenesis. 

      Weaknesses: 

      The findings are descriptive, and the conclusions remain associative. The work is limited to mouse physiology, and the human implications have not been investigated yet.

      We thank Reviewer 1 for their thoughtful review and for highlighting the strength of our comparative, tissue-specific analyses. We acknowledge that our study is descriptive and limited to mouse physiology, and agree that translation to humans will be an important next step. By making these data broadly accessible, we aim to provide a useful resource for future mechanistic and translational studies on dietary amino acid restriction and thermogenesis.

      Reviewer #2 (Public review): 

      Summary: 

      This study provides a library of RNA sequencing analysis from brown fat, liver, and white fat of mice treated with two stressors - cold challenge and methionine restriction - alone and in combination (interaction between diet and temperature). They characterize the physiologic response of the mice to the stressors, including effects on weight, food intake, and metabolism. This paper provides evidence that while both stressors increase energy expenditure, there are complex tissue-specific responses in gene expression, with additive, synergistic, and antagonistic responses seen in different tissues.

      Strengths: 

      The study design and implementation are solid and well-controlled. Their writing is clear and concise. The authors do an admirable job of distilling the complex transcriptome data into digestible information for presentation in the paper. Most importantly, they do not overreach in their interpretation of their genomic data, keeping their conclusions appropriately tied to the data presented. The discussion is well thought out and addresses some interesting points raised by their results.

      Weaknesses: 

      The major weakness of the paper is the almost complete reliance on RNA sequencing data, but it is presented as a transcriptomic resource.

      We thank Reviewer 2 for their positive evaluation of our study and for highlighting the strengths of our design, analyses, and interpretation. We acknowledge the limitation of relying primarily on RNA-seq, and emphasize that our intent was to provide a comprehensive transcriptomic resource to guide future mechanistic work by the community.

      Reviewer #3 (Public review): 

      Summary: 

      Ruppert et al. present a well-designed 2×2 factorial study directly comparing methionine restriction (MetR) and cold exposure (CE) across liver, iBAT, iWAT, and eWAT, integrating physiology with tissue-resolved RNA-seq. This approach allows a rigorous assessment of where dietary and environmental stimuli act additively, synergistically, or antagonistically. Physiologically, MetR progressively increases energy expenditure (EE) at 22 °C and lowers RER, indicating a lipid utilization bias. By contrast, a 24-hour 4 °C challenge elevates EE across all groups and eliminates MetR-Ctrl differences. Notably, changes in food intake and activity do not explain the MetR effect at room temperature.

      Strengths: 

      The data convincingly support the central claim: MetR enhances EE and shifts fuel preference to lipids at thermoneutrality, while CE drives robust EE increases regardless of diet and attenuates MetR-driven differences. Transcriptomic analysis reveals tissue-specific responses, with additive signatures in iWAT and CE-dominant effects in iBAT. The inclusion of explicit diet×temperature interaction modeling and GSEA provides a valuable transcriptomic resource for the field.

      Weaknesses: 

      Limitations include the short intervention windows (7 d MetR, 24 h CE), use of male-only cohorts, and reliance on transcriptomics without complementary proteomic, metabolomic, or functional validation. Greater mechanistic depth, especially at the level of WAT thermogenic function, would strengthen the conclusions.

      We thank Reviewer 3 for their thorough review and for recognizing the strengths of our factorial design, physiological assessments, and transcriptomic analyses. We acknowledge the limitations of short intervention windows, male-only cohorts, and the reliance on transcriptomics. Our aim was to generate a well-controlled comparative dataset as a resource, and we agree that future work incorporating longer interventions, both sexes, and additional mechanistic layers will be important to build on these findings.

      Reviewer #1 (Recommendations for the authors): 

      In my opinion, the comparative analysis between tissues and treatments could be expanded.

      We thank the reviewer for this suggestion. We included top 30 DEG heatmaps for the comparison MetR_CEvsCtrl_RT for up- and downregulated genes in the figures for each tissue. We also provide additional data in the supplementary material, including top 30 heatmaps for Ctrl_CEvsCtrl_RT, MetR_RTvsCtrl_RT, and the interaction term, as well as one Excel sheet per tissue for all DEGs (p<0.05 and FC +/- 1.5) and for all gene sets (GSEA).
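
      As an illustration of the DEG criterion quoted above, the following minimal sketch shows one way such a filter could be expressed. It is not the authors' pipeline. It assumes that "FC +/- 1.5" means at least a 1.5-fold change in either direction on a linear scale, and it leaves open whether the p-value is raw or adjusted, since the response does not say. The function name `is_deg` and its parameters are hypothetical.

```python
import math


def is_deg(p_value, fold_change, p_cutoff=0.05, fc_cutoff=1.5):
    """Return True if a gene passes both thresholds.

    Assumptions (not stated in the source): fold_change is a linear
    ratio (treatment / control) and "FC +/- 1.5" is symmetric, i.e.
    at least 1.5-fold up or at least 1.5-fold down.
    """
    if fold_change <= 0:
        raise ValueError("fold_change must be a positive ratio")
    # Compare on the log2 scale so up- and downregulation are symmetric.
    abs_log2_fc = abs(math.log2(fold_change))
    return p_value < p_cutoff and abs_log2_fc >= math.log2(fc_cutoff)


# Hypothetical example values, for illustration only.
print(is_deg(0.01, 2.0))   # upregulated two-fold, significant
print(is_deg(0.01, 0.5))   # downregulated two-fold, significant
print(is_deg(0.01, 1.2))   # significant but below the fold-change cutoff
```

      Working on the log2 scale is a design choice: it treats a 1.5-fold decrease (ratio of about 0.67) the same way as a 1.5-fold increase.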

      Reviewer #3 (Recommendations for the authors): 

      (1) CE robustly increases food intake, yet MetR mice at room temperature, despite elevated EE, do not appear to increase feeding to maintain energy balance. The authors should discuss this discrepancy, as it represents an intriguing avenue for follow-up.

      See answer below.

      (2) CE raises EE to ~0.9 kcal/h irrespective of diet, suggesting that the additive weight loss seen with MetR+CE (Fig. 1H) must be due to reduced intake. This raises the possibility that MetR mice fail to appropriately sense negative energy balance, even under CE, and do not compensate with higher feeding. 

      We thank the reviewer for comments 1 and 2. We did not emphasize this finding, as the literature on the effects of sulfur amino acid restriction on food intake is very inconsistent. Initial studies (e.g. by the Gettys group) most often report food intake per gram of body weight and find an increase in caloric intake. We think that this reporting is flawed and that food intake should rather be reported as cumulative food intake. The recent paper by the Dixit group also reports no effect on food intake, in line with our data. The recent paper by the Nudler group reports a decrease in food intake.

      (3) Report effect sizes and sample sizes alongside p-values in all figure panels, and ensure the GEO accession (currently listed as "GSEXXXXXX") is provided.

      We thank the reviewer for noticing this. So far, we have been unable to upload the datasets to GEO. We’re unable to connect to the NIH servers, presumably due to the US government shutdown. We are committed to sharing this dataset as soon as possible and will update the manuscript accordingly. We included the sample sizes for experiments 1 and 2 in the figure legends and described our outlier detection method in the methods section. Significances are explained in the figure legends.

      (4) Explicitly define the criteria for "additive," "synergistic," and "antagonistic" interactions (both at the gene and pathway levels) to help readers align the text with the figures.

      We thank the reviewer for this helpful comment. We added a description of how we defined and computed the regulatory logic in the methods section.
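
      For readers unfamiliar with these terms, the sketch below shows one common convention for labelling a 2×2 interaction. It is an assumption for illustration and is not the definition the authors added to their methods section, which is not reproduced in this response. Effects are taken to be log2 fold changes versus the control group at room temperature. The function name `classify_interaction` and the tolerance `tol` are hypothetical.

```python
def classify_interaction(effect_diet, effect_cold, effect_combined, tol=0.25):
    """Label the combined effect of two stimuli on one gene.

    All effects are log2 fold changes versus the shared control.
    Assumed convention (not taken from the source):
      - additive: combined effect is within `tol` of the sum of the
        two single effects;
      - synergistic: combined effect exceeds that sum in the same
        direction;
      - antagonistic: combined effect falls short of, or opposes,
        that sum.
    """
    expected = effect_diet + effect_cold
    deviation = effect_combined - expected
    if abs(deviation) <= tol:
        return "additive"
    # If neither stimulus alone moves the gene, any combined effect
    # is treated here as synergistic.
    if expected == 0 or deviation * expected > 0:
        return "synergistic"
    return "antagonistic"


# Hypothetical example values, for illustration only.
print(classify_interaction(1.0, 1.0, 2.1))
print(classify_interaction(1.0, 1.0, 3.0))
print(classify_interaction(1.0, 1.0, 0.5))
```

      In a linear model with a diet×temperature term, `deviation` corresponds to the interaction coefficient, so the three labels amount to asking whether that coefficient is near zero, reinforces the main effects, or counteracts them.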

      (5) Revise the introduction to address recent data from the Dixit group (ref. #38), which shows that EE induced by cysteine restriction and weight loss is independent of FGF21 and UCP1. As written, the introduction states: "Recent studies have shown that DIT via dietary MetR augments energy expenditure in a UCP1-dependent...fashion". 

      See answer below.

      (6) "Mechanistically, MetR...results in secretion of FGF21. In turn, FGF21 augments EE by activating UCP1-driven thermogenesis in brown adipose tissue via β-adrenergic signaling (4,7)." This should be updated for accuracy and balance.

      We thank the reviewers for comments 5 and 6. Both recent publications by the Dixit and the Nudler groups (now ref 9 and 10) provide very interesting further mechanistic detail on the bodyweight loss in response to dietary sulfur amino acid restriction. However, there are also older papers by the Gettys group that in part contradict their findings, particularly when it comes to the importance of UCP1 for the adaptation to sulfur amino acid restriction. Overall, we think that further work is required to distinguish the contribution of UCP1-driven EE from alternative mechanisms that ultimately drive body and fat mass loss. We rewrote the referenced paragraph in the introduction to reflect this.

    1. Author response:

      We wish to thank the reviewers and the editors for their careful evaluation of our article and for their valuable input that we will embrace to strengthen our article. We will still respond in full when we have had time to perform further analyses, which we anticipate will corroborate our main conclusions and make our article more comprehensive. 

      For now, we provide a provisional response to the major points brought forward by both the editorial summary and the public reviews. As we understood, the two main points that were raised regard: (1) the novelty and, accordingly, the theoretical importance of our work and (2) the (in)completeness of our results. We provide our provisional response to both of these points below.

      Novelty and theoretical relevance of the work

      Regarding the novelty of our work, we believe the reviews (and, by extension, the editorial summary) underappreciated the main theoretical value of the question we addressed. Our work set out to investigate whether microsaccades track covert attentional shifting, attentional maintenance, or both. We fully recognise that there are ample prior studies that investigated and reported a link between microsaccades and covert attention, but also underscore that other studies report seemingly contradictory evidence, finding no such link. One such example is a recent high-profile paper by Willett & Mayo in PNAS (2023). Prompted by the recent hypothesis that this seemingly conflicting evidence may be due to prior work investigating attention ‘in different stages’ (van Ede, PNAS, 2023), we set out to address precisely this using a dedicated task that we designed for this purpose. As acknowledged by the summary and public reviews, this helps to reconcile seemingly opposing views in the literature. In our view, such reconciliation has substantial theoretical value.

      While we appreciate that our reported insights may resonate and appear plausible to those working on this topic, we are not aware of any prior studies that directly addressed whether the link between covert attention and microsaccades may fundamentally depend on the ‘stage’ of attentional deployment (‘shift’ vs. ‘maintain’). 

      To fill this key gap and address this timely issue, we developed a dedicated experiment designed to evaluate the relationship between microsaccades and the different stages of attention within a single paradigm. We did so by varying the cue-target intervals to uniquely incentivise early shifting (by having short intervals), while also being able to assess microsaccade biases during subsequent maintenance (in the longer trials). To our knowledge, no previous task has jointly examined these components in this manner. Moreover, our inclusion of two widely adopted approaches to fixational control provides yet another source of novelty. Together, we believe that these features position our work as a substantive advance that reconciles seemingly opposing theoretical views.

      Completeness of results

      Regarding the completeness of our results, the editorial summary points to “the absence of independent measures, single-trial analyses, and neutral-condition controls needed to substantiate the central claims”. In our view, while the raised points are valuable, they pertain to issues that are tangential to our primary question and stem from misunderstandings of key analytical choices. We consider our results complete and comprehensive with regards to the main question our studies set out to answer. We briefly clarify each of the raised points below, and will respond more elaborately as part of our forthcoming revision.

      First, regarding the portrayed “need” for independent measures to define the ‘shift window’ of interest, we wish to clarify that our main analysis is completely agnostic to predetermined time windows, as we employ a cluster-based permutation approach to assess our rich time-resolved data across the full time axis. For the complementary analyses that address the ‘shift’ and ‘maintain’ windows more directly, we use a priori defined windows taken from prior work (studies of microsaccade biases, as well as studies of the time course of top-down attention using SOA manipulations). Accordingly, even these ‘zoomed in’ analyses rely on time windows that are empirically grounded in ample prior research.
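
      To make concrete why a cluster-based permutation test does not require a predefined window, the following is a minimal sketch of one common variant, not the analysis code used in the study. It assumes a subjects-by-time matrix of bias scores tested against zero, a sign-flip permutation scheme, a cluster-forming threshold of |t| > 2, and simulated data; all function names and parameter values are hypothetical.

```python
import math
import random

def t_values(data):
    """One-sample t statistic at each time point; data is subjects x time."""
    n = len(data)
    out = []
    for col in zip(*data):
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / (n - 1)
        out.append(mean / math.sqrt(var / n) if var > 0 else 0.0)
    return out

def max_cluster_mass(tvals, t_crit):
    """Largest summed |t| over runs of adjacent, same-sign supra-threshold points."""
    best = mass = 0.0
    prev_sign = 0
    for t in tvals:
        sign = (t > t_crit) - (t < -t_crit)
        if sign != 0 and sign == prev_sign:
            mass += abs(t)      # extend the current cluster
        elif sign != 0:
            mass = abs(t)       # start a new cluster
        else:
            mass = 0.0          # below threshold: no cluster
        prev_sign = sign
        best = max(best, mass)
    return best

def cluster_permutation_p(data, t_crit=2.0, n_perm=500, seed=0):
    """Compare the observed largest cluster with a sign-flip null distribution."""
    rng = random.Random(seed)
    observed = max_cluster_mass(t_values(data), t_crit)
    exceed = 0
    for _ in range(n_perm):
        flipped = []
        for row in data:
            s = rng.choice((-1.0, 1.0))  # flip each subject's whole time course
            flipped.append([s * x for x in row])
        if max_cluster_mass(t_values(flipped), t_crit) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)

# Simulated example (assumed values): 16 subjects, 60 time points,
# with a true effect confined to time points 20-39.
sim = random.Random(1)
data = [[sim.gauss(1.5 if 20 <= t < 40 else 0.0, 1.0) for t in range(60)]
        for _ in range(16)]
observed, p_value = cluster_permutation_p(data)
```

      Because the null distribution is built from the largest cluster anywhere on the time axis, the test controls for multiple comparisons across all time points without the analyst having to choose a window in advance.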

      Second, regarding the use of single-trial analyses, we want to emphasise that single-trial predictability is not where our theoretical question resides. We start from the perspective that the relationship between covert visual-spatial attention and microsaccades is inherently probabilistic. Our aim is not to address or question this. Rather, our aim is to determine whether this probabilistic relationship behaves similarly during attentional shifting and maintenance—an issue our analyses directly and appropriately address. In addition, we also explicitly discuss how the link between microsaccades and attention is fundamentally probabilistic at the single-trial level in our discussion, and prompted by the valuable feedback, we plan to expand on this important contextualisation as part of our revision.

      Finally, regarding the portrayed “need” for a neutral-attention control condition, we agree that inclusion of a neutral attention condition could be informative for disentangling the ‘benefits’ versus ‘costs’ of attentional cueing. However, such disambiguation is tangential to our central aim. Rather, our behavioural data primarily serve to verify attentional ‘allocation’ at later cue-target intervals. Observing a difference between valid and invalid cues suffices for this central aim. We also note that inclusion of a neutral condition would have reduced trial numbers and statistical power for our critical conditions of interest. Accordingly, we do not see this as a limitation that in any way challenges our main conclusions. Prompted by this reflection, during our revision we will ensure that we do not mention selective ‘benefits’ or ‘costs’ of our cueing manipulation, but refer to ‘the presence of an attentional modulation’ instead.

      Therefore, we believe that the explicit design and analysis choices that we made aligned with the theoretical aims of our study, and that our data provide a complete and coherent test of our central question. The raised points are valuable and we will leverage them to improve our article, but they do not render our findings “incomplete” (as currently portrayed) with regards to the key goal of our article.

      Future changes

      Naturally, we consider the feedback from the editors and the reviewers of great value, and we will incorporate their suggestions to further strengthen our article. Concretely, we plan to implement the following revisions:

      • In our introduction we plan to elaborate on the prior state of knowledge to provide a more complete context.

      • We plan to add precise clarifications throughout the paper, ranging from methodological details and methodological choices to interpretation of the results. This should increase the comprehensiveness and transparency of our article.

      •  We will run and incorporate the outcomes of various additional analyses that we anticipate will further substantiate our conclusions and provide a more comprehensive view of our data and key findings.

      We are confident that these revisions will enhance clarity and accessibility while reinforcing the theoretical contributions of the work.

      References

      Willett, S. M., & Mayo, P. J. (2023). Microsaccades are directed toward the midpoint between targets in a variably cued attention task. Proceedings of the National Academy of Sciences of the United States of America, 120(20). https://doi.org/10.1073/pnas.2220552120

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      The authors report intracranial EEG findings from 12 epilepsy patients performing an associative recognition memory task under the influence of scopolamine. They show that scopolamine administered before encoding disrupts hippocampal theta phenomena and reduces memory performance, and that scopolamine administered after encoding but before retrieval impairs hippocampal theta phenomena (theta power, theta phase reset) and neural reinstatement but does not impair memory performance. This is an important study with exciting, novel results and translational implications. The manuscript is well-written, the analyses are thorough and comprehensive, and the results seem robust.

      Strengths:

      (1) Very rare experimental design (intracranial neural recordings in humans coupled with pharmacological intervention).

      (2) Extensive analysis of different theta phenomena.

      (3) Well-established task with different conditions for familiarity versus recollection.

      (4) Clear presentation of findings and excellent figures.

      (5) Translational implications for diseases with cholinergic dysfunction (e.g., AD).

      (6) Findings challenge existing memory models, and the discussion presents interesting novel ideas.

      Weaknesses:

      (1) One of the most important results is the lack of memory impairment when scopolamine is administered after encoding but before retrieval (scopolamine block 2). The effect goes in the same direction as for scopolamine during encoding (p = 0.15). Could it be that this null effect is simply due to reduced statistical power (12 subjects with only one block per subject, while there are two blocks per subject for the condition with scopolamine during encoding), which may become significant with more patients? Is there actually an interaction effect indicating that memory impairment is significantly stronger when scopolamine is applied before encoding (Figure 1d)? Similar questions apply to familiarity versus recollection (lines 78-80). This is a very critical point that could alter major conclusions from this study, so more discussion/analysis of these aspects is needed. If there are no interaction effects, then the statements in lines 84-86 (and elsewhere) should be toned down.

      The reviewer highlights important concerns regarding the statistical power of the behavioral effects. We address these concerns in the revised manuscript in two ways: (1) we provide a supplemental analysis using a matched number of blocks between the placebo and scopolamine conditions to avoid statistical bias related to differing trial counts, and (2) we include a supplemental figure illustrating paired comparisons between blocks.

      (2) Further, could it simply be that scopolamine hadn't reached its major impact during retrieval after administration in block 2? Figure 2e speaks in favor of this possibility. I believe this is a critical limitation of the experimental design that should be discussed.

      The reviewer raises an important methodological concern regarding the time required for scopolamine's effect to manifest and the subsequent impact on the study outcomes. Previous studies report that the average time to maximum serum concentration after intravenous (IV) scopolamine administration is approximately 5 minutes (Renner et al., 2005), with the corresponding clinical onset estimated at 10 minutes. In our study, the retrieval period in Block 2 commenced at 15 ± 0.2 minutes post-injection across all subjects. Given this timing, there is sufficient reason to conclude that scopolamine had reached its major impact during the Block 2 retrieval phase. Furthermore, the observation of significant disruptions to theta oscillations during this same retrieval phase provides strong evidence that the drug was in full effect at that time.

      (3) It is not totally clear to me why slow theta was excluded from the reinstatement analysis. For example, despite an overall reduction in theta power, relative patterns may have been retained between encoding and recall. What are the results when using 1-128 Hz as input frequencies?

      Slow theta (2–4 Hz) was excluded from the reinstatement analysis to avoid potential confounding effects. Given the observed disruption to slow theta power following scopolamine administration, any subsequent changes in slow theta reinstatement would be causally ambiguous, potentially arising directly from the power effects. Therefore, we would be unable to determine whether changes in slow theta reinstatement were genuinely independent of changes in power.

      (4) In what way are the results affected by epileptic artifacts occurring during the task (in particular, IEDs)?

      To exclude abnormal events and interictal activity, a kurtosis threshold of 4 was applied to each trial, effectively filtering out segments exhibiting significant epileptic artifacts.
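
      As a minimal illustration of kurtosis-based trial rejection, the sketch below flags trials whose amplitude distribution is heavy-tailed, as happens when a trial contains a large transient. It is not the authors' pipeline: the function names, the choice of Pearson (non-excess) kurtosis, and the simulated signals are assumptions, and only the threshold of 4 is taken from the response above.

```python
import random

def kurtosis(samples):
    """Pearson (non-excess) kurtosis: fourth central moment / variance squared."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    return m4 / (m2 ** 2) if m2 > 0 else 0.0

def keep_trials(trials, threshold=4.0):
    """Return the indices of trials whose kurtosis does not exceed the threshold."""
    return [i for i, trial in enumerate(trials) if kurtosis(trial) <= threshold]

# Simulated example (assumed values): one clean Gaussian trial and a copy
# of it contaminated by a single large spike.
rng = random.Random(0)
clean = [rng.gauss(0.0, 1.0) for _ in range(2000)]
spiky = list(clean)
spiky[1000] = 40.0  # a brief, high-amplitude transient
kept = keep_trials([clean, spiky])
```

      A Gaussian signal has a Pearson kurtosis near 3, so a threshold of 4 leaves typical trials untouched while a single sharp transient raises the value far above it. Under the excess-kurtosis convention (Gaussian near 0) the same threshold would be considerably more lenient, which is why the convention is worth stating explicitly.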

      Reviewer #2 (Public review):

      Summary:

      In this study, performed in human patients, the authors aimed at dissecting out the role of cholinergic modulation in different types of memory (recollection-based vs familiarity and novelty-based) and during different memory phases (encoding and retrieval). Moreover, their goal was to obtain the electrophysiological signature of cholinergic modulation on network activity of the hippocampus and the entorhinal cortex.

      Strengths:

      The authors combined cognitive tasks and intracranial EEG recordings in neurosurgical epilepsy patients. The study confirms previous evidence regarding the deleterious effects of scopolamine, a muscarinic acetylcholine receptor antagonist, on memory performance when administered prior to the encoding phase of the task. During both encoding and retrieval phases, scopolamine disrupts the power of theta oscillations in terms of amplitude and phase synchronization. These results raise the question of the role of theta oscillations during retrieval and the meaning of scopolamine's effect on retrieval-associated theta rhythm without cognitive changes. The authors clearly discussed this issue in the Discussion section. A major point is the finding that the scopolamine-mediated effect is selective for recollection-based memory and not for familiarity- and novelty-based memory.

      The methodology used is powerful, and the data underwent a detailed and rigorous analysis.

      Weaknesses:

      A limited cohort of patients; the age of the patients is not specified in the table.

      To comply with human subject privacy protection policies, age was not reported; however, we did not find any significant effects of age on the behavioral or neural measures.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review):

      Introduction & Theory

      (1) It is difficult to appreciate why the first trial of extinction in a standard protocol does NOT produce the retrieval-extinction effect. This applies to the present study as well as others that have purported to show a retrieval-extinction effect. The importance of this point comes through at several places in the paper. E.g., the two groups in Study 1 experienced a different interval between the first and second CS extinction trials; and the results varied with this interval: a longer interval (10 min) ultimately resulted in less reinstatement of fear than a shorter interval. Even if the different pattern of results in these two groups was shown/known to imply two different processes, there is nothing in the present study that addresses what those processes might be. That is, while the authors talk about mechanisms of memory updating, there is little in the present study that permits any clear statement about mechanisms of memory. The references to a "short-term memory update" process do not help the reader to understand what is happening in the protocol.

      We agree with the reviewer that whether and how the retrieval-extinction paradigm works is still under debate. Our results provide another line of evidence that such a paradigm is effective in producing long-term fear amnesia. The focus of the current manuscript is to demonstrate that the retrieval-extinction paradigm can also facilitate a short-term fear memory deficit, as measured by SCR. Our TMS study provided preliminary evidence on the brain mechanisms involved, pointing to a causal relationship between dorsolateral prefrontal cortex (dlPFC) activity and the short-term fear amnesia. It showed that both the retrieval interval and intact dlPFC activity were necessary for the short-term fear memory deficit, and these were accordingly referred to as the “mechanism” for memory update. We acknowledge that the term “mechanism” might have different connotations for different researchers. We now more explicitly clarify what we mean by “mechanisms” in the manuscript (line 99) as follows:

      “In theory, different cognitive mechanisms underlying specific fear memory deficits, therefore, can be inferred based on the difference between memory deficits.”

      In reply to this point, the authors cite evidence to suggest that "an isolated presentation of the CS+ seems to be important in preventing the return of fear expression." They then note the following: "It has also been suggested that only when the old memory and new experience (through extinction) can be inferred to have been generated from the same underlying latent cause, the old memory can be successfully modified (Gershman et al., 2017). On the other hand, if the new experiences are believed to be generated by a different latent cause, then the old memory is less likely to be subject to modification. Therefore, the way the 1st and 2nd CS are temporally organized (retrieval-extinction or standard extinction) might affect how the latent cause is inferred and lead to different levels of fear expression from a theoretical perspective." This merely begs the question: why might an isolated presentation of the CS+ result in the subsequent extinction experiences being allocated to the same memory state as the initial conditioning experiences? This is not yet addressed in any way.

      As in our previous response, this manuscript is not about investigating the cognitive mechanism by which an isolated presentation of the CS+ would suppress fear expression in the long term. As the reviewer is aware, and as we have addressed in our previous response letters, both positive and negative evidence abound as to whether the retrieval-extinction paradigm can successfully suppress long-term fear expression. Previous research has described mechanisms instigated by the single CS+ retrieval at the molecular, cellular, and systems levels, as well as through cognitive processes in humans. In the current manuscript, we simply set out to test whether, in addition to long-term fear amnesia, the retrieval-extinction paradigm can also affect subjects’ short-term fear memory.

      (2) The discussion of memory suppression is potentially interesting but, in its present form, raises more questions than it answers. That is, memory suppression is invoked to explain a particular pattern of results but I, as the reader, have no sense of why a fear memory would be better suppressed shortly after the retrieval-extinction protocol compared to the standard extinction protocol; and why this suppression is NOT specific to the cue that had been subjected to the retrieval-extinction protocol.

      Memory suppression is the hypothesis we proposed to explain the results we obtained in the experiments. We discussed the possibility of memory suppression and listed the reasons why such a mechanism might be at work. As we mentioned in the manuscript, our findings are consistent with the memory suppression mechanism in at least two respects: 1) cue-independence and 2) thought-control ability dependence. We agree that the questions raised by the reviewer are interesting, but answering them would require a series of further experiments to disentangle the various variables and conceptual questions about the purpose of a phenomenon, which we are afraid is beyond the scope of the current manuscript. We refer the reviewer to the discussion section, where we consider memory suppression as a potential mechanism for the short-term amnesia we observed (lines 562-569), as follows:

      “Previous studies indicate that a suppression mechanism can be characterized by three distinct features: first, the memory suppression effect tends to emerge early, usually 10-30 mins after memory suppression practice and can be transient (MacLeod and Macrae, 2001; Saunders and MacLeod, 2002); second, the memory suppression practice seems to directly act upon the unwanted memory itself (Levy and Anderson, 2002), such that the presentation of other cues originally associated with the unwanted memory also fails in memory recall (cue-independence); third, the magnitude of memory suppression effects is associated with individual difference in control abilities over intrusive thoughts (Küpper et al., 2014).”

      (3) Relatedly, how does the retrieval-induced forgetting (which is referred to at various points throughout the paper) relate to the retrieval-extinction effect? The appeal to retrieval-induced forgetting as an apparent justification for aspects of the present study reinforces points 2 and 3 above. It is not uninteresting but lacks clarification/elaboration and, therefore, its relevance appears superficial at best.

      We brought up the topic of retrieval-induced forgetting (RIF) to stress the point that memory suppression can be unconscious. In a standard RIF paradigm, unlike the think/no-think paradigm, subjects are not explicitly told to suppress the non-target memories. However, to successfully retrieve the target memory, the cognitive system actively inhibits the non-target memories, effectively implementing a memory suppression mechanism (though unconsciously). Therefore, it is possible that our results might be explained by the memory suppression framework. We elaborated on this point in the discussion section (lines 578-584):

      “In our experiments, subjects were not explicitly instructed to suppress their fear expression, yet the retrieval-extinction training significantly decreased short-term fear expression. These results are consistent with the short-term amnesia induced with the more explicit suppression intervention (Anderson et al., 1994; Kindt and Soeter, 2018; Speer et al., 2021; Wang et al., 2021; Wells and Davies, 1994). It is worth noting that although consciously repelling unwanted memory is a standard approach in memory suppression paradigm, it is possible that the engagement of the suppression mechanism can be unconscious.”

      (4) I am glad that the authors have acknowledged the papers by Chalkia, van Oudenhove & Beckers (2020) and Chalkia et al (2020), which failed to replicate the effects of retrieval-extinction reported by Schiller et al in Reference 6. The authors have inserted the following text in the revised manuscript: "It should be noted that while our long-term amnesia results were consistent with the fear memory reconsolidation literature, there were also studies that failed to observe fear prevention (Chalkia, Schroyens, et al., 2020; Chalkia, Van Oudenhove, et al., 2020; Schroyens et al., 2023). Although the memory reconsolidation framework provides a viable explanation for the long-term amnesia, more evidence is required to validate the presence of reconsolidation, especially at the neurobiological level (Elsey et al., 2018). While it is beyond the scope of the current study to discuss the discrepancies between these studies, one possibility to reconcile these results concerns the procedure for the retrieval-extinction training. It has been shown that the eligibility for old memory to be updated is contingent on whether the old memory and new observations can be inferred to have been generated by the same latent cause (Gershman et al., 2017; Gershman and Niv, 2012). For example, prevention of the return of fear memory can be achieved through gradual extinction paradigm, which is thought to reduce the size of prediction errors to inhibit the formation of new latent causes (Gershman, Jones, et al., 2013). Therefore, the effectiveness of the retrieval-extinction paradigm might depend on the reliability of such paradigm in inferring the same underlying latent cause." Firstly, if it is beyond the scope of the present study to discuss the discrepancies between the present and past results, it is surely beyond the scope of the study to make any sort of reference to clinical implications!!!

      As we have clearly stated in our manuscript, this paper is not about discussing why some studies were or were not able to replicate the retrieval-extinction results originally reported by Schiller et al. 2010. Instead, we aimed to report a novel short-term fear amnesia through the retrieval-extinction paradigm, above and beyond the long-term amnesia reported before. Speculating about the clinical implications of these findings is unrelated to the long-term amnesia debate in the reconsolidation world. We now refer the reader to several perspectives and reviews that have proposed ways to resolve these discrepancies as follows (lines 642-673).

      Secondly, it is perfectly fine to state that "the effectiveness of the retrieval-extinction paradigm might depend on the reliability of such paradigm in inferring the same underlying latent cause..." This is not uninteresting, but it also isn't saying much. Minimally, I would expect some statement about factors that are likely to determine whether one is or isn't likely to see a retrieval-extinction effect, grounded in terms of this theory.

      Again, as we have responded many times, we simply do not know why some studies were able to suppress the fear expression using the retrieval-extinction paradigm and other studies weren’t. This is still an unresolved issue that the field is actively engaging with, and we now refer the reader to several papers dealing with this issue. However, this is NOT the focus of our manuscript. Having a healthy debate does not mean that every study using the retrieval-extinction paradigm must address the long-standing question of why the retrieval-extinction paradigm is effective (at least in some studies).

      Clarifications, Elaborations, Edits

      (5) Some parts of the paper are not easy to follow. Here are a few examples (though there are others):

      (a) In the abstract, the authors ask "whether memory retrieval facilitates update mechanisms other than memory reconsolidation"... but it is never made clear how memory retrieval could or should "facilitate" a memory update mechanism.

      We meant to state that the retrieval-extinction paradigm might have effects on fear memory, above and beyond the purported memory reconsolidation effect. Sentence modified (lines 25-26) as follows:

      “Memory reactivation renders consolidated memory fragile and thereby opens the window for memory updates, such as memory reconsolidation.”

      (b) The authors state the following: "Furthermore, memory reactivation also triggers fear memory reconsolidation and produces cue specific amnesia at a longer and separable timescale (Study 2, N = 79 adults)." Importantly, in study 2, the retrieval-extinction protocol produced a cue-specific disruption in responding when testing occurred 24 hours after the end of extinction. This result is interesting but cannot be easily inferred from the statement that begins "Furthermore..." That is, the results should be described in terms of the combined effects of retrieval and extinction, not in terms of memory reactivation alone; and the statement about memory reconsolidation is unnecessary. One can simply state that the retrieval-extinction protocol produced a cue-specific disruption in responding when testing occurred 24 hours after the end of extinction.

      The sentence the reviewer referred to was in our original manuscript submission but has since been modified based on the reviewer’s comments from the last round of revision. Please see the abstract (lines 30-35) of our revised manuscript from the last round of revision:

      “Furthermore, across different timescales, the memory retrieval-extinction paradigm triggers distinct types of fear amnesia in terms of cue-specificity and cognitive control dependence, suggesting that the short-term fear amnesia might be caused by different mechanisms from the cue-specific amnesia at a longer and separable timescale (Study 2, N = 79 adults).”

      (c) The authors also state that: "The temporal scale and cue-specificity results of the short-term fear amnesia are clearly dissociable from the amnesia related to memory reconsolidation, and suggest that memory retrieval and extinction training trigger distinct underlying memory update mechanisms." ***The pattern of results when testing occurred just minutes after the retrieval-extinction protocol was different to that obtained when testing occurred 24 hours after the protocol. Describing this in terms of temporal scale is unnecessary; and suggesting that memory retrieval and extinction trigger different memory update mechanisms is not obviously warranted. The results of interest are due to the combined effects of retrieval+extinction and there is no sense in which different memory update mechanisms should be identified with the different pattern of results obtained when testing occurred either 30 min or 24 hours after the retrieval-extinction protocol (at least, not the specific pattern of results obtained here).

      Again, we are afraid that the reviewer referred to the abstract in the original manuscript submission, instead of the revised abstract we submitted in the last round. Please see lines 37-39 of the revised abstract, where the sentence was already modified (or the abstract from the last round of revision).

      The fact that the 30 min, 6 hr and 24 hr test results differ in terms of their cue-specificity and thought-control ability dependence is, to us, an important discovery in terms of delineating different cognitive processes at work following the retrieval-extinction paradigm. We want to emphasize that the fear memories after going through the retrieval-extinction paradigm showed interesting temporal dynamics in terms of their magnitudes, cue-specificity and thought-control ability dependence.

      (d) The authors state that: "We hypothesize that the labile state triggered by the memory retrieval may facilitate different memory update mechanisms following extinction training, and these mechanisms can be further disentangled through the lens of temporal dynamics and cue-specificities." *** The first part of the sentence is confusing around usage of the term "facilitate"; and the second part of the sentence that references a "lens of temporal dynamics and cue-specificities" is mysterious. Indeed, as all rats received the same retrieval-extinction exposures in Study 2, it is not clear how or why any differences between the groups are attributed to "different memory update mechanisms following extinction"

      The term “facilitate” was used to highlight the fact that the short-term fear amnesia effect is also memory retrieval dependent, as study 1 demonstrated. The novel short-term fear memory deficit can be distinguished from the long-term memory effect via cue-specificity and thought-control ability dependence. The sentence has been modified (lines 97-101) as follows:

      “We hypothesize that the labile state triggered by the memory retrieval may facilitate different memory deficits following extinction training, and these deficits can be further disentangled through the lens of temporal dynamics and cue-specificities. In theory, different cognitive mechanisms underlying specific fear memory deficits, therefore, can be inferred based on the difference between memory deficits.”

      Data

      (6A) The eight participants who were discontinued after Day 1 in Study 1 were all from the no reminder group. The authors should clarify how participants were allocated to the two groups in this experiment so that the reader can better understand why the distribution of non-responders was non-random (as it appears to be).

      (6B) Similarly, in study 2, of the 37 participants that were discontinued after Day 2, 19 were from Group 30 min and 5 were from Group 6 hours. The authors should comment on how likely these numbers are to have been by chance alone. I presume that they reflect something about the way that participants were allocated to groups: e.g., the different groups of participants in studies 1 and 2 could have been run at quite different times (as opposed to concurrently). If this was done, why was it done? I can't see why the study should have been conducted in this fashion - this is for myriad reasons, including the authors' concerns re SCRs and their seasonal variations.

      As we explained in our previous response letters (as well as in the revised manuscript), subjects were excluded because their SCR did not reach the threshold of 0.02 S when the electric shock was applied. Subjects were assigned to different treatments by day (e.g. Day 1 for the reminder group and Day 2 for the no-reminder group) to avoid potential confusion from switching protocols between subjects within the same day. We suspect that the non-responders might be related to body thermal conditions caused by the lack of central heating on specific dates. Please note that the discontinued subjects (non-responders) were let go immediately after the failure to detect their SCR (< 0.02 S) on Day 1 and were never invited back on Day 2, so it is possible that the discontinued subjects all came from certain dates on which the body thermal conditions were not ideal for SCR collection. Despite the number of excluded subjects, we verified the short-term fear amnesia effect in three separate studies, which to us should serve as strong evidence for the validity of the effect.

      (6C) In study 2, why is responding to the CS- so high on the first test trial in Group 30 min? Is the change in responding to the CS- from the last extinction trial to the first test trial different across the three groups in this study? Inspection of the figure suggests that it is higher in Group 30 min relative to Groups 6 hours and 24 hours. If this is confirmed by the analysis, it has implications for the fear recovery index which is partly based on responses to the CS-. If not for differences in the CS- responses, Groups 30 min and 6 hours are otherwise identical. That is, the claim of differential recovery to the CS1 and CS2 across time may simply be an artefact of the way that the recovery index was calculated. This is unfortunate but also an important feature of the data given the way in which the fear recovery index was calculated.

      We provided a detailed analysis of this question in our previous response letter, and we reproduce that response here:

      Following the reviewer’s comments, we went back and calculated the mean SCR difference of CS- between the first test trial and the last extinction trial for all three studies (see Author response image 1 below). In study 1, there was no difference in the mean CS- SCR (between the first test trial and last extinction trial) between the reminder and no-reminder groups (Kruskal-Wallis test), though both groups showed significant fear recovery even in the CS- condition (Wilcoxon signed rank test, reminder: P = 0.0043, no-reminder: P = 0.0037). Next, we examined the mean SCR for CS- for the 30min, 6h and 24h groups in study 2 and found that there was indeed a group difference (one-way ANOVA, F<sub>2,76</sub> = 5.3462, P = 0.0067, panel b), suggesting that the CS- related SCR was influenced by the test time (30min, 6h or 24h). We also tested the CS- related SCR for the 4 groups in study 3 (where the test was conducted 1 hour after the retrieval-extinction training) and found that, across TMS stimulation types (PFC vs. VER) and reminder types (reminder vs. no-reminder), the ANOVA yielded neither a main effect of TMS stimulation type (F<sub>1,71</sub> = 0.322, P = 0.572) nor a main effect of reminder type (F<sub>1,71</sub> = 0.0499, P = 0.824, panel c). We added the R-VER group results in study 3 (see panel c) to panel b, plotted the CS- SCR difference across 4 different test time points, and found that CS- SCR decreased as the test-extinction delay increased (Jonckheere-Terpstra test, P = 0.00028). These results suggest a natural “forgetting” tendency for CS- related SCR and highlight the importance of having the CS- as a control condition against which the CS+ related SCR is compared.

      Author response image 1.

      (6D) The 6 hour group was clearly tested at a different time of day compared to the 30 min and 24 hour groups. This could have influenced the SCRs in this group and, thereby, contributed to the pattern of results obtained.

      Again, we answered this question in our previous response. Please see the following for our previous response:

      For the 30min and 24h groups, the test phase can be arranged in the morning, in the afternoon or at night. However, for the 6h group, the test phase was inevitably in the afternoon or at night since we wanted to exclude the potential influence of night sleep on the expression of fear memory (see Author response table 1 below). If we restricted the test time in the afternoon or at night for all three groups, then the timing of their extinction training was not matched.

      Author response table 1.

      Nevertheless, we also went back and examined the data for the subjects tested only in the afternoon or at night in the 30min and 24h groups, to match the 6h group, in which all the subjects were tested either in the afternoon or at night. According to the table above, we have 17 subjects for the 30min group (9 + 8), 18 subjects for the 24h group (9 + 9) and 26 subjects for the 6h group (12 + 14). As Author response image 2 shows, the SCR patterns in the fear acquisition, extinction and test phases were similar to the results presented in the original figure.

      Author response image 2.

      (6E) The authors find different patterns of responses to CS1 and CS2 when they were tested 30 min after extinction versus 24 h after extinction. On this basis, they infer distinct memory update mechanisms. However, I still can't quite see why the different patterns of responses at these two time points after extinction need to be taken to infer different memory update mechanisms. That is, the different patterns of responses at the two time points could be indicative of the same "memory update mechanism" in the sense that the retrieval-extinction procedure induces a short-term memory suppression that serves as the basis for the longer-term memory suppression (i.e., the reconsolidation effect). My pushback on this point is based on the notion of what constitutes a memory update mechanism; and is motivated by what I take to be a rather loose use of language/terminology in the reconsolidation literature and this paper specifically (for examples, see the title of the paper and line 2 of the abstract).

      As we mentioned previously, the term “mechanism” might have different connotations for different researchers. We aim to report a novel memory deficit following the retrieval-extinction paradigm, which differed significantly from the purported reconsolidation-related long-term fear amnesia in terms of its timescale, cue-specificity and thought-control ability dependence. A further TMS study confirmed that intact dlPFC function is necessary for the short-term memory deficit. It is based on these results that we proposed that the short-term fear amnesia might be related to a different cognitive “mechanism”. As mentioned above, we now clarify what we mean by “mechanism” in the abstract and introduction (lines 31-34, 97-101).

      Reviewer #2 (Public review):

      The fear acquisition data is converted to a differential fear SCR and this is what is analysed (early vs late). However, the figure shows the raw SCR values for CS+ and CS- and therefore it is unclear whether acquisition was successful (despite there being an "early" vs "late" effect - no descriptives are provided).

      (1) There are still no descriptive statistics to substantiate learning in Experiment 1.

      We answered this question in our previous response letter. We are sorry that the definition of “early” and “late” trials was scattered across the manuscript. For example, we wrote “the late phase of acquisition (last 5 trials)” (Line 375-376) in the results section. Since there were 10 trials in total in the acquisition stage, we define the first 5 trials and the last 5 trials as the “early” and “late” phases of the acquisition stage, and we now state this explicitly where the terms “early” and “late” first appear (lines 316-318).

      In the results section, we did test whether the acquisition was successful in our previous manuscript (Line 316-325):

      “To assess fear acquisition across groups (Figure 1B and C), we conducted a mixed two-way ANOVA of group (reminder vs. no-reminder) x time (early vs. late part of the acquisition; first 5 and last 5 trials, correspondingly) on the differential fear SCR. Our results showed a significant main effect of time (early vs. late; F<sub>1,55</sub> = 6.545, P = 0.013, η<sup>2</sup> = 0.106), suggesting successful fear acquisition in both groups. There was no main effect of group (reminder vs. no-reminder) or the group x time interaction (group: F<sub>1,55</sub> = 0.057, P = 0.813, η<sup>2</sup> = 0.001; interaction: F<sub>1,55</sub> = 0.066, P = 0.798, η<sup>2</sup> = 0.001), indicating similar levels of fear acquisition between two groups. Post-hoc t-tests confirmed that the fear responses to the CS+ were significantly higher than that of CS- during the late part of acquisition phase in both groups (reminder group: t<sub>29</sub> = 6.642, P < 0.001; no-reminder group: t<sub>26</sub> = 8.522, P < 0.001; Figure 1C). Importantly, the levels of acquisition were equivalent in both groups (early acquisition: t<sub>55</sub> = -0.063, P = 0.950; late acquisition: t<sub>55</sub> = -0.318, P = 0.751; Figure 1C).”

      In Experiment 1 (Test results) it is unclear whether the main conclusion stems from a comparison of the test data relative to the last extinction trial ("we defined the fear recovery index as the SCR difference between the first test trial and the last extinction trial for a specific CS") or the difference relative to the CS- ("differential fear recovery index between CS+ and CS-"). It would help the reader assess the data if Fig 1e presents all the indexes (both CS+ and CS-). In addition, there is one sentence which I could not understand "there is no statistical difference between the differential fear recovery indexes between CS+ in the reminder and no reminder groups (P=0.048)". The p value suggests that there is a difference, yet it is not clear what is being compared here. Critically, any index taken as a difference relative to the CS- can indicate recovery of fear to the CS+ or absence of discrimination relative to the CS-, so ideally the authors would want to directly compare responses to the CS+ in the reminder and no-reminder groups. In the absence of such comparison, little can be concluded, in particular if SCR CS- data is different between groups. The latter issue is particularly relevant in Experiment 2, in which the CS- seems to vary between groups during the test and this can obscure the interpretation of the result.

      (2) In the revised analyses, the authors now show that CS- changes in different groups (for example, Experiment 2) so this means that there is little to conclude from the differential scores because these depend on CS-. It is unclear whether the effects arise from CS+ performance or the differential which is subject to CS- variations.

      There was a typo in the “P = 0.048” sentence and we have corrected it in our last response letter. Also in the previous response letter, we specifically addressed how the fear recovery index was defined (also in the revised manuscript).

      In most fear conditioning studies, CS- trials are included as the baseline control, and most of the analyses conducted also involve comparisons between different groups. Directly comparing CS+ trials across groups (or conditions) is rare. In our study 2, we showed that the CS- response decreased as a function of testing delay (30min, 1hr, 6hr and 24hr). Ideally, it would be nice to show that the CS- response did not change across groups/conditions. However, even in those circumstances, comparisons are still based on the differential CS response (CS+ minus CS-), that is, the difference of differences. It is also important to note that the difference score matters because the CS+ response alone, or across conditions, is difficult to interpret, especially in humans, due to noise, signal fluctuations, and irrelevant stimulus features; a trial-wise reference is therefore essential to assess the CS+ in the context of a reference stimulus in each trial (after all, the baselines are different). We list a few influential papers in the field in which the CS- responses were not equivalent across groups/conditions, and argue that this is a routine procedure (Kindt & Soeter 2018 Figs. 2-3; Sevenster et al., 2013 Fig. 3; Liu et al., 2014 Fig. 1; Raio et al., 2017 Fig. 2).
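To make the two indices under discussion concrete, the following minimal sketch computes the fear recovery index (first test trial minus last extinction trial for one CS) and the differential index (CS+ recovery minus CS- recovery). The function names and SCR values are hypothetical and chosen only to illustrate the arithmetic; they are not taken from the study data or the authors' analysis code.

```python
def recovery_index(last_extinction_scr, first_test_scr):
    """Fear recovery index for one CS: first test trial minus last extinction trial."""
    return first_test_scr - last_extinction_scr


def differential_recovery_index(cs_plus, cs_minus):
    """Differential index: CS+ recovery minus CS- recovery.

    Each argument is a (last_extinction_scr, first_test_scr) pair.
    """
    return recovery_index(*cs_plus) - recovery_index(*cs_minus)


# Hypothetical SCR values, chosen only to illustrate the arithmetic.
# Subject A: the CS+ response rises while the CS- response stays flat.
subject_a = differential_recovery_index(cs_plus=(0.10, 0.30), cs_minus=(0.10, 0.10))
# Subject B: the CS+ response rises more, but the CS- response rises too.
subject_b = differential_recovery_index(cs_plus=(0.10, 0.40), cs_minus=(0.10, 0.20))
print(round(subject_a, 3), round(subject_b, 3))
```

In this made-up example the two subjects end up with the same differential index even though their CS+ and CS- changes differ, which is why it can help to inspect the per-CS indices alongside the differential score.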

      In experiment 1, the findings suggest that there is a benefit of retrieval followed by extinction in a short-term reinstatement test. In Experiment 2, the same effect is observed to a cue which did not undergo retrieval before extinction (CS2+), a result that is interpreted as resulting from cue-independence, rather than a failure to replicate in a within-subjects design the observations of Experiment 1 (between-subjects). Although retrieval-induced forgetting is cue-independent (the effect on items that are suppressed [Rp-] can be observed with an independent probe), it is not clear that the current findings are similar, and thus that the strong parallels made are not warranted. Here, both cues have been extinguished and therefore been equally exposed during the critical stage.

      (3) The notion that suppression is automatic is speculative at best

      We responded to the same question in our previous revision. Please note that study 1 (the comparison between reminder and no-reminder groups), with only one CS+, was not set up to test the cue-independence hypothesis for the short-term amnesia. Results from both study 2 (30min condition) and study 3 confirmed the cue-independence hypothesis, and we therefore do not believe that the results from study 2 should be interpreted as “a failure to replicate in a within-subject design of the observations of Experiment 1”.

      We agree that the proposal of automatic or unconscious memory suppression is speculative, and that is why we mentioned it in the discussion. The timescale, cue-specificity and thought-control ability dependence of the short-term fear amnesia identified in our studies were reminiscent of the memory suppression effects reported in the previous literature. Memory suppression studies have typically adopted a conscious “suppression” treatment (such as the think/no-think paradigm), which was absent in the current study. However, retrieval-induced forgetting (RIF), which is also considered a memory suppression paradigm via inhibitory control, does not require conscious effort to suppress any particular thought. Based on these results and the extant literature, we raised the possibility of memory suppression as a potential mechanism. We make clear in the discussion that the suppression hypothesis and connections with RIF will require further evidence (lines 615-616):

      “future research will be needed to investigate whether the short-term effect we observed is specifically related to associative memory or the spontaneous nature of suppression as in RIF (Figure 6C).”

      (4) I still struggle with the parallels between these findings and the "limbo" literature. Here you manipulated the retention interval, whereas in the cited studies the number of extinction (exposure) was varied. These are two completely different phenomena.

      We borrowed the “limbo” term to stress the transitioning from short-term to long-term memory deficits (the 6hr test group). Merlo et al. (2014) found that memory reconsolidation and extinction were dissociable processes depending on the extent of memory retrieval. They argued that there was a “limbo” transitional state, where neither the reconsolidation nor the extinction process was engaged. Our results suggest that at the test delay of 6hr, neither the short-term nor the long-term effect was present, signaling a “transitional” state after which the short-term memory deficit wanes and the long-term deficit starts to take over. We make this idea more explicit as follows (lines 622-626):

      “These works identified important “boundary conditions” of memory retrieval in affecting the retention of the maladaptive emotional memories. In our study, however, we showed that even within a boundary condition previously thought to elicit memory reconsolidation, mnemonic processes other than reconsolidation could also be at work, and these processes jointly shape the persistence of fear memory.”

      (5) My point about the data problematic for the reconsolidation (and consolidation) frameworks is that they observed memory in the absence of the brain substrates that are needed for memory to be observed. The answer did not address this. I do not understand how the latent cause model can explain this, if the only difference is the first ITI. Wouldn't participants fail to integrate extinction with acquisition with a longer ITI?

      We take the sentence “they observed memory in the absence of the brain substrates that are needed for memory to be observed” as referring to the long-term memory deficit in our study. As we responded before, the aim of this manuscript was not about investigating the brain substrates involved in memory reconsolidation (or consolidation). Using a memory retrieval-extinction paradigm, we discovered a novel short-term memory effect, which differed from the purported reconsolidation effect in terms of timescale, cue-specificity and thought-control ability dependence. We further showed that both memory retrieval and intact dlPFC functions were necessary to observe the short-term memory deficit effect. Therefore, we conclude that the brain mechanism involved in such an effect should be different from the one related to the purported reconsolidation effect. We make this idea more explicit as follows (lines 546-547):

      “Therefore, findings of the short-term fear amnesia suggest that the reconsolidation framework falls short to accommodate this more immediate effect (Figure 6A and B).”

      Whilst I could access the data in the OSF site, I could not make sense of the Matlab files as there is no signposting indicating what data is being shown in the files. Thus, as it stands, there is no way of independently replicating the analyses reported.

      (6) The materials in the OSF site are the same as before, they haven't been updated.

      Last time we thought the main issue was that the OSF site was not publicly accessible, and we therefore made it open to all visitors. We have now added a descriptive file that explains the variables, to help visitors replicate our analyses.

      (7) Concerning supplementary materials, the robustness tests are intended to prove that you 1) can get the same results by varying the statistical models or 2) you can get the same results when you include all participants. Here authors have done both so this does not help. Also, in the rebuttal letter, they stated "Please note we did not include non-learners in these analyses " which contradicts what is stated in the figure captions "(learners + non learners)"

      In the supplementary materials, we performed the two analyses (varying the statistical model, and including both learners and non-learners) separately rather than in combination. In fact, in supplementary Figs. 1 & 2, we included all the participants (learners + non-learners), performed an analysis similar to that in the main text, and found similar results. Also, in the text of the supplementary material, we applied a different statistical analysis method to learners only (i.e., we analyzed the subjects reported in the main text using a different method) and obtained similar results. We believe this is exactly what the reviewer suggested. There also seems to be a misunderstanding of the "Please note we did not include non-learners in these analyses" sentence in the rebuttal letter. As the reviewer can see, the full sentence read “Please note we did not include non-learners in these analyses (the texts of the supplementary materials)”. We meant that the figures and text in the supplementary material reflect two approaches: 1) figures depicting the re-analysis with all the included subjects (learners + non-learners); 2) text describing a different analysis with learners only. We added clarifications to emphasize these approaches in the supplementary materials.

      (8) Finally, the literature suggesting that reconsolidation interference "eliminates" a memory is not substantiated by data nor in line with current theorising, so I invite a revision of these strong claims.

      We agree and have toned down the strong claims.

      Overall, I conclude that the revised manuscript did not address my main concerns.

      In both rounds of responses, we tried our best to address the reviewer’s concerns. We hope that the clarifications in this letter and revisions in the text address the remaining concerns. Thank you for your feedback.

      Reference:

      Kindt, M. and Soeter, M. 2018. Pharmacologically induced amnesia for learned fear is time and sleep dependent. Nat Commun, 9, 1316.

      Liu, J., Zhao, L., Xue, Y., Shi, J., Suo, L., Luo, Y., Chai, B., Yang, C., Fang, Q., Zhang, Y., Bao, Y., Pickens, C. L. and Lu, L. 2014. An unconditioned stimulus retrieval extinction procedure to prevent the return of fear memory. Biol Psychiatry, 76, 895-901.

      Raio, C. M., Hartley, C. A., Orederu, T. A., Li, J. and Phelps, E. A. 2017. Stress attenuates the flexible updating of aversive value. Proc Natl Acad Sci U S A, 114, 11241-11246.

      Sevenster, D., Beckers, T. and Kindt, M. 2013. Prediction error governs pharmacologically induced amnesia for learned fear. Science, 339, 830-833.

    1. Author response:

      The following is the authors’ response to the current reviews.

      eLife Assessment

      This study offers valuable insights into how humans detect and adapt to regime shifts, highlighting distinct contributions of the frontoparietal network and ventromedial prefrontal cortex to sensitivity to signal diagnosticity and transition probabilities. The combination of an innovative task design, behavioral modeling, and model-based fMRI analyses provides a solid foundation for the conclusions; however, the neuroimaging results have several limitations, particularly a potential confound between the posterior probability of a switch and the passage of time that may not be fully controlled by including trial number as a regressor. The control experiments intended to address this issue also appear conceptually inconsistent and, at the behavioral level, while informing participants of conditional probabilities rather than requiring learning is theoretically elegant, such information is difficult to apply accurately, as shown by well-documented challenges with conditional reasoning and base-rate neglect. Expressing these probabilities as natural frequencies rather than percentages may have improved comprehension. Overall, the study advances understanding of belief updating under uncertainty but would benefit from more intuitive probabilistic framing and stronger control of temporal confounds in future work.

      We thank the editors for the assessment. The editor added several limitations based on the new reviewer 3 in this round, which we address below.

      With regard to temporal confounds, we clarified in the main text and response to Reviewer 3 that we had already addressed the potential confound between posterior probability of a switch and passage of time in GLM-2 with the inclusion of intertemporal prior. After adding intertemporal prior in the GLM, we still observed the same fMRI results on probability estimates. In addition, we did two other robustness checks, which we mentioned in the manuscript.

      With regard to response mode (probability estimation rather than choice or indicating natural frequencies), we wish to point out that previous research by Massey and Wu (2005), on which the current study was based, already addressed the concern that participants' system-neglect tendencies might be due to the mode of information delivery, namely indicating beliefs by reporting probability estimates rather than through choice or another response mode. Massey and Wu (2005, Study 3) found the same biases when participants performed a choice task that did not require them to indicate probability estimates.

      With regard to the control experiments, they were in fact not intended to address the confound between posterior probability and passage of time. Rather, they aimed to address whether the neural findings were unique to change detection (Experiment 2) and to address visual and motor confounds (Experiment 3). These aims and the results of the control experiments are described on pages 18-19.

      Finally, we wish to highlight that we performed detailed model comparisons following reviewer 2’s suggestions. Although reviewer 2 was unable to re-review the manuscript, we believe this provides insight into the literature on change detection. See “Incorporating signal dependency into system-neglect model led to better models for regime-shift detection” (p.27-30). The model comparison showed that system-neglect models that incorporate signal dependency describe participants’ probability estimates better than the original system-neglect model. This suggests that people respond to change-consistent and change-inconsistent signals differently when judging whether the regime has changed. This was not reported in previous behavioral studies and was largely inspired by the neural finding on signal dependency in the frontoparietal cortex. It indicates that neural findings can provide novel insights into computational modeling of behavior.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The study examines human biases in a regime-change task, in which participants have to report the probability of a regime change in the face of noisy data. The behavioral results indicate that humans display systematic biases, in particular, overreaction in stable but noisy environments and underreaction in volatile settings with more certain signals. fMRI results suggest that a frontoparietal brain network is selectively involved in representing subjective sensitivity to noise, while the vmPFC selectively represents sensitivity to the rate of change.

      Strengths:

      - The study relies on a task that measures regime-change detection primarily based on descriptive information about the noisiness and rate of change. This distinguishes the study from prior work using reversal-learning or change-point tasks in which participants are required to learn these parameters from experiences. The authors discuss these differences comprehensively.

      - The study uses a simple Bayes-optimal model combined with model fitting, which seems to describe the data well. The model is comprehensively validated.

      - The authors apply model-based fMRI analyses that provide a close link to behavioral results, offering an elegant way to examine individual biases.

      We thank the reviewer for the comments.

      Weaknesses:

      The authors have adequately addressed most of my prior concerns.

      We thank the reviewer for recognizing our effort in addressing your concerns.

      My only remaining comment concerns the z-test of the correlations. I agree with the non-parametric test based on bootstrapping at the subject level, providing evidence for significant differences in correlations within the left IFG and IPS.

      However, the parametric test seems inadequate to me. The equation presented is described as the Fisher z-test, but the numerator uses the raw correlation coefficients (r) rather than the Fisher-transformed values (z). To my understanding, the subtraction should involve the Fisher z-scores, not the raw correlations.
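The point about the numerator can be illustrated with a minimal sketch of the standard Fisher z-test for two correlations from independent samples, in which the difference is taken between Fisher-transformed values rather than raw correlations. The function name and the example inputs are hypothetical and not taken from the manuscript.

```python
import math


def fisher_z_independent(r1, n1, r2, n2):
    """Standard Fisher z-test for correlations from two INDEPENDENT samples.

    Returns the z statistic and a two-tailed p-value.
    """
    # The numerator uses Fisher-transformed correlations, z = atanh(r).
    z1, z2 = math.atanh(r1), math.atanh(r2)
    # The standard error depends on the size of each independent sample.
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-tailed p-value from the standard normal distribution.
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_tailed


# Hypothetical inputs, for illustration only.
z_stat, p_value = fisher_z_independent(r1=0.5, n1=103, r2=0.0, n2=103)
print(z_stat, p_value)
```

Because this form assumes independent samples, it does not apply when both correlations are computed on the same subjects.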

      More importantly, the Fisher z-test in its standard form assumes that the correlations come from independent samples, as reflected in the denominator (which uses the n of each independent sample). However, in my opinion, the two correlations are not independent but computed within-subject. In such cases, parametric tests should take into account the dependency. I believe one appropriate method for the current case (correlated correlation coefficients sharing a variable [behavioral slope]) is explained here:

      Meng, X.-l., Rosenthal, R., & Rubin, D. B. (1992). Comparing correlated correlation coefficients. Psychological Bulletin, 111(1), 172-175. https://doi.org/10.1037/0033-2909.111.1.172

      It should be implemented here:

      Diedenhofen B, Musch J (2015) cocor: A Comprehensive Solution for the Statistical Comparison of Correlations. PLoS ONE 10(4): e0121945. https://doi.org/10.1371/journal.pone.0121945

      My recommendation is to verify whether my assumptions hold, and if so, perform a test that takes correlated correlations into account. Or, to focus exclusively on the non-parametric test.

      In any case, I recommend a short discussion of these findings and how the authors interpret that some of the differences in correlations are not significant.

      Thank you for the careful check. Yes, this was indeed a mistake on our part. We also agree that the two correlations are not independent. Therefore, we replaced the test with one that accounts for dependent correlations, following Meng et al. (1992) as suggested by the reviewer.

      We referred to the correlation between neural and behavioral sensitivity at change-consistent (blue) signals as 𝑟<sub>𝑏𝑙𝑢𝑒</sub>, and that at change-inconsistent (red) signals as 𝑟<sub>𝑟𝑒𝑑</sub>. To statistically compare these two correlations, we adopted the approach of Meng et al. (1992), which specifically tests differences between dependent correlations according to the following equation:

      z = (z_{r_1} - z_{r_2}) \sqrt{\frac{N - 3}{2 (1 - r_x) h}}

      where 𝑁 is the number of subjects, 𝑧<sub>𝑟𝑖</sub> is the Fisher z-transformed value of 𝑟<sub>𝑖</sub>, 𝑟<sub>1</sub> = 𝑟<sub>𝑏𝑙𝑢𝑒</sub> and 𝑟<sub>2</sub> = 𝑟<sub>𝑟𝑒𝑑</sub>. 𝑟<sub>𝑥</sub> is the correlation between the neural sensitivity at change-consistent signals and change-inconsistent signals. The term h is given by

      h = \frac{1 - f \bar{r}^2}{1 - \bar{r}^2}, \quad f = \frac{1 - r_x}{2 (1 - \bar{r}^2)}

      where \bar{r}^2 is the mean of the 𝑟<sub>𝑖</sub><sup>2</sup>, and 𝑓 should be set to 1 if 𝑓 > 1.
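
      The dependent-correlation test of Meng et al. (1992) referred to above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `meng_z_test` and the input values are hypothetical, and the one-tailed p-value is taken from the standard normal upper tail.

```python
import math


def meng_z_test(r1, r2, rx, n):
    """z-test for two dependent correlations (Meng, Rosenthal & Rubin, 1992).

    r1, r2: correlations of two measures with the same third variable.
    rx:     correlation between the two measures themselves.
    n:      number of subjects.
    Returns (z, one-tailed p for the hypothesis r1 > r2).
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)            # Fisher z-transform
    r2_bar = (r1 ** 2 + r2 ** 2) / 2.0                 # mean of squared correlations
    f = min((1.0 - rx) / (2.0 * (1.0 - r2_bar)), 1.0)  # f is capped at 1
    h = (1.0 - f * r2_bar) / (1.0 - r2_bar)
    z = (z1 - z2) * math.sqrt((n - 3) / (2.0 * (1.0 - rx) * h))
    p_one_tailed = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z > z)
    return z, p_one_tailed


# Hypothetical inputs, for illustration only.
z, p = meng_z_test(r1=0.5, r2=0.1, rx=0.3, n=30)
print(f"z = {z:.4f}, one-tailed p = {p:.4f}")
```

      Because the shared variable enters through 𝑟<sub>𝑥</sub>, the test gains power as the two neural measures become more strongly correlated, which the independent-samples Fisher test ignores.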

      We found that for two of the five ROIs in the frontoparietal network, namely the left IFG and left IPS, the difference in correlation was significant (one-tailed z test; left IFG: 𝑧 = 1.8908, 𝑝 = 0.0293; left IPS: 𝑧 = 2.2584, 𝑝 = 0.0049). For the remaining three ROIs, the difference in correlation was not significant (dmPFC: 𝑧 = 0.9522, 𝑝 = 0.1705; right IFG: 𝑧 = 0.9860, 𝑝 = 0.1621; right IPS: 𝑧 = 1.4833, 𝑝 = 0.0690). We chose a one-tailed test because we already knew that the correlation under the blue signals was significantly greater than 0. These updated results are consistent with the nonparametric tests we had already performed, and we will update them in the revised manuscript.

      Reviewer #3 (Public review):

      This study concerns how observers (human participants) detect changes in the statistics of their environment, termed regime shifts. To make this concrete, a series of 10 balls are drawn from an urn that contains mainly red or mainly blue balls. If there is a regime shift, the urn is changed over (from mainly red to mainly blue) at some point in the 10 trials. Participants report their belief that there has been a regime shift as a % probability. Their judgement should (mathematically) depend on the prior probability of a regime shift (which is set at one of three levels) and the strength of evidence (also one of three levels, operationalized as the proportion of red balls in the mostly-blue urn and vice versa). Participants are directly instructed of the prior probability of regime shift and proportion of red balls, which are presented on-screen as numerical probabilities. The task therefore differs from most previous work on this question in that probabilities are instructed rather than learned by observation, and beliefs are reported as numerical probabilities rather than being inferred from participants' choice behaviour (as in many bandit tasks, such as Behrens 2007 Nature Neurosci).
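
      The normative judgement described above can be sketched as a simple forward recursion. This is an illustration under stated assumptions, not the authors' model code: the function name `regime_shift_posterior` is hypothetical, the two urns are assumed symmetric (the same proportion `d` of the dominant colour), a shift is assumed possible just before each draw with probability `q`, and once it has occurred the regime does not switch back.

```python
def regime_shift_posterior(signals, q, d):
    """Posterior probability, after each draw, that the regime has shifted.

    signals: sequence of 'r' (red ball) or 'b' (blue ball).
    q:       per-period probability of a shift from the red to the blue urn.
    d:       proportion of the dominant colour in each urn (assumed symmetric).
    """
    p = 0.0  # the trial starts in the red regime
    history = []
    for s in signals:
        # A shift may occur before this draw (no shift back is assumed).
        prior = p + (1.0 - p) * q
        like_blue = d if s == "b" else 1.0 - d  # P(signal | blue regime)
        like_red = 1.0 - d if s == "b" else d   # P(signal | red regime)
        # Bayes' rule combines the updated prior with the new signal.
        p = prior * like_blue / (prior * like_blue + (1.0 - prior) * like_red)
        history.append(p)
    return history


# Hypothetical sequence of five draws.
posterior = regime_shift_posterior("rrbrb", q=0.05, d=0.8)
print([round(x, 3) for x in posterior])
```

      An optimal observer's reports should therefore move with both `q` and `d`; the behavioural finding summarised in this review is that participants' reports vary less with these two quantities than the recursion prescribes.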

      The key behavioural finding is that participants over-estimate the prior probability of regime change when it is low, and under-estimate it when it is high; and participants over-estimate the strength of evidence when it is low and under-estimate it when it is high. In other words, participants make much less distinction between the different generative environments than an optimal observer would. This is termed 'system neglect'. A neuroeconomic-style mathematical model is presented and fit to data.

      Functional MRI results show that strength of evidence for a regime shift (roughly, the surprise associated with a blue ball from an apparently red urn) is associated with activity in the frontal-parietal orienting network. Meanwhile, at time-points where the probability of a regime shift is high, there is activity in another network including vmPFC. Both networks show individual differences effects, such that people who were more sensitive to strength of evidence and prior probability show more activity in the frontal-parietal and vmPFC-linked networks respectively.

      We thank the reviewer for the overall descriptions of the manuscript.

      Strengths:

      (1) The study provides a different task for looking at change-detection and how this depends on estimates of environmental volatility and sensory evidence strength, in which participants are directly and precisely informed of the environmental volatility and sensory evidence strength rather than inferring them through observation as in most previous studies.

      (2) Participants directly provide belief estimates as probabilities rather than experimenters inferring them from choice behaviour as in most previous studies.

      (3) The results are consistent with well-established findings that surprising sensory events activate the frontal-parietal orienting network whilst updating of beliefs about the world ('regime shift') activates vmPFC.

      Thank you for these assessments.

      Weaknesses:

      (1) The use of numerical probabilities (both to describe the environments to participants, and for participants to report their beliefs) may be problematic because people are notoriously bad at interpreting probabilities presented in this way, and show poor ability to reason with this information (see Kahneman's classic work on probabilistic reasoning, and how it can be improved by using natural frequencies). Therefore the fact that, in the present study, people do not fully use this information, or use it inaccurately, may reflect the mode of information delivery.

      We appreciate the reviewer’s concern on this issue. The concern was addressed in Massey and Wu (2005), where participants performed a choice task in which they were not asked to provide probability estimates (Study 3 in Massey and Wu, 2005). Instead, participants in Study 3 were asked to predict the color of the ball before seeing a signal. This was a more intuitive way of indicating their belief about a regime shift. The results from the choice task were identical to those found in the probability estimation task (Study 1 in Massey and Wu, 2005). We take this as evidence that the system-neglect behavior the participants showed was less likely to be due to the mode of information delivery.

      (2) Although a very precise model of 'system neglect' is presented, many other models could fit the data.

      For example, you would get similar effects due to attraction of parameter estimates towards a global mean - essentially application of a hyper-prior in which the parameters applied by each participant in each block are attracted towards the experiment-wise mean values of these parameters. For example, the prior probability of regime shift ground-truth values [0.01, 0.05, 0.10] are mapped to subjective values of [0.037, 0.052, 0.069]; this would occur if observers apply a hyper-prior that the probability of regime shift is about 0.05 (the average value over all blocks). This 'attraction to the mean' is a well-established phenomenon and cannot be ruled out with the current data (I suppose you could rule it out by comparing to another dataset in which the mean ground-truth value was different).
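
      The degree of compression in the reviewer's example can be made concrete with an ordinary least-squares slope of the subjective values on the ground-truth values. This is only an arithmetic illustration using the three pairs of numbers quoted above; it is not the authors' behavioral-slope analysis, and the helper name `ols_fit` is hypothetical.

```python
def ols_fit(x, y):
    """Ordinary least-squares slope and intercept of y regressed on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


true_q = [0.01, 0.05, 0.10]           # ground-truth shift probabilities
subjective_q = [0.037, 0.052, 0.069]  # subjective values quoted by the reviewer

slope, intercept = ols_fit(true_q, subjective_q)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

      A slope of 1 would mean the subjective values track the ground truth exactly; the slope here is roughly 0.35, which is the kind of compression that both system neglect and attraction to a mean of about 0.05 would produce.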

      We thank the reviewer for this comment. It is true that the system-neglect model is not entirely inconsistent with regression to the mean, regardless of whether the implementation has a hyper-prior or not. In fact, our behavioral measure of sensitivity to transition probability and signal diagnosticity, which we termed the behavioral slope, is based on linear regression analysis. In general, the modeling approach in this paper is to start from a generative model that defines ideal performance and consider modifying the generative model when systematic deviations in actual performance from the ideal are observed. In this approach, a generative model with a hyper-prior would be more complex to begin with, and the idea of regression to the mean by itself does not generate a priori predictions.

      More generally, any model in which participants don't fully use the numerical information they were given would produce apparent 'system neglect'. Four qualitatively different example reasons are: 1. Some individual participants completely ignored the probability values given. 2. Participants did not ignore the probability values given, but combined them with a hyperprior as above. 3. Participants had a reporting bias where their reported beliefs that a regime-change had occurred tend to be shifted towards 50% (rather than reporting 'confident' values such as 5% or 95%). 4. Participants underweighted probability outliers resulting in underweighting of evidence in the 'high signal diagnosticity' environment (10.1016/j.neuron.2014.01.020)

      In summary I agree that any model that fits the data would have to capture the idea that participants don't differentiate between the different environments as much as they should, but I think there are a number of qualitatively different reasons why they might do this - of which the above are only examples - hence I find it problematic that the authors present the behaviour as evidence for one extremely specific model.

      Thank you for raising this point. The modeling principle we adopt is the following. We start from the normative model—the Bayesian model—that defined what normative behavior should look like. We compared participants’ behavior with the Bayesian model and found systematic deviations from it. To explain those systematic deviations, we considered modeling options within the confines of the same modeling framework. In other words, we considered a parameterized version of the Bayesian model, which is the system-neglect model and examined through model comparison the best modeling choice. This modeling approach is not uncommon, and many would agree this is the standard approach in economics and psychology. For example, Kahneman and Tversky adopted this approach when proposing prospect theory, a modification of expected utility theory where expected utility theory can be seen as one specific model for how utility of an option should be computed.

      (3) Despite efforts to control confounds in the fMRI study, including two control experiments, I think some confounds remain.

      For example, a network of regions is presented as correlating with the cumulative probability that there has been a regime shift in this block of 10 samples (Pt). However, regardless of the exact samples shown, doesn't Pt always increase with sample number (as by the time of later samples, there have been more opportunities for a regime shift)? Unless this is completely linear, the effect won't be controlled by including trial number as a co-regressor (which was done).

      Thank you for raising this concern. Yes, Pt always increases with sample number regardless of evidence (seeing change-consistent or change-inconsistent signals). This is captured by the ‘intertemporal prior’ in the Bayesian model, which we included as a regressor in our GLM analysis (GLM-2), in addition to Pt. In short, GLM-1 had Pt and sample number. GLM-2 had Pt, intertemporal prior, and sample number, among other regressors. And we found that, in both GLM-1 and GLM-2, both vmPFC and ventral striatum correlated with Pt.
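
      The reviewer's point about nonlinearity can be illustrated by computing the probability that a shift has occurred by each period when no signals are taken into account. This is a sketch under an assumption: the intertemporal prior is taken to be 1 − (1 − q)^t for a per-period shift probability q with no shift back, which may differ in detail from the regressor used in GLM-2. The function name is hypothetical.

```python
def intertemporal_prior(q, n_periods=10):
    """Probability that a shift has occurred by period t, ignoring all signals.

    Assumes a constant per-period shift probability q and no shift back,
    so the probability of no shift after t periods is (1 - q) ** t.
    """
    return [1.0 - (1.0 - q) ** t for t in range(1, n_periods + 1)]


for q in (0.01, 0.05, 0.10):
    prior = intertemporal_prior(q)
    steps = [b - a for a, b in zip(prior, prior[1:])]
    # The period-to-period increase shrinks over time, so the curve is concave.
    print(q, round(prior[-1], 3), round(steps[0], 4), round(steps[-1], 4))
```

      Because the increments shrink over periods, a linear sample-number regressor cannot fully absorb this quantity, which is why entering it as a separate regressor addresses the concern.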

      To make this clearer, we updated the main text to further clarify this on p.18:

      On the other hand, two additional fMRI experiments are done as control experiments and the effect of Pt in the main study is compared to Pt in these control experiments. Whilst I admire the effort in carrying out control studies, I can't understand how these particular experiments are useful controls. For example in experiment 3 participants simply type in numbers presented on the screen - how can we even have an estimate of Pt from this task?

      We thank the reviewer for this comment. The purpose of Experiment 3 was to control for visual and motor confounds. In other words, if subjects saw a similar visual layout and were simply instructed to press numbers, would we still observe activity in the vmPFC, ventral striatum, and the frontoparietal network as we did in the main experiment (Experiment 1)?

      The purpose of Experiment 2 was to establish whether what we found about Pt was unique to change detection. In Experiment 2, subjects estimated the probability that the current regime is the blue regime (just as they did in Experiment 1) except that there were no regime shifts involved. In other words, it is possible that the regions we identified were generally associated with probability estimation and not specifically with change detection. We used Experiment 2 to examine whether this was the case.

      (4) The Discussion is very long, and whilst a lot of related literature is cited, I found it hard to pin down within the discussion, what the key contributions of this study are. In my opinion it would be better to have a short but incisive discussion highlighting the advances in understanding that arise from the current study, rather than reviewing the field so broadly.

      Thank you. We received differing feedback from previous reviews on what to include in the Discussion. To address the reviewer’s concern, we will revise the Discussion to better highlight the key contributions of the current study at its beginning.

      Recommendations for the authors:

      Reviewer #3 (Recommendations for the authors):

      Many of the figures are too tiny - the writing is very small, as are the pictures of brains. I'd suggest adjusting these so they will be readable without enlarging.

      Thank you. We will enlarge the figures to make them more readable.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The study examines human biases in a regime-change task, in which participants have to report the probability of a regime change in the face of noisy data. The behavioral results indicate that humans display systematic biases, in particular, overreaction in stable but noisy environments and underreaction in volatile settings with more certain signals. fMRI results suggest that a frontoparietal brain network is selectively involved in representing subjective sensitivity to noise, while the vmPFC selectively represents sensitivity to the rate of change.

      Strengths:

      (1) The study relies on a task that measures regime-change detection primarily based on descriptive information about the noisiness and rate of change. This distinguishes the study from prior work using reversal-learning or change-point tasks in which participants are required to learn these parameters from experiences. The authors discuss these differences comprehensively.

      Thank you for recognizing our contribution to the regime-change detection literature and our effort in discussing our findings in relation to the experience-based paradigms.

      (2) The study uses a simple Bayes-optimal model combined with model fitting, which seems to describe the data well.

      Thank you for recognizing the contribution of our Bayesian framework and system-neglect model.

      (3) The authors apply model-based fMRI analyses that provide a close link to behavioral results, offering an elegant way to examine individual biases.

      Thank you for recognizing our execution of model-based fMRI analyses and effort in using those analyses to link with behavioral biases.

      Weaknesses:

      My major concern is about the correlational analysis in the section "Under- and overreactions are associated with selectivity and sensitivity of neural responses to system parameters", shown in Figures 5c and d (and similarly in Figure 6). The authors argue that a frontoparietal network selectively represents sensitivity to signal diagnosticity, while the vmPFC selectively represents transition probabilities. This claim is based on separate correlational analyses for red and blue across different brain areas. The authors interpret the finding of a significant correlation in one case (blue) and an insignificant correlation (red) as evidence of a difference in correlations (between blue and red) but don't test this directly. This has been referred to as the "interaction fallacy" (Nieuwenhuis et al., 2011; Makin & Orban de Xivry 2019). Not directly testing the difference in correlations (but only the differences to zero for each case) can lead to wrong conclusions. For example, in Figure 5c, the correlation for red is r = 0.32 (not significantly different from zero) and the correlation for blue is r = 0.48 (different from zero). However, the difference between the two is 0.16, and it is likely that this difference itself is not significant. From a statistical perspective, this corresponds to an interaction effect that has to be tested directly. It is my understanding that analyses in Figure 6 follow the same approach.

      Relevant literature on this point is:

      Nieuwenhuis, S, Forstmann, B & Wagenmakers, EJ (2011). Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci 14, 1105–1107. https://doi.org/10.1038/nn.2886

      Makin TR, Orban de Xivry, JJ (2019). Science Forum: Ten common statistical mistakes to watch out for when writing or reviewing a manuscript. eLife 8:e48175. https://doi.org/10.7554/eLife.48175

      There is also a blog post on simulation-based comparisons, which the authors could check out: https://garstats.wordpress.com/2017/03/01/comp2dcorr/

      I recommend that the authors carefully consider what approach works best for their purposes. It is sometimes recommended to directly compare correlations based on Monte-Carlo simulations (cf Makin & Orban). It might also be appropriate to run a regression with the dependent variable brain activity (Y) and predictors brain area (X) and the model-based term of interest (Z). In this case, they could include an interaction term in the model:

      Y = \beta_0 + \beta_1 \cdot X + \beta_2 \cdot Z + \beta_3 \cdot X \cdot Z

      The interaction term reflects if the relationship between the model term Z and brain activity Y is conditional on the brain area of interest X.
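
      When X is a binary indicator (for example, two brain areas coded 0 and 1), the interaction coefficient in the regression above equals the difference between the slopes of Y on Z fitted separately within each level of X. The sketch below uses that equivalence to obtain the point estimate only; it does not compute a standard error or p-value, the binary coding of X is an assumption, and the function names and data are hypothetical.

```python
def ols_slope(z, y):
    """Least-squares slope of y regressed on z."""
    mz = sum(z) / len(z)
    my = sum(y) / len(y)
    szy = sum((a - mz) * (b - my) for a, b in zip(z, y))
    szz = sum((a - mz) ** 2 for a in z)
    return szy / szz


def interaction_coefficient(x, z, y):
    """Point estimate of beta_3 in Y = b0 + b1*X + b2*Z + b3*X*Z for binary X.

    With X coded 0/1 the fully interacted model fits each group separately,
    so beta_3 is the slope in group 1 minus the slope in group 0.
    """
    z0 = [zi for xi, zi in zip(x, z) if xi == 0]
    y0 = [yi for xi, yi in zip(x, y) if xi == 0]
    z1 = [zi for xi, zi in zip(x, z) if xi == 1]
    y1 = [yi for xi, yi in zip(x, y) if xi == 1]
    return ols_slope(z1, y1) - ols_slope(z0, y0)


# Hypothetical noise-free data generated with beta_3 = 4.
x = [0, 0, 0, 1, 1, 1]
z = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]
y = [1 + 2 * xi + 3 * zi + 4 * xi * zi for xi, zi in zip(x, z)]
print(interaction_coefficient(x, z, y))
```

      Testing whether this coefficient differs from zero is what addresses the interaction fallacy; in practice a regression package would be used to obtain its standard error.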

      Thank you for the suggestion. In response, we tested for the difference in correlation both parametrically and nonparametrically. The results were identical. In the parametric test, we used the Fisher z transformation to transform the difference in correlation coefficients to the z statistic. That is, for two correlation coefficients, 𝑟<sub>1</sub> (with sample size 𝑛<sub>1</sub>) and 𝑟<sub>2</sub> (with sample size 𝑛<sub>2</sub>), the z statistic of the difference in correlation is given by

      z = \frac{z_{r_1} - z_{r_2}}{\sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}}}

      where z_{r_i} is the Fisher z-transformed value of 𝑟<sub>𝑖</sub>.

      We referred to the correlation between neural and behavioral sensitivity at change-consistent (blue) signals as 𝑟<sub>𝑏𝑙𝑢𝑒</sub>, and that at change-inconsistent (red) signals as 𝑟<sub>𝑟𝑒𝑑</sub>. For the Fisher z transformation, 𝑟<sub>1</sub> = 𝑟<sub>𝑏𝑙𝑢𝑒</sub> and 𝑟<sub>2</sub> = 𝑟<sub>𝑟𝑒𝑑</sub>. We found that for two of the five ROIs in the frontoparietal network, namely the left IFG and left IPS, the difference in correlation was significant (one-tailed z test; left IFG: 𝑧 = 1.8355, 𝑝 = 0.0332; left IPS: 𝑧 = 2.3782, 𝑝 = 0.0087). For the remaining three ROIs, the difference in correlation was not significant (dmPFC: 𝑧 = 0.7594, 𝑝 = 0.2238; right IFG: 𝑧 = 0.9068, 𝑝 = 0.1822; right IPS: 𝑧 = 1.3764, 𝑝 = 0.0843). We chose a one-tailed test because we already knew that the correlation under the blue signals was significantly greater than 0.
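
      For reference, the standard Fisher z-test for two independent correlations can be sketched as follows. The function names and example inputs are hypothetical, and the test is shown in its textbook form, which treats the two correlations as coming from independent samples.

```python
import math


def upper_tail_p(z):
    """One-tailed p-value P(Z > z) under the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def fisher_z_independent(r1, n1, r2, n2):
    """Fisher z-test for the difference between two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)               # Fisher z-transform
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))       # standard error of z1 - z2
    z = (z1 - z2) / se
    return z, upper_tail_p(z)


# Hypothetical correlations with 30 subjects in each sample.
z, p = fisher_z_independent(r1=0.5, n1=30, r2=0.1, n2=30)
print(f"z = {z:.4f}, one-tailed p = {p:.4f}")
```

      The same normal upper tail converts the reported z statistics into the reported one-tailed p-values; for example, z = 1.8355 gives p of about 0.033.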

      In the nonparametric test, we performed nonparametric bootstrapping to test for the difference in correlation (Efron & Tibshirani, 1994). We resampled the dataset with replacement (subject-wise) and used the resampled dataset to compute the difference in correlation. We repeated this procedure 100,000 times to estimate the distribution of the difference in correlation coefficients, and then tested for significance and estimated the p-value based on this distribution. Consistent with our parametric tests, here we also found that the difference in correlation was significant in left IFG and left IPS (left IFG: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.46, 𝑝 = 0.0496; left IPS: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.5306, 𝑝 = 0.0041), but was not significant in dmPFC, right IFG, and right IPS (dmPFC: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.1634, 𝑝 = 0.1919; right IFG: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.2123, 𝑝 = 0.1681; right IPS: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.3434, 𝑝 = 0.0631).
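
      The subject-wise bootstrap described above can be sketched as follows. Several details are assumptions rather than the authors' procedure: the one-tailed p-value is taken as the proportion of bootstrap differences at or below zero, degenerate resamples with zero variance are redrawn, and all names and data are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation, or None if either variable has zero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def bootstrap_corr_difference(neural_blue, neural_red, behavior,
                              n_boot=10_000, seed=0):
    """Bootstrap distribution of r_blue - r_red, resampling whole subjects."""
    rng = random.Random(seed)
    n = len(behavior)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample subjects with replacement
        b = [behavior[i] for i in idx]
        r_blue = pearson_r([neural_blue[i] for i in idx], b)
        r_red = pearson_r([neural_red[i] for i in idx], b)
        if r_blue is None or r_red is None:
            continue  # redraw degenerate resamples
        diffs.append(r_blue - r_red)
    # One-tailed p-value (assumed definition): share of differences <= 0.
    p = sum(d <= 0 for d in diffs) / n_boot
    return diffs, p


# Hypothetical data for 30 subjects.
behavior = [float(i) for i in range(30)]
neural_blue = [b + (i * 7) % 5 for i, b in enumerate(behavior)]
neural_red = [float((i * 13) % 7) for i in range(30)]
diffs, p = bootstrap_corr_difference(neural_blue, neural_red, behavior, n_boot=2000)
print(f"mean difference = {sum(diffs) / len(diffs):.3f}, p = {p:.4f}")
```

      Resampling whole subjects keeps each subject's blue, red, and behavioral values together, so the dependence between the two correlations is preserved in every resample.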

      In summary, we found that neural sensitivity to signal diagnosticity in the frontoparietal network measured at change-consistent signals significantly correlated with individual subjects’ behavioral sensitivity to signal diagnosticity (𝑟<sub>𝑏𝑙𝑢𝑒</sub>). By contrast, neural sensitivity to signal diagnosticity measured at change-inconsistent signals did not significantly correlate with behavioral sensitivity (𝑟<sub>𝑟𝑒𝑑</sub>). The difference in correlation, 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub>, however, was statistically significant in some (left IPS and left IFG) but not all brain regions within the frontoparietal network.

      To incorporate these updates, we added descriptions of the methods and results in the revised manuscript. In the Results section (p.26-27):

      “We further tested, for each brain region, whether the difference in correlation was significant using both parametric and nonparametric tests (see Parametric and nonparametric tests for difference in correlation coefficients in Methods). The results were identical. In the parametric test, we used the Fisher 𝑧 transformation to transform the difference in correlation coefficients to the 𝑧 statistic. We found that among the five ROIs in the frontoparietal network, two of them, namely the left IFG and left IPS, the difference in correlation was significant (one-tailed z test; left IFG: 𝑧 = 1.8355, 𝑝 = 0.0332; left IPS: 𝑧 = 2.3782, 𝑝 = 0.0087). For the remaining three ROIs, the difference in correlation was not significant (dmPFC: 𝑧 = 0.7594, 𝑝 = 0.2238; right IFG: 𝑧 = 0.9068, 𝑝 = 0.1822; right IPS: 𝑧 = 1.3764, 𝑝 = 0.0843). We chose one-tailed test because we already know the correlation under change-consistent signals was significantly greater than 0. In the nonparametric test, we performed nonparametric bootstrapping to test for the difference in correlation. We referred to the correlation between neural and behavioral sensitivity at change-consistent (blue) signals as 𝑟<sub>𝑏𝑙𝑢𝑒</sub>, and that at change-inconsistent (red) signals as 𝑟<sub>𝑟𝑒𝑑</sub>. Consistent with the parametric tests, we also found that the difference in correlation was significant in left IFG and left IPS (left IFG: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.46, 𝑝 = 0.0496; left IPS: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.5306, 𝑝 = 0.0041), but was not significant in dmPFC, right IFG, and right IPS (dmPFC: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.1634, 𝑝 = 0.1919; right IFG: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.2123, 𝑝 = 0.1681; right IPS: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.3434, 𝑝 = 0.0631). In summary, we found that neural sensitivity to signal diagnosticity measured at change-consistent signals significantly correlated with individual subjects’ behavioral sensitivity to signal diagnosticity. By contrast, neural sensitivity to signal diagnosticity measured at change-inconsistent signals did not significantly correlate with behavioral sensitivity. The difference in correlation, however, was statistically significant in some (left IPS and left IFG) but not all brain regions within the frontoparietal network.”

      In the Methods section, we added on p.53:

      “Parametric and nonparametric tests for difference in correlation coefficients. We implemented both parametric and nonparametric tests to examine whether the difference in Pearson correlation coefficients was significant. In the parametric test, we used the Fisher 𝑧 transformation to transform the difference in correlation coefficients to the 𝑧 statistic. That is, for two correlation coefficients, 𝑟<sub>1</sub> (with sample size 𝑛<sub>1</sub>) and 𝑟<sub>2</sub> (with sample size 𝑛<sub>2</sub>), the 𝑧 statistic of the difference in correlation is given by

      z = \frac{z_{r_1} - z_{r_2}}{\sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}}}

      We referred to the correlation between neural and behavioral sensitivity at change-consistent (blue balls) signals as 𝑟<sub>𝑏𝑙𝑢𝑒</sub>, and that at change-inconsistent (red balls) signals as 𝑟<sub>𝑟𝑒𝑑</sub>. For the Fisher 𝑧 transformation, 𝑟<sub>1</sub> = 𝑟<sub>𝑏𝑙𝑢𝑒</sub> and 𝑟<sub>2</sub> = 𝑟<sub>𝑟𝑒𝑑</sub>. In the nonparametric test, we performed nonparametric bootstrapping to test for the difference in correlation (Efron & Tibshirani, 1994). That is, we resampled with replacement the dataset (subject-wise) and used the resampled dataset to compute the difference in correlation. We then repeated the above for 100,000 times so as to estimate the distribution of the difference in correlation coefficients, tested for significance and estimated p-value based on this distribution.”

      Another potential concern is that some important details about the parameter estimation for the system-neglect model are missing. In the respective section in the methods, the authors mention a nonlinear regression using Matlab's "fitnlm" function, but it remains unclear how the model was parameterized exactly. In particular, what are the properties of this nonlinear function, and what are the assumptions about the subject's motor noise? I could imagine that by using the built-in function, the assumption was that residuals are Gaussian and homoscedastic, but it is possible that the assumption of homoscedasticity is violated, and residuals are systematically larger around p=0.5 compared to p=0 and p=1. Relatedly, in the parameter recovery analyses, the authors assume different levels of motor noise. Are these values representative of empirical values?

      We thank the reviewer for this excellent point. The reviewer touched on model parameterization, assumption of noise, and parameter recovery analysis. We answered these questions point-by-point below.

      On how our model was parameterized

      We parameterized the model according to the system-neglect model in Eq. (2) and estimated the alpha parameter separately for each level of transition probability and the beta parameter separately for each level of signal diagnosticity. As a result, we had a total of 6 parameters (3 alpha and 3 beta parameters) in the model. The system-neglect model is then called by fitnlm so that these parameters can be estimated. The term ‘nonlinear’ regression in fitnlm refers to the fact that you can specify any model (in our case the system-neglect model) and estimate its parameters when calling this function. In our use of fitnlm, we assume that the noise is Gaussian and homoscedastic (the default option).

      On the assumptions about subject’s motor noise

      We actually never called the noise ‘motor’ because it can be estimation noise as well. In the context of fitnlm, we assume that the noise is Gaussian and homoscedastic.

      On the possibility that homoscedasticity is violated

      We take the reviewer’s point. In response, we separately estimated the residual standard deviation at different probability intervals ([0.0–0.2), [0.2–0.4), [0.4–0.6), [0.6–0.8), and [0.8–1.0]). The result is shown in the figure below. The black data points are the average residual standard deviation (across subjects) and the error bars are the standard error of the mean. The residual standard deviation is indeed heteroscedastic: it is smallest at a probability of 0.1, increases as probability increases, and reaches an asymptote at 0.5 (Fig. S4).
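
      A check of this kind can be sketched by grouping residuals into the five probability intervals and computing the standard deviation within each. Whether the authors binned by the model-predicted probability or by the reported probability is not stated here, so binning by the predicted value is an assumption, as are the function name and the toy data.

```python
import bisect
import statistics


def residual_sd_by_bin(predicted, residuals,
                       edges=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Sample standard deviation of residuals within each probability interval.

    Intervals are half-open, [lo, hi), except the last, which includes 1.0.
    Returns None for intervals with fewer than two observations.
    """
    n_bins = len(edges) - 1
    bins = [[] for _ in range(n_bins)]
    for p, e in zip(predicted, residuals):
        # Index of the interval whose lower edge is the largest one <= p.
        k = min(bisect.bisect_right(edges, p) - 1, n_bins - 1)
        bins[k].append(e)
    return [statistics.stdev(b) if len(b) > 1 else None for b in bins]


# Hypothetical predicted probabilities and residuals.
predicted = [0.1, 0.1, 0.5, 0.5, 0.5]
residuals = [0.01, -0.01, 0.2, -0.2, 0.0]
print(residual_sd_by_bin(predicted, residuals))
```

      Comparing interval edges directly, rather than dividing by the bin width, avoids floating-point surprises at boundaries such as 0.6.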

      To examine how this would affect model fitting (parameter estimation), we performed a parameter recovery analysis based on these empirically estimated, probability-dependent residual standard deviations. That is, we simulated subjects’ probability estimates using the system-neglect model, added heteroscedastic noise according to the empirical values, and then estimated the parameters of the system-neglect model. The recovered parameter estimates did not seem to be affected by the heteroscedasticity of the noise. The parameter recovery results were identical to those obtained when homoscedasticity was assumed. This suggested that although homoscedasticity was violated, it did not affect the accuracy of the parameter estimates (Fig. S4).

      We added a section ‘Impact of noise homoscedasticity on parameter estimation’ in Methods section (p.47-48) and a figure in the supplement (Fig. S4) to describe this:

      On whether the noise levels in parameter recovery analysis are representative of empirical values

      To address the reviewer’s question, we conducted a new analysis using maximum likelihood estimation to simultaneously estimate the system-neglect model and the noise level of each individual subject. To estimate each subject’s noise level, we incorporated a noise parameter into the system-neglect model. We assumed that probability estimates are noisy and modeled them with a Gaussian distribution where the noise parameter (𝜎<sub>𝑛𝑜𝑖𝑠𝑒</sub>) is the standard deviation. At each period, a probability estimate of regime shift was computed according to the system-neglect model where Θ is the set of parameters including parameters in the system-neglect model and the noise parameter. The likelihood function, 𝐿(Θ), is the probability of observing the subject’s actual probability estimate at period 𝑡, 𝑝<sub>𝑡</sub>, given Θ, 𝐿(Θ) = 𝑃(𝑝<sub>𝑡</sub>|Θ). Since we modeled the noisy probability estimates with a Gaussian distribution, we can therefore express 𝐿(Θ) as 𝐿(Θ) ~ 𝑁(𝑝<sub>𝑡</sub>; 𝑝<sub>𝑡</sub><sup>𝑆𝑁</sup>, 𝜎<sub>𝑛𝑜𝑖𝑠𝑒</sub>) where 𝑝<sub>𝑡</sub><sup>𝑆𝑁</sup> is the probability estimate predicted by the system-neglect (SN) model at period 𝑡. As a reminder, we referred to a ‘period’ as the time when a new signal appeared during a trial (for a given transition probability and signal diagnosticity). To find the maximum likelihood estimate Θ<sub>MLE</sub>, we summed the negative natural logarithm of the likelihood over all periods and used MATLAB’s fmincon function to minimize it. Across subjects, we found that the mean noise estimate was 0.1735 and ranged from 0.1118 to 0.2704 (Supplementary Figure S3).
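
      The Gaussian likelihood described above can be sketched as follows. This is a simplified illustration rather than the authors' joint fit: the model predictions are held fixed, in which case the maximum likelihood noise estimate has a closed form (the root mean squared residual), whereas the authors estimated the model parameters and the noise parameter together with a numerical optimizer. Function names and data are hypothetical.

```python
import math


def gaussian_nll(observed, predicted, sigma):
    """Negative log-likelihood of the observed estimates under Gaussian noise.

    Each observed probability estimate is modeled as the model prediction
    plus zero-mean Gaussian noise with standard deviation sigma.
    """
    n = len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return n * math.log(sigma) + 0.5 * n * math.log(2.0 * math.pi) + sse / (2.0 * sigma ** 2)


def sigma_mle(observed, predicted):
    """Closed-form maximum likelihood estimate of sigma for fixed predictions."""
    n = len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(sse / n)


# Hypothetical reported and model-predicted probabilities for one subject.
observed = [0.10, 0.35, 0.20, 0.60, 0.80]
predicted = [0.05, 0.25, 0.30, 0.55, 0.90]
sigma_hat = sigma_mle(observed, predicted)
print(f"sigma = {sigma_hat:.4f}, NLL = {gaussian_nll(observed, predicted, sigma_hat):.4f}")
```

      In a joint fit, the summed negative log-likelihood would be minimized over the model parameters and sigma together, which is the role played by the numerical optimizer in the authors' analysis.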

      In our original parameter recovery analysis, the maximum noise level was set at 0.1, but our data indicated that some subjects’ noise was larger than this value. Therefore, we expanded our parameter recovery analysis to include noise levels beyond 0.1, up to 0.3. The results are now updated in Supplementary Fig. S3.

      We updated the parameter recovery section (p. 47) in Methods:

      The main study is based on N=30 subjects, as are the two control studies. Since this work is about individual differences (in particular with respect to neural representations of noise and transition probabilities in the frontoparietal network and the vmPFC), I'm wondering how robust the results are. Is it likely that the results would replicate with a larger number of subjects? Can the two control studies be leveraged to address this concern to some extent?

      We can address the issue of robustness through looking at the effect size. In particular, with respect to individual differences in neural sensitivity of transition probability and signal diagnosticity, since the significant correlation coefficients between neural and behavioral sensitivity were between 0.4 and 0.58 for signal diagnosticity in frontoparietal network (Fig. 5C), and -0.38 and -0.37 for transition probability in vmPFC (Fig. 5D), the effect size of these correlation coefficients was considered medium to large (Cohen, 1992).

      It would be challenging to use the control studies to address the robustness concern. The two control studies did not allow us to examine individual differences, in particular with respect to neural selectivity of noise and transition probability, and therefore we think they are unlikely to help. Having said that, it is possible to look at neural selectivity of noise (signal diagnosticity) in the first control experiment, where subjects estimated the probability of the blue regime in a task where there was no regime change (transition probability was 0). However, the fact that there were no regime shifts changed the nature of the task. Instead of always starting in the red regime, as in the main experiment, in the first control experiment we randomly picked the regime to draw the signals from. This also changed the meaning and the dynamics of the signals (red and blue) that would appear. In the main experiment the blue signal is a signal consistent with change, but in the control experiment this is no longer the case. In the main experiment, the frequency of blue signals is contingent upon both noise and transition probability. In general, blue signals are less frequent than red signals because of small transition probabilities. But in the first control experiment, blue signals may not be less frequent because the regime was blue in half of the trials. Due to these differences, we do not see how analyzing the control experiments could help in establishing robustness because we do not have a good prediction as to whether and how the neural selectivity would be impacted by these differences.

      It seems that the authors have not counterbalanced the colors and that subjects always reported the probability of the blue regime. If so, I'm wondering why this was not counterbalanced.

      We are aware of the reviewer’s concern. The first reason we did not counterbalance the colors or the reported regime (blue/red) was to avoid confusing the subjects in an already complicated task. Balancing these two variables also comes at the cost of sample size, which was the second reason. Although we could have balanced them at the between-subject level so as not to add to task complexity, doing so could have introduced another confound, namely individual differences in how people respond to these variables. This was the third reason we were hesitant to counterbalance.

      Reviewer #2 (Public review):

      Summary:

      This paper focuses on understanding the behavioral and neural basis of regime shift detection, a common yet hard problem that people encounter in an uncertain world.

      Using a regime-shift task, the authors examined cognitive factors influencing belief updates by manipulating signal diagnosticity and environmental volatility. Behaviorally, they have found that people demonstrate both over and under-reaction to changes given different combinations of task parameters, which can be explained by a unified system-neglect account. Neurally, the authors have found that the vmPFC-striatum network represents current belief as well as belief revision unique to the regime detection task. Meanwhile, the frontoparietal network represents cognitive factors influencing regime detection i.e., the strength of the evidence in support of the regime shift and the intertemporal belief probability. The authors further link behavioral signatures of system neglect with neural signals and have found dissociable patterns, with the frontoparietal network representing sensitivity to signal diagnosticity when the observation is consistent with regime shift and vmPFC representing environmental volatility, respectively. Together, these results shed light on the neural basis of regime shift detection especially the neural correlates of bias in belief update that can be observed behaviorally.

      Strengths:

      (1) The regime-shift detection task offers a solid ground to examine regime-shift detection without the potential confounding impact of learning and reward. Relatedly, the system-neglect modeling framework provides a unified account for both over or under-reacting to environmental changes, allowing researchers to extract a single parameter reflecting people's sensitivity to changes in decision variables and making it desirable for neuroimaging analysis to locate corresponding neural signals.

      Thank you for recognizing our task design and our system-neglect computational framework in understanding change detection.

      (2) The analysis for locating brain regions related to belief revision is solid. Within the current task, the authors look for brain regions whose activation covary with both current belief and belief change. Furthermore, the authors have ruled out the possibility of representing mere current belief or motor signal by comparing the current study results with two other studies. This set of analyses is very convincing.

      Thank you for recognizing our control studies in ruling out potential motor confounds in our neural findings on belief revision.

      (3) The section on using neuroimaging findings (i.e., the frontoparietal network is sensitive to evidence that signals regime shift) to reveal nuances in behavioral data (i.e., belief revision is more sensitive to evidence consistent with change) is very intriguing. I like how the authors structure the flow of the results, offering this as an extra piece of behavioral findings instead of ad-hoc implanting that into the computational modeling.

      Thank you for appreciating how we showed that neural insights can lead to new behavioral findings.

      Weaknesses:

      (1) The authors have presented two sets of neuroimaging results, and it is unclear to me how to reason between these two sets of results, especially for the frontoparietal network. On one hand, the frontoparietal network represents belief revision but not variables influencing belief revision (i.e., signal diagnosticity and environmental volatility). On the other hand, when it comes to understanding individual differences in regime detection, the frontoparietal network is associated with sensitivity to change and consistent evidence strength. I understand that belief revision correlates with sensitivity to signals, but it can probably benefit from formally discussing and connecting these two sets of results in discussion. Relatedly, the whole section on behavioral vs. neural slope results was not sufficiently discussed and connected to the existing literature in the discussion section. For example, the authors could provide more context to reason through the finding that striatum (but not vmPFC) is not sensitive to volatility.

      We thank the reviewer for the valuable suggestions.

      With regard to the first comment, we wish to clarify that we did not find the frontoparietal network to represent belief revision. It was the vmPFC and ventral striatum that we found to represent belief revision (delta Pt in Fig. 3). For the frontoparietal network, we identified its involvement in our task through finding that its activity correlated with strength of change evidence (Fig. 4) and individual subjects’ sensitivity to signal diagnosticity (Fig. 5). Conceptually, these two findings reflect how individuals interpret the signals (signals consistent or inconsistent with change) in light of signal diagnosticity. This is because (1) strength of change evidence is defined as signals (+1 for signal consistent with change, and -1 for signal inconsistent with change) multiplied by signal diagnosticity and (2) sensitivity to signal diagnosticity reflects how individuals subjectively evaluate signal diagnosticity. At the theoretical level, these two findings can be interpreted through our computational framework in that both the strength of change evidence and sensitivity to signal diagnosticity contribute to estimating the likelihood of change (Eqs. 1 and 2). We added a paragraph in the Discussion to address this.

      We added on p. 36:

      “For the frontoparietal network, we identified its involvement in our task through finding that its activity correlated with strength of change evidence (Fig. 4) and individual subjects’ sensitivity to signal diagnosticity (Fig. 5). Conceptually, these two findings reflect how individuals interpret the signals (signals consistent or inconsistent with change) in light of signal diagnosticity. This is because (1) strength of change evidence is defined as signals (+1 for signal consistent with change, and −1 for signal inconsistent with change) multiplied by signal diagnosticity and (2) sensitivity to signal diagnosticity reflects how individuals subjectively evaluate signal diagnosticity. At the theoretical level, these two findings can be interpreted through our computational framework in that both the strength of change evidence and sensitivity to signal diagnosticity contribute to estimating the likelihood of change (Equations 1 and 2 in Methods).”
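      The definition of strength of change evidence quoted above (signal sign multiplied by signal diagnosticity) can be sketched as follows. This is a minimal illustration only: the function name, the string coding of signals, and the diagnosticity value used in the example are assumptions, and the blue signal is taken as change-consistent as in the main experiment.

```python
import numpy as np


def change_evidence_strength(signals, diagnosticity):
    """Signed evidence per period: +d for change-consistent (blue) signals,
    -d for change-inconsistent (red) signals, where d is signal diagnosticity."""
    signs = np.where(np.asarray(signals) == "blue", 1.0, -1.0)
    return signs * diagnosticity


# Hypothetical sequence of signals and a hypothetical diagnosticity level.
example_signals = ["red", "red", "blue", "red", "blue"]
example_strength = change_evidence_strength(example_signals, diagnosticity=1.5)
print(example_strength)
```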

      With regard to the second comment, we added a discussion on the behavioral and neural slope comparison. We pointed out previous papers conducting similar analyses (Vilares et al., 2012; Ting et al., 2015; Yang & Wu, 2020), their findings, and how they relate to our results. Vilares et al. found that sensitivity to prior information (uncertainty in prior distribution) in the orbitofrontal cortex (OFC) and putamen correlated with a behavioral measure of sensitivity to the prior. In the current study, transition probability acts as the prior in the system-neglect framework (Eq. 1) and we found that the ventromedial prefrontal cortex represents subjects’ sensitivity to transition probability. Together, these results suggest that OFC (with vmPFC being part of OFC, see Wallis, 2011) is involved in the subjective evaluation of prior information in both static (Vilares et al., 2012) and dynamic environments (current study).

      We added on p. 37-38:

      “In the current study, our psychometric-neurometric analysis focused on comparing behavioral sensitivity with neural sensitivity to the system parameters (transition probability and signal diagnosticity). We measured sensitivity by estimating the slope of behavioral data (behavioral slope) and neural data (neural slope) in response to the system parameters. Previous studies had adopted a similar approach (Ting et al., 2015a; Vilares et al., 2012; Yang & Wu, 2020). For example, Vilares et al. (2012) found that sensitivity to prior information (uncertainty in prior distribution) in the orbitofrontal cortex (OFC) and putamen correlated with behavioral measure of sensitivity to the prior.

      In the current study, transition probability acts as prior in the system-neglect framework (Eq. 2 in Methods) and we found that ventromedial prefrontal cortex represents subjects’ sensitivity to transition probability. Together, these results suggest that OFC (with vmPFC being part of OFC, see Wallis, 2011) is involved in the subjective evaluation of prior information in both static (Vilares et al., 2012) and dynamic environments (current study). In addition, distinct from vmPFC in representing sensitivity to transition probability or prior, we found through the behavioral-neural slope comparison that the frontoparietal network represents how sensitive individual decision makers are to the diagnosticity of signals in revealing the true state (regime) of the environment.”
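      The behavioral-neural slope comparison described above can be illustrated with a minimal sketch: estimate one slope per subject for the behavioral measure and one for the neural measure across the levels of a system parameter, then correlate the two sets of slopes across subjects. All data below are simulated, and the parameter levels, sample size, and noise levels are assumptions chosen only for illustration; this is not the analysis code used in the study.

```python
import numpy as np


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    return np.polyfit(x, y, deg=1)[0]


rng = np.random.default_rng(0)
levels = np.array([1.0, 2.0, 3.0])   # hypothetical system-parameter levels
n_subjects = 30                      # hypothetical sample size
true_sensitivity = rng.normal(1.0, 0.3, n_subjects)

# One slope per subject for the behavioral and the neural measure.
behavioral_slopes = np.array(
    [slope(levels, s * levels + rng.normal(0, 0.1, levels.size))
     for s in true_sensitivity]
)
neural_slopes = np.array(
    [slope(levels, 0.5 * s * levels + rng.normal(0, 0.1, levels.size))
     for s in true_sensitivity]
)

# Across-subject Pearson correlation between the two sets of slopes.
r = np.corrcoef(behavioral_slopes, neural_slopes)[0, 1]
print(f"behavioral-neural slope correlation: r = {r:.2f}")
```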

      (2) More details are needed for behavioral modeling under the system-neglect framework, particularly results on model comparison. I understand that this model has been validated in previous publications, but it is unclear to me whether it provides a superior model fit in the current dataset compared to other models (e.g., a model without \alpha or \beta). Relatedly, I wonder whether the final result section can be incorporated into modeling as well - i.e., the authors could test a variant of the model with two \betas depending on whether the observation is consistent with a regime shift and conduct model comparison.

      Thank you for the great suggestion. We rewrote the final Results section to focus specifically on model comparison. Following the reviewer’s suggestion, we estimated separate beta parameters for change-consistent and change-inconsistent signals and found that these models were indeed better than the original system-neglect model.

      To incorporate these new findings, we rewrote the entire final Results section, “Incorporating signal dependency into system-neglect model led to better models for regime-shift detection” (p. 28-30).

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Use line numbers for the next round of reviews.

      We added line numbers in the revised manuscript.

      (2) Figure 2b: Can the empirical results be reproduced by the system-neglect model? This would complement the analyses presented in Figure S4.

      Yes. We have now added Figure S6, which is based on system-neglect model fits. For each subject, we first computed period-by-period probability estimates based on the parameter estimates of the system-neglect model. Second, we computed the index of overreaction (IO) for each combination of transition probability and signal diagnosticity. Third, we plotted the IO as we did for the empirical results in Fig. 2b. The empirical results in Fig. 2b are similar to the system-neglect model results shown in Figure S6, indicating that the empirical results can be reproduced by the model.

      (3) Page 14: Instead of referring to the "Methods" in general, you could be more specific about where the relevant information can be found.

      Fixed. We changed “See Methods” to “See System-neglect model in Methods”.

      (4) Page 18: Consider avoiding the term "more significantly". Consider effect sizes if interested in comparing effects to each other.

      Fixed. On page 19, we changed that to

      “In the second analysis, we found that for both vmPFC and ventral striatum, the regression coefficient of 𝑃<sub>𝑡</sub> was significantly different between Experiment 1 and Experiment 2 (Fig. 3C) and between Experiment 1 and Experiment 3 (Fig. 3D; also see Tables S5 and S6 in SI).”

      (5) Page 30: Cite key studies using reversal-learning paradigms. Currently, readers less familiar with the literature might have difficulties with this.

      We now cite key studies using reversal-learning paradigms on p.32:

      “Our work is closely related to the reversal-learning paradigm—the standard paradigm in neuroscience and psychology to study change detection (Fellows & Farah, 2003; Izquierdo et al., 2017; O'Doherty et al., 2001; Schoenbaum et al., 2000; Walton et al., 2010). In a typical reversal-learning task, human or animal subjects choose between two options that differ in the reward magnitude or probability of receiving a reward. Through reward feedback the participants gradually learn the reward contingencies associated with the options and have to update knowledge about reward contingencies when contingencies are switched in order to maximize rewards.”

      Reviewer #2 (Recommendations for the authors):

      (1) Some literature on change detection seems missing. For example, the author should also cite Muller, T. H., Mars, R. B., Behrens, T. E., & O'Reilly, J. X. (2019). Control of entropy in neural models of environmental state. elife, 8, e39404. This paper suggests that medial PFC is correlated with the entropy of the current state, which is closely related to regime change and environmental volatility.

      Thank you for pointing to this paper. We have now added it and other related papers in the Introduction and Discussion.

      In Introduction, we added on p.5-6:

      “Different behavioral paradigms, most notably reversal learning, and computational models were developed to investigate its neurocomputational substrates (Behrens et al., 2007; Izquierdo et al., 2017; Payzan-LeNestour et al., 2011, 2013; Nasser et al., 2010; McGuire et al., 2014; Muller et al., 2019). Key findings on the neural implementations for such learning include identifying brain areas and networks that track volatility in the environment (rate of change) (Behrens et al., 2007), the uncertainty or entropy of the current state of the environment (Muller et al., 2019), participants’ beliefs about change (Payzan-LeNestour et al., 2011; McGuire et al., 2014; Kao et al., 2020), and their uncertainty about whether a change had occurred (McGuire et al., 2014; Kao et al., 2020).”

      In Discussion (p.35), we added a new paragraph:

      “Related to OFC function in decision making and reinforcement learning, Wilson et al. (2014) proposed that OFC is involved in inferring the current state of the environment. For example, medial OFC had been shown to represent probability distribution on possible states of the environment (Chan et al., 2016), the current task state (Schuck et al., 2016) and uncertainty or entropy associated with the state of the environment (Muller et al., 2019). In the context of regime-shift detection, regimes can be regarded as states of the environment and therefore a change in regime indicates a change in the state of the environment. Muller et al. (2019) found that in dynamic environments where changes in the state of the environment happen regularly, medial OFC represented the level of uncertainty in the current state of the environment. Our finding that vmPFC represented individual participants’ probability estimates of regime shifts suggests that vmPFC and/or OFC are involved in inferring the current state of the environment through estimating whether the state has changed. Our finding that vmPFC represented individual participants’ sensitivity to transition probability further suggests that vmPFC and/or OFC contribute to individual participants’ biases in state inference (over- and underreactions to change) in how these brain areas respond to the volatility of the environment.”

      (2) The language used when describing the selective relationship between frontoparietal network activation and change-consistent signal can be clearer. When describing separating those two signals, the authors refer to them as when the 'blue' signal shows up and when the 'red' signal shows up, assuming that the current belief state is blue. This is a little confusing because it is hard to keep in mind what the default color is in this example. It would be more intuitive if the authors used language such as the 'change-consistent' signal.

      Thank you for the suggestion. We have changed the wording according to your suggestion. That is, we say ‘change-consistent (blue) signals’ and ‘change-inconsistent (red) signals’ throughout pages 22-28.

      (3) Figure 4B highlights dmPFC. However, in the associated text, it says p = .10 so it is not significant. To avoid misleading readers, I would recommend pointing this out explicitly beyond saying 'most brain regions in the frontoparietal network also correlated with the intertemporal prior'.

      Thank you for pointing this out. We now say on p.20

      “With independent (leave-one-subject-out, LOSO) ROI analysis, we examined whether brain regions in the frontoparietal network (shown to represent strength of change evidence) correlated with intertemporal prior and found that all brain regions, with the exception of dmPFC, in the frontoparietal network correlated with the intertemporal prior.”
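      The logic of an independent leave-one-subject-out (LOSO) ROI analysis can be sketched as below: for each subject, the ROI is defined from the group statistic of all other subjects on one contrast, and the left-out subject's data on a second contrast are then averaged within that ROI. This is a schematic under stated assumptions, not the pipeline used in the study: the array layout (subjects by voxels), the one-sample t statistic, the threshold, and the simulated data are all illustrative choices.

```python
import numpy as np


def loso_roi_means(select_maps, test_maps, threshold):
    """Leave-one-subject-out ROI means.

    select_maps, test_maps: arrays of shape (n_subjects, n_voxels) holding
    per-subject effect maps for the ROI-defining and the tested contrast.
    Returns one ROI-averaged test value per subject (NaN if the ROI is empty).
    """
    n_subjects = select_maps.shape[0]
    roi_means = np.full(n_subjects, np.nan)
    for i in range(n_subjects):
        others = np.delete(select_maps, i, axis=0)  # exclude subject i
        # One-sample t statistic across the remaining subjects, per voxel.
        t = others.mean(axis=0) / (others.std(axis=0, ddof=1) / np.sqrt(n_subjects - 1))
        roi = t > threshold
        if roi.any():
            roi_means[i] = test_maps[i, roi].mean()
    return roi_means


# Simulated example: 20 subjects, 50 voxels, first 10 voxels carry an effect.
rng = np.random.default_rng(1)
selection = rng.normal(0.0, 1.0, (20, 50))
selection[:, :10] += 2.0
tested = rng.normal(0.0, 1.0, (20, 50))
roi_values = loso_roi_means(selection, tested, threshold=3.0)
```

      Because the left-out subject never contributes to the ROI definition, the ROI-averaged values can be tested at the group level without circularity.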

      (4) There is a full paragraph in the discussion talking about the central opercular cortex, but this terminology has not shown up in the main body of the paper. If this is an important brain region to the authors, I would recommend mentioning it more often in the result section.

      Thank you for this suggestion. We have now added central opercular cortex in the Results section (p.18):

      “For 𝑃<sub>𝑡</sub>, we found that the ventromedial prefrontal cortex (vmPFC) and ventral striatum correlated with this behavioral measure of subjects’ belief about change. In addition, many other brain regions, including the motor cortex, central opercular cortex, insula, occipital cortex, and the cerebellum also significantly correlated with 𝑃<sub>𝑡</sub>.”

      (5) The authors have claimed that people make more extreme estimates under high diagnosticity (Supplementary Figure 1). This is an interesting point because it seems to be different from what is shown in the main graph where it seems that people are not extreme enough compared to an ideal Bayesian observer. I understand that these are effects being investigated under different circumstances. It would be helpful if for Supplementary Figure 1 the authors could overlay, or generate a different figure showing what an ideal Bayesian observer would do in this situation.

      We thank the reviewer for pointing this out. We wish to clarify that when we said “more extreme estimates under high diagnosticity” we meant compared with low diagnosticity and not with the ideal Bayesian observer. We clarified this point by rephrasing our sentence on p.11:

      “We also found that subjects tended to give more extreme Pt under high signal diagnosticity than low diagnosticity (Fig. S1 in Supplementary Information, SI).”

      When it comes to comparing subjects’ probability estimates with the normative Bayesian, subjects tended to “underreact” under high diagnosticity. This can be seen in Fig. 4B, which shows a trend of increasing underreaction (or decreasing overreaction) as diagnosticity increased (row-wise comparison for a given transition probability).

      We see the reviewer’s point about overlaying the Bayesian estimates on Fig. S1 and have updated the figure by adding the normative Bayesian in orange.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Silbaugh, Koster, and Hansel investigated how the cerebellar climbing fiber (CF) signals influence neuronal activity and plasticity in mouse primary somatosensory (S1) cortex. They found that optogenetic activation of CFs in the cerebellum modulates responses of cortical neurons to whisker stimulation in a cell-type-specific manner and suppresses potentiation of layer 2/3 pyramidal neurons induced by repeated whisker stimulation. This suppression of plasticity by CF activation is mediated through modulation of VIP- and SST-positive interneurons. Using transsynaptic tracing and chemogenetic approaches, the authors identified a pathway from the cerebellum through the zona incerta and the thalamic posterior medial (POm) nucleus to the S1 cortex, which underlies this functional modulation.

      Strengths:

      This study employed a combination of modern neuroscientific techniques, including two-photon imaging, opto- and chemo-genetic approaches, and transsynaptic tracing. The experiments were thoroughly conducted, and the results were clearly and systematically described. The interplay between the cerebellum and other brain regions - and its functional implications - is one of the major topics in this field. This study provides solid evidence for an instructive role of the cerebellum in experience-dependent plasticity in the S1 cortex.

      Weaknesses:

      There may be some methodological limitations, and the physiological relevance of the CF-induced plasticity modulation in the S1 cortex remains unclear. In particular, it has not been elucidated how CF activity influences the firing patterns of downstream neurons along the pathway to the S1 cortex during stimulation.

      Our study addresses the important question of whether CF signaling can influence the activity and plasticity of neurons outside the olivocerebellar system, and further identifies the mechanism through which this indeed occurs. We provide a detailed description of the involvement of specific neuron subtypes and how they are modulated by climbing fiber activation to impact S1 plasticity. We also identify at least one critical pathway from the cerebellar output to the S1 circuit. It is indeed correct that we did not investigate how the specific firing patterns of all of these downstream neurons are affected, or the natural behaviors in which this mechanism is involved. Now that it is established that CF signaling can impact activity and plasticity outside the olivocerebellar system -- and even in the primary somatosensory cortex -- these questions will be important to further investigate in future studies.

      (1) Optogenetic stimulation may have activated a large population of CFs synchronously, potentially leading to strong suppression followed by massive activation in numerous cerebellar nuclear (CN) neurons. Given that there is no quantitative estimation of the stimulated area or number of activated CFs, observed effects are difficult to interpret directly. The authors should at least provide the basic stimulation parameters (coordinates of stim location, power density, spot size, estimated number of Purkinje cells included, etc.).

      As discussed in the paper, we indeed expect that synchronous CF activation is needed to allow for an effect on S1 circuits under natural or optogenetic activation conditions. The basic optogenetic stimulation parameters (also stated in the methods) are as follows: 470 nm LED; Ø200 µm core, 0.39 NA rotary joint patch cable; absolute power output of 2.5 mW; spot size at the surface of the cortex 0.6 mm; estimated power density 8 mW/mm². A reliable estimate of the number of activated Purkinje cells is difficult to provide, in particular because ‘activation’ here refers to climbing fiber inputs, not to Purkinje cells directly.
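      As a back-of-the-envelope check of the stated power density, the sketch below divides the absolute power output by the area of the illuminated spot. It assumes that the 0.6 mm spot size is a diameter and that illumination is uniform over a circular spot with no losses, which are simplifying assumptions; the function name is illustrative.

```python
import math


def power_density_mw_per_mm2(power_mw, spot_diameter_mm):
    """Mean irradiance over a uniformly illuminated circular spot."""
    area_mm2 = math.pi * (spot_diameter_mm / 2.0) ** 2
    return power_mw / area_mm2


density = power_density_mw_per_mm2(power_mw=2.5, spot_diameter_mm=0.6)
print(f"{density:.2f} mW/mm^2")
```

      Under these assumptions the value comes out slightly below 9 mW/mm², in the same range as the estimate of 8 mW/mm² given above.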

      (2) There are CF collaterals directly innervating CN (PMID:10982464). Therefore, antidromic spikes induced by optogenetic stimulation may directly activate CN neurons. On the other hand, a previous study reported that CN neurons exhibit only weak responses to CF collateral inputs (PMID: 27047344). The authors should discuss these possibilities and the potential influence of CF collaterals on the interpretation of the results.

      A direct activation of CN neurons by antidromic spikes in CF collaterals cannot be ruled out. However, we believe that this effect will not be substantial. The activation of the multi-synaptic pathway that we describe in this study is more likely to require a strong nudge as resulting from synchronized Purkinje cell input and subsequent rebound activation in CN neurons (PMID: 22198670), rather than small-amplitude input provided by CF collaterals (PMID: 27047344). A requirement for CF/PC synchronization would also set a threshold for activation of this suppressive pathway.

      (3) The rationale behind the plasticity induction protocol for RWS+CF (50 ms light pulses at 1 Hz during 5 min of RWS, with a 45 ms delay relative to the onset of whisker stimulation) is unclear.

      a) The authors state that 1 Hz was chosen to match the spontaneous CF firing rate (line 107); however, they also introduced a delay to mimic the CF response to whisker stimulation (line 108). This is confusing, and requires further clarification, specifically, whether the protocol was designed to reproduce spontaneous or sensory-evoked CF activity.

      This protocol was designed to mimic sensory-evoked CF activity as reported in Bosman et al (J. Physiol. 588, 2010; PMID: 20724365).

      b) Was the timing of delivering light pulses constant or random? Given the stochastic nature of CF firing, randomly timed light pulses with an average rate of 1Hz would be more physiologically relevant. At the very least, the authors should provide a clear explanation of how the stimulation timing was implemented.

      Light pulses were delivered at a constant 1 Hz. Our goal was to isolate synchrony as the variable distinguishing sensory-evoked from spontaneous CF activity; additionally varying stochasticity, rate, or amplitude would have confounded this. Future studies could explore how these additional parameters shape S1 responses.
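      The timing of the RWS+CF protocol described above (50 ms light pulses at a constant 1 Hz for 5 min, each delayed by 45 ms) can be written out as a simple schedule. The sketch assumes that each 1 s cycle starts at a whisker-stimulus onset taken as the reference time; the function and parameter names are illustrative and do not come from the acquisition software.

```python
def light_pulse_schedule(duration_s=300.0, rate_hz=1.0, delay_s=0.045, width_s=0.050):
    """Return (onset, offset) times in seconds for constant-rate light pulses,
    each delayed by `delay_s` relative to the start of its cycle."""
    period_s = 1.0 / rate_hz
    n_pulses = int(duration_s * rate_hz)
    return [
        (k * period_s + delay_s, k * period_s + delay_s + width_s)
        for k in range(n_pulses)
    ]


schedule = light_pulse_schedule()
print(len(schedule), schedule[0], schedule[-1])
```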

      (4) CF activation modulates inhibitory interneurons in the S1 cortex (Figure 2): responses of interneurons in S1 to whisker stimulation were enhanced upon CF coactivation (Figure 2C), and these neurons were predominantly SST- and PV-positive interneurons (Figure 2H, I). In contrast, VIP-positive neurons were suppressed only in the late time window of 650-850 ms (Figure 2G). If the authors' hypothesis (that the activity of VIP neurons regulates SST- and PV-neuron activity during RWS+CF) is correct, then the activity of SST- and PV-neurons should also be increased during this late time window. The authors should clarify whether such temporal dynamics were observed or could be inferred from their data.

      Yes, we see a significant activity increase in PV neurons in this late time window (see updates to Data S2). Activity was also increased in SST neurons, though this did not reach statistical significance (Data S2). One reason might be that – given the small effect size overall – such an effect would only be seen in paired recordings. Chemogenetic activity modulation in VIP neurons, which provides a more crude test, shows, however, that SST- and PV-positive interneurons are indeed regulated via inhibition from VIP-positive interneurons (Fig. 5).

      (5) Transsynaptic tracing from CN nicely identified zona incerta (ZI) neurons and their axon terminals in both POm and S1 (Figure 6 and Figure S7).

      a) Which part of the CN (medial, interposed, or lateral) is involved in this pathway is unclear.

      We used a dual-injection transsynaptic tracing approach to specifically label the outputs of ZI neurons that receive input from the deep cerebellar nuclei. The anterograde viral vector injected into the CN is unlabeled (no fluorophore), and it is therefore not possible to reliably assess the extent of viral spread in those experiments as performed. However, we have previously performed similar injections into the deep cerebellar nuclei, and post hoc histology suggests that all three nuclei will have at least some viral expression (Koster and Sherman, 2024). Due to size and injection location, we will mostly have reached the lateral (dentate) nucleus, but we cannot exclude partial transsynaptic tracing from the interposed and medial nuclei.

      b) Were the electrophysiological properties of these ZI neurons consistent with those of PV neurons?

      Although most recorded cells demonstrated electrophysiological properties consistent with PV+ interneurons in other brain regions (i.e. fast spiking, narrow spike width, non-adapting; see Tremblay et al., 2016), interneuron subtypes in the ZI have been incompletely characterized, with SST+ cells showing similar features to those typically associated with PV+ cells (if interested, compare Fig. 4 in DOI: 10.1126/sciadv.abf6709 vs. Fig. S10 in https://doi.org/10.1016/j.neuron.2020.04.027). Therefore, we did not attempt to delineate cell identity based on these characteristics.

      c) There appears to be a considerable number of axons of these ZI neurons projecting to the S1 cortex (Figure S7C). Would it be possible to estimate the relative density of axons projecting to the POm versus those projecting to S1? In addition, the authors should discuss the potential functional role of this direct pathway from the ZI to the S1 cortex.

      An absolute quantification is difficult to provide based on the images that we obtained. However, any crude estimate would indicate that the relative density of projections to POm is higher than the density of projections to S1 (this is apparent from the images themselves). While the anatomical and functional connections from POm to S1 have been described in detail (Audette et al., 2018), this is not the case for the direct projections from ZI to S1. A direct ZI to S1 projection would potentially involve a different recruitment of neurons in the S1 circuit. Any discussion of the specific consequences of the activation of this direct pathway would be purely speculative.

      Reviewer #2 (Public review):

      Summary:

      The authors examined long-distance influence of climbing fiber (CF) signaling in the somatosensory cortex by manipulating whiskers through stimulation. Also, they examined CF signaling using two-photon imaging and mapped projections from the cerebellum to the somatosensory cortex using transsynaptic tracing. As a final manipulation, they used chemogenetics to perturb parvalbumin-positive neurons in the zona incerta and recorded from climbing fibers.

      Strengths:

      There are several strengths to this paper. The recordings were carefully performed, and AAVs used were selective and specific for the cell types and pathways being analyzed. In addition, the authors used multiple approaches that support climbing fiber pathways to distal regions of the brain. This work will impact the field and describes nice methods to target difficult-to-reach brain regions, such as the inferior olive.

      Weaknesses:

      There are some details in the methods that could be explained further. The discussion was very short and could connect the findings in a broader way.

      In the revised manuscript, we provide more methodological details, as requested. We kept the explanations in the discussion as simple as possible so as not to bias further investigations into this novel phenomenon. In particular, we avoid an extended discussion of the gating effect of CF activity on S1 plasticity. While this is the effect on plasticity specifically observed here, we believe that the consequences of CF signaling on S1 activity may entirely depend on the contexts in which CF signals are naturally recruited, the ongoing activity of other brain regions, and behavioral state. Our key finding is that such modulation of neocortical plasticity can occur. How CF signaling controls plasticity of the neocortex in all contexts remains unknown and needs to be thoughtfully tested in the future.

      Reviewer #3 (Public review):

      Summary:

      The authors developed an interesting novel paradigm to probe the effects of cerebellar climbing fiber activation on short-term adaptation of somatosensory neocortical activity during repetitive whisker stimulation. Normally, RWS potentiated whisker responses in pyramidal cells and weakly suppressed them in interneurons, lasting for at least 1 h. Crus II optogenetic climbing fiber activation during RWS reduced or inverted these adaptive changes. This effect was generally mimicked or blocked with chemogenetic SST or VIP activation/suppression as predicted based on their "sign" in the circuit.

      Strengths:

      The central finding about CF modulation of S1 response adaptation is interesting, important, and convincing, and provides a jumping-off point for the field to start to think carefully about cerebellar modulation of neocortical plasticity.

      Weaknesses:

      The SST and VIP results appeared slightly weaker statistically, but I do not personally think this detracts from the importance of the initial finding (if there are multiple underlying mechanisms, modulating one may reproduce only a fraction of the effect size). I found the suggestion that zona incerta may be responsible for the cerebellar effects on S1 to be a more speculative result (it is not so easy with existing technology to effectively modulate this type of polysynaptic pathway), but this may be an interesting topic for the authors to follow up on in more detail in the future.

      Our interpretation of the anatomical and physiological findings is that a pathway via the ZI is indeed critical for the observed effects. This pathway also represents perhaps the most direct pathway (i.e., the fewest synapses connecting the cerebellar nuclei to S1). However, several other direct and indirect pathways are plausible as well, and we expect distinct activation requirements and consequences for neurons in the S1 circuit. These are indeed interesting topics for future investigation.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Line 77: "CF transients" is not a standard or widely recognized term. Please use a more precise expression, such as "CF-induced calcium transients."

      We now avoid the term “CF transients” and have replaced it with “CF-induced calcium transients.”

      (2) Titer of AAVs injected should be provided.

      AAV titers have been included in an additional data table (Data S9).

      (3) Several citations to the figures are incorrect (for example, "Supplementary Data 2a (Line 398)" does not exist).

      We apologize for the mistakes in this version of the article. Incorrect citations to the figures have been corrected.

      (4) Line 627-628: "The tip of the patch cable was centered over Crus II in all optogenetic stimulation experiments." The stereotaxic coordinate of the tip position should be provided.

      The stereotaxic coordinate of the tip position has been provided in the methods.

      (5) Line 629: "Blue light pulses were delivered with a 470 nm Fiber-Coupled LED (Thorlabs catalog: M470F3)." The size of the light stim and estimated power density (W/mm^2) at the surface of the cortex should be provided.

      The spot size and estimated power density at the surface of the cortex have been provided in the methods.

      (6) Line 702-706: References for DCZ should be cited.

      We now cite Nagai et al., Nat. Neurosci. 23 (2020), as the original reference.

      (7) Two-photon image processing (Line 807-809): The rationale for normalizing ∆F/F traces to a pre-stimulus baseline is unclear because ∆F/F is, by definition, already normalized to baseline fluorescence: (Ft-F0)/F0. The authors should clarify why this additional normalization step was necessary and how it affected the interpretation of the data.

      A single baseline fluorescence value (F₀) was computed for each neuron across the entire recording session, which lasted ~120-minutes. However, some S1 neurons exhibit fluctuations in baseline fluorescence over time—often related to locomotive activity or spontaneous network oscillations—which can obscure stimulus-evoked changes. To isolate fluorescence changes specifically attributable to whisker stimulation, we normalized each ∆F/F trace to the prestimulus baseline for that trial. This additional normalization allowed us to quantify potentiation or depression of sensory responses themselves, independently of spontaneous oscillations or locomotion-related changes in the ongoing neural activity.
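
      The per-trial normalization described above can be illustrated with a minimal sketch. The response does not state whether the prestimulus baseline is subtracted or divided out, so this example assumes subtraction of the mean prestimulus ∆F/F; the function name, the number of prestimulus samples, and the example traces are hypothetical.

```python
from statistics import mean


def baseline_correct(trace, n_pre):
    """Subtract the mean of the first n_pre (prestimulus) samples from a dF/F trace.

    Assumption: the correction is subtractive; the source only says the trace
    was "normalized" to the prestimulus baseline of that trial.
    """
    if n_pre <= 0 or n_pre > len(trace):
        raise ValueError("n_pre must be between 1 and len(trace)")
    baseline = mean(trace[:n_pre])
    return [x - baseline for x in trace]


# Two hypothetical trials with the same evoked response (+0.5) riding on
# different slow offsets of the session-wide dF/F (0.1 versus 0.4).
trial_a = [0.1, 0.1, 0.1, 0.6, 0.6]
trial_b = [0.4, 0.4, 0.4, 0.9, 0.9]

corrected_a = baseline_correct(trial_a, n_pre=3)
corrected_b = baseline_correct(trial_b, n_pre=3)
```

      After correction, both trials report the same evoked change, although their session-wide ∆F/F values differ. This is the property the additional normalization step is meant to provide.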

      Reviewer #2 (Recommendations for the authors):

      (1) Did the climbing fiber stimulation for Figure 1 result in any changes to motor activity? Can you make any additional comments on other behaviors that were observed during these manipulations?

      Acute CF stimulation did not cause any changes in locomotive or whisking activity. The CF stimulation also did not influence the overall level of locomotion or whisking during plasticity induction.

      (2) Figure 3B and F- it is very difficult to see the SST+ neurons. Can this be enhanced?

      We linearly adjusted the brightness and contrast for the bottom images in Figure 3B and F to improve visualization of SST+ neurons. Note the expression of both hM3D(Gq) and hM4D(Gi) in SST+ neurons is sparse, which was necessary to avoid off-target effects.

      (3) Can you be more specific about the subregions of cerebellar nuclei and cell types that are targeted in the tracing studies? Discussions of the cerebellar nuclei subregions are missing and would be interesting, as others have shown discrete pathways between cerebellar nuclei subregions and long-distance projections.

      See our response to comment 5a from Reviewer 1 (copied again here): we used a dual-injection transsynaptic tracing approach to specifically label the outputs of ZI neurons that receive input from the deep cerebellar nuclei. The anterograde viral vector injected into the CN is unlabeled (no fluorophore) and therefore, it is not possible to reliably assess the extent of viral spread in those experiments as performed. However, we have previously performed similar injections into the deep cerebellar nuclei and post hoc histology suggests all three nuclei will have at least some viral expression (Koster and Sherman, 2024). Due to size and injection location, we will mostly have reached the lateral (dentate) nucleus, but cannot exclude partial transsynaptic tracing from the interposed and medial nuclei.

      It would indeed be interesting to further investigate the effect of CFs residing in different cerebellar lobules, which preferentially target different cerebellar nuclei, on targets of these nuclei.

      (4) Did you see any connection to the ventral tegmental area? Can you comment on whether dopamine pathways are influenced by CF and in your manipulations?

      We did not specifically look at these pathways and thus are not able to comment on this.

      (5) These are intensive surgeries, do you think glia could have influenced any results?

      This was not tested and seems unlikely, but we cannot exclude this possibility.

      (6) It is unclear in the methods how long animals were recorded for in each experiment. Can you add more detail?

      Additional detail was added to the methods. Recordings for all experimental configurations did not last more than 120 minutes in total. All data were analyzed across identical time windows for each experiment.

      (7) In the methods it was mentioned that recording length can differ between animals. Can this influence the results, and if so, how was that controlled for?

      There was a variance in recording length within experimental groups, but no systematic difference between groups.

      (8) I do not see any mention of animal sex throughout this manuscript. If animals were mixed groups, were sex differences considered? Would it be expected that CF activity would be different in male and female mice?

      As mentioned in the Methods (Animals), mice of either sex were used. No sex-dependent differences were observed.

      (9) Transsynaptic tracing results of the zona incerta are very interesting. The zona incerta is highly understudied, but has been linked to feeding, locomotion, arousal, and novelty seeking. Do you think this pathway would explain some of the behavioral results found through other studies of cerebellar lobule perturbations? Some discussion of how this brain region would be important as a cerebellar connection in animal behavior would be interesting.

      Since the multi-synaptic pathway from the cerebellum to S1 involves several brain regions with their own inputs and modulatory influences, it seems plausible to assume that behaviors controlled by these regions or affecting signaling pathways that regulate them would show some level of interaction. Our study does not address these interactions, but this will be an interesting question to be addressed in future work.

      Reviewer #3 (Recommendations for the authors):

      General comments on the data presentation:

      I'm not a huge fan of taking areas under curves ('AUC' throughout the study) when the integral of the quantity has no physical meaning - 'normalizing' the AUC (1I,L etc) is even stranger, because of course if you instead normalize the AUC by the # of data points, you literally just get the mean (which is probably what should be used instead).

      Indeed, AUC is equal to the average response in the time window used, multiplied by the window duration (thus, AUC is directly proportional to the mean). We chose to report AUC, a descriptive statistic, rather than the mean within this window. In 1I and L, we normalize the AUC across animals, essentially removing the variability across animals in the ‘Pre’ condition for visualization. Note that the significance of these comparisons is consistent whether or not we normalize to the ‘Pre’ condition (non-normalized RWS data in I shows a significant increase in PN activity, p = 0.0068, signrank test; non-normalized RWS+CF data in I shows a significant decrease in PN activity, p = 0.0135, paired t-test; non-normalized RWS data in L shows a significant decrease in IN activity, p < 0.001, paired t-test; non-normalized RWS+CF data in L shows no significant change in IN activity, p = 0.7789, paired t-test).
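
      The proportionality between AUC and the window mean can be checked with a short sketch. It assumes uniformly sampled traces and a simple rectangle-rule integral; the sample values and the sampling interval are hypothetical, and the integration rule actually used for the figures is not stated in the response.

```python
def auc_rectangle(samples, dt):
    """Area under a uniformly sampled trace (rectangle rule): sum of samples times dt."""
    return sum(samples) * dt


def window_mean(samples):
    """Plain arithmetic mean of the samples in the analysis window."""
    return sum(samples) / len(samples)


dt = 0.25                        # hypothetical sampling interval (s)
pre = [0.1, 0.2, 0.3, 0.2]       # hypothetical 'Pre' response samples
post = [0.2, 0.5, 0.9, 0.4]      # hypothetical 'Post' response samples
duration = len(post) * dt        # window duration (s)

auc_pre = auc_rectangle(pre, dt)
auc_post = auc_rectangle(post, dt)

# Normalizing 'Post' to 'Pre' gives the same ratio for AUC and for the mean,
# because the common factor (window duration) cancels.
ratio_auc = auc_post / auc_pre
ratio_mean = window_mean(post) / window_mean(pre)
```

      Under these assumptions, the AUC equals the window mean times the window duration, so normalized AUC and normalized mean coincide when the window length is fixed.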

      I think unadorned bar charts are generally excluded from most journals now. Consider replacing these with something that shows the raw datapoints if not too many, or the distribution across points.

      We have replaced bar charts with box plots and violin plots. We have avoided plotting individual data points due to the quantity of points.

      In various places, the statistics produce various questionable outcomes that will draw unwanted reader scrutiny. Many of the examples below involve tiny differences in means with overlapping error bars that are "significant" or a few cases of nonoverlapping error bars that are "not significant." I think replacing the bar charts may help to resolve things here if we can see the whole distribution or the raw data points. As importantly, I think a big problem is that the statistical tests all seem to be nonparametric (they are ambiguously described in Table S3 as "Wilcoxon," which should be clarified, since there is an unpaired Wilcoxon test [rank sum] and a paired Wilcoxon test [sign rank]), and thus based on differences in the *median* whereas the bar charts are based on the *mean* (and SEM rather than MAD or IQR or other median-appropriate measure of spread). This should be fixed (either change the test or change the plots), which will hopefully allay many of the items below.

      We thank the reviewer for this important point. As mentioned in the Statistics and quantification section, Wilcoxon signed rank tests were used for non-normal data. We have replaced the bar charts with box plots, which show the IQR and median and indeed allay many of the items below.

      Here are some specific points on the statistics presentation:

      (1) 1G, the test says that following RWS+CF, the decrease in PN response is not significant. In 1I, the same data, but now over time, shows a highly significant decrease. This probably means that either the first test should be reconsidered (was this a paired comparison, which would "build in" the normalization subsequently used automatically?) or the second test should be reconsidered. It's especially strange because the n value in G, if based on cells, would seem to be ~50-times higher than that in I if based on mice.

      In Figure 1G, the analysis tests whether individual pyramidal neurons significantly changed their responses before vs. after RWS+CF stimulation. This is a paired comparison at the single-cell level, and here indicates that the average per-neuron response did not reliably decrease after RWS+CF when comparing each cell’s pre- and post-values directly. In contrast, Figure 1I examines the same dataset analyzed across time bins using a two-way ANOVA, which tests for effects of time, group (RWS vs. RWS+CF), and their interaction. The analysis showed a significant group effect (p < 0.001), indicating that the overall level of activity across all time points differed between RWS and RWS+CF conditions. The difference in significance between these two analyses arises because the first test (Fig. 1G) assesses within-neuron changes (paired), whereas the second test (Fig. 1I) assesses overall population-level differences between groups over time (independent groups). Thus, the tests address related but distinct questions—one about per-cell response changes, the other about how activity differs across experimental conditions.

      (2) 1J RWS+CF then shows a much smaller difference with overlapping error bars than the ns difference with nonoverlapping errors in 1G, but J gets three asterisks (same n-values).

      Bar graphs have been replaced with box plots.

      (3) 1K, it is very unclear what under the asterisk could possibly be significant here, since the black and white dots overlap and trade places multiple times.

      See response to point 1. A significant group effect will exist if the aggregate difference across all time bins exceeds within-group variability. The asterisk therefore reflects a statistically significant main group effect (RWS versus RWS+CF) rather than differences at any single time point. Note, however, the very small effect size here.

      (4) 2B, 2G, 2H, 2I, 3G, 3H, 5C etc, again, significance with overlapping error bars, see suggestions above.

      Bar graphs have been replaced with box plots.

      (5) Time windows: e.g., L149-153 / 2B - this section reads weirdly. I think it would be less offputting to show a time-varying significance, if you want to make this point (there are various approaches to this floating around), or a decay rate, or something else.

      Here, we wanted to understand the overall direction of influence of CFs on VIP activity. We find that CFs exert a suppressive effect on VIP activity, which is statistically significant in this later time window. The specific effect of CF modulation on the activity of S1 neurons across multiple time points will be described in more detail in future investigations.

      (6) 4G, 6I, these asterisks again seem impossible (as currently presented).

      Bar graphs have been replaced with box plots.

      The writing is in generally ok shape, but needs tightening/clarifying:

      (1) L45 "mechanistic capacity" not clear.

      We have simplified this term to “capacity.” We use the term here to express that the central question we pose is whether CF signals are able to impact S1 circuits. We demonstrate CF signals indeed influence S1 circuits and further describe the mechanism through which this occurs, but we do not yet know all of the natural conditions in which this may occur. We feel that “capacity” describes the question we pose -- and our findings -- very well.

      (2) L48-58 there's a lot of material here, not clear how much is essential to the present study.

      We would like to give an overview of the literature on instructive CF signaling within the cerebellum. Here, we feel it is important to describe how CFs supervise learning in the cerebellum via coincident activation of parallel fiber inputs and CF inputs. Our results demonstrate CFs have the capacity to supervise learning in the neocortex in a similar manner, as coincident CF activation with sensory input modulates plasticity of S1 neurons.

      (3) L59 "has the capacity to" maybe just "can".

      This has been adopted. We agree that “can” is a more straightforward way of saying “has the capacity to” here. In this sentence, “can” and “has the capacity to” both mean a general ability to do something, without explicit knowledge about the conditions of use.

      (4) L61-62 some of this is circular "observation that CF regulates plasticity in S1..has consequences for plasticity in S1".

      We have now changed this to read “…consequences for input processing in S1.”

      (5) L91 "already existing whisker input" although I get it, strictly speaking, not clear what this means.

      This sentence has been reworded for clarity.

      (6) L94 "this form of plasticity" what form?

      Edited to read “sensory-evoked plasticity.”

      (7) L119 should say "to test the".

      This has been corrected.

      (8) L120 should say "well-suited to measure receptive fields".

      We agree; this wording has been adopted.

      (9) L130 should say "optical imaging demonstrated that receptive field".

      This has been adopted.

      (10) L138, the disclaimer is helpful, but wouldn't it be less confusing to just pick a different set of terms? Response potentiation etc.

      Perhaps, but we want to stress that components of LTP and LTD (traditionally tested using electrophysiological methods to specifically measure synaptic gain changes) can be optically measured as long as it is specified what is recorded.

      (11) L140, this whole section is not very clear. What was the experiment? What was done and how?

      The text in this section has been updated.

      (12) L154, 156, 158, 160, 960, what is a "basic response"? Is this supposed to contrast with RWS? If so, I would just say "we measured the response to whisker stimulation without first performing RWS, and compared this to the whisker stimulation with simultaneous CF activation."

      What we meant by “basic response” was the acute response of S1 neurons to a single 100 ms air puff. Here, we indeed measured the acute responses of S1 neurons to whisker stimulation (100 ms air puff) and compared them to whisker stimulation with simultaneous CF activation (100 ms air puff with a 50 ms light pulse; the light pulse was delayed 45 ms with respect to the air puff). This paragraph has been reworded for clarity.

      (13) L156 "comprised of a majority" unclear. You mean most of the nonspecific IN group is either PV or SST?

      Yes, that was meant here. This paragraph has been reworded for clarity.

      (14) L165 tense. "are activated" "we tested" prob should be "were activated."

      This sentence was reworded.

      (15) L173 Not requesting additional experiments, but demonstrating that the effect is mimicked by directly activating SST or suppressing VIP questions the specificity of CF activation per se, versus presumably many other pathways upstream of the same mechanisms, which might be worth acknowledging in the text.

      We indeed observe that directly activating SST or suppressing VIP neurons in S1 is sufficient to mediate the effect of CF activation on S1 pyramidal neurons, implicating SST and VIP neurons as the local effectors of CF signaling. In the text, we wrote “...the notion of sufficiency does not exclude potential effects of plasticity processes elsewhere that might well modulate effector activation in this context and others not yet tested.” Here, we mean that CFs are certainly not the only modulators of the inhibitory network in S1. One example we highlight in the discussion is that projections from M1 are known to modulate this disinhibitory VIP-to-SST-to-PN microcircuit in S1. We conclude from our chemogenetic manipulation experiments that CFs ultimately have the capacity to modulate S1 interneurons, which must occur indirectly (either through the thalamus or “upstream” regions as this reviewer points out). The fact that many other brain regions may also modulate the interneuron network in S1 -- or be modulated by CF activity themselves -- only expands the capacity of CFs to exert a variety of effects on S1 neurons in different contexts.

      (16) L247 "induced ChR2" awkward.

      We changed this to read “we expressed ChR2.”

      (17) 6C, what are the three colors supposed to represent?

      We apologize for the missing labels in this version of the manuscript. Figure 6C and the figure legend have been updated.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      The study aims to determine the role of Slit-Robo signaling in the development and patterning of cardiac innervation, a key process in heart development. Despite the well-studied roles of Slit axon guidance molecules in the development of the central nervous system, their roles in the peripheral nervous system are less clear. Thus, the present study addresses an important question. The study uses genetic knockout models to investigate how Slit2, Slit3, Robo1, and Robo2 contribute to cardiac innervation.

      Using constitutive and cell type-specific knockout mouse models, they show that the loss of endothelial-derived Slit2 reduces cardiac innervation. Additionally, Robo1 knockout, but not Robo2 knockout, recapitulated the Slit2 knockout effect on cardiac innervation, leading to the conclusion that Slit2-Robo1 signaling drives sympathetic innervation in the heart. Finally, the authors also show a reduction in isoproterenol-stimulated heart rate but not basal heart rate in the absence of endothelial Slit2.

      The conclusions of this paper are mostly well supported by the data, but some should be modified to account for the study's limitations and discussed in the context of previous literature.

      We would like to thank the reviewer for their positive evaluation of our manuscript and in response to the reviewer’s comments we have extended the discussion as indicated below.

      (1) It is well established that Slit ligands undergo proteolytic cleavage, generating N- and C-terminal fragments with distinct biological functions. Full-length Slit proteins and their fragments differ in cell association, with the N-terminal fragment typically remaining membrane-bound, while the C-terminal fragment is more diffusible. This distinction is crucial when evaluating the role of Slit proteins secreted by different cell types in the heart. However, this study does not examine or discuss the specific contributions of different Slit2 fragments, limiting its mechanistic insight into how Slit2 regulates cardiac innervation.

      This is a valid point and it will be of interest for future studies to investigate the specific effects of the full length versus N- and C-terminal fragments in the context of cardiac innervation development. We have updated our discussion with a clearer reference to the proteolytic cleavage of Slit2.

      (2) The endothelial-specific deletion of Slit2 leads to its loss in endothelial cells across various organs and tissues in the developing embryo. Therefore, the phenotypes observed in the heart may be influenced by defects in other parts of the embryo, such as the CNS or sympathetic ganglia, and this possibility cannot be ruled out.

      We agree and we have now added this possibility to the discussion.

      Reviewer #2 (Public review):

      The aims of investigating Slit-Robo signaling in cardiac innervation were achieved by the experiments designed. While questions remain regarding signal regulation and interplay between established axon guidance signals and further role of other Slit ligands and Robo expression in endothelium, the results strongly support the conclusions drawn.

      Writing and presentation are easy to follow and well structured. Appropriate controls are used, statistical analysis is applied appropriately, and experiments directly test aims following a logical story.

      The authors demonstrate a novel mechanism for Slit-Robo signaling in cardiac sympathetic innervation. The data establishes a framework for future studies.

      We would like to thank the reviewer for these positive comments.

      Recommendations:

      Further assessment of interplay between Slit ligands as well as other signaling pathways (Semaphorin, NGF, etc) could be investigated. Is it possible to rescue the phenotype by modulation of other signaling pathways? Can combined Slit2/Slit3 KO rescue? Additionally, as the authors state, conditional Robo1 knockouts will be important to validate the findings of constitutive knockout.

      Our study has provided the first data on the role of Slit-Robo signalling during cardiac innervation development and a base for exploring the interesting further questions the reviewer raises.  

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      There is a typo on line 83 (disease).

      This has been corrected.

    1. Author response:

      The following is the authors’ response to the original reviews

      We would like to thank all reviewers for their constructive and in-depth reviews. Thanks to your feedback, we realized that the main objective of the paper was not presented clearly enough, and that our use of the same “modality-agnostic” terminology for both decoders and representations caused confusion. We addressed these two major points as outlined in the following. 

      In the revised manuscript, we highlight that the main contribution of this paper is to introduce modality-agnostic decoders. Apart from introducing this new decoder type, we put forward their advantages in comparison to modality-specific decoders in terms of decoding performance and analyze the modality-invariant representations (cf. updated terminology in the following paragraph) that these decoders rely on. The dataset that these analyses are based on is released as part of this paper, in the spirit of open science (but this dataset is only a secondary contribution for our paper). 

      Regarding the terminology, we clearly define modality-agnostic decoders as decoders that are trained on brain imaging data from subjects exposed to stimuli in multiple modalities. The decoder is not given any information on which modality a stimulus was presented in, and is therefore trained to operate in a modality-agnostic way. In contrast, modality-specific decoders are trained only on data from a single stimulus modality. These terms are explained in Figure 2. While these terms describe different ways of how decoders can be trained, there are also different ways to evaluate them afterwards (see also Figure 3); but obviously, this test-time evaluation does not change the nature of the decoder, i.e., there is no contradiction in applying a modality-specific decoder to brain data from a different modality.

      Further, we identify representations that are relevant for modality-agnostic decoders using the searchlight analysis. We realized that our choice of using the same “modality-agnostic” term to describe these brain representations created unnecessary debate and confusion. In order to not conflate the terminology, in the updated manuscript we call these representations modality-invariant (and the opposite modality-dependent). Our methodology does not allow us to distinguish whether certain representations merely share representational structure to a certain degree, or are truly representations that abstract away from any modality-dependent information. However, in order to be useful for modality-agnostic decoding, a significant degree of shared representational structure is sufficient, and it is this property of brain representations that we now define as “modality-invariant”. 

      We updated the manuscript in line with this new terminology and focus: in particular, the first Related Work section on Modality-invariant brain representations, as well as the Introduction and Discussion.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors introduce a densely-sampled dataset where 6 participants viewed images and sentence descriptions derived from the MS Coco database over the course of 10 scanning sessions. The authors further showcase how image and sentence decoders can be used to predict which images or descriptions were seen, using pairwise decoding across a set of 120 test images. The authors find decodable information widely distributed across the brain, with a left-lateralized focus. The results further showed that modality-agnostic models generally outperformed modality-specific models, and that data based on captions was not explained better by caption-based models but by modality-agnostic models. Finally, the authors decoded imagined scenes.

      Strengths:

      (1) The dataset presents a potentially very valuable resource for investigating visual and semantic representations and their interplay.

      (2) The introduction and discussion are very well written in the context of trying to understand the nature of multimodal representations and present a comprehensive and very useful review of the current literature on the topic.

      Weaknesses:

      (1) The paper is framed as presenting a dataset, yet most of it revolves around the presentation of findings in relation to what the authors call modality-agnostic representations, and in part around mental imagery. This makes it very difficult to assess the manuscript, whether the authors have achieved their aims, and whether the results support the conclusions.

      Thanks for this insightful remark. The dataset release is only a secondary contribution of our study; this was not clear enough in the previous version. We updated the manuscript to make the main objective of the paper more clear, as outlined in our general response to the reviews (see above).

      (2) While the authors have presented a potential use case for such a dataset, there is currently far too little detail regarding data quality metrics expected from the introduction of similar datasets, including the absence of head-motion estimates, quality of intersession alignment, or noise ceilings of all individuals.

      As already mentioned in the general response, the main focus of the paper is to introduce modality-agnostic decoders. The dataset is released in addition, which is why we did not focus on reporting extensive quality metrics in the original manuscript. To respond to your request, we updated the appendix of the manuscript to include a range of data quality metrics.

      The updated appendix includes head motion estimates in the form of realignment parameters and framewise displacement, as well as a metric to assess the quality of intersession alignment. More detailed descriptions can be found in Appendix 1 of the updated manuscript.
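
      For readers unfamiliar with the metric, framewise displacement is commonly computed from the six realignment parameters as the sum of absolute frame-to-frame changes, with rotations converted to millimetres on a sphere. The sketch below follows that common definition and is not taken from the manuscript. The 50 mm radius, the parameter order, the use of radians, and the example values are all assumptions.

```python
def framewise_displacement(params, radius_mm=50.0):
    """Framewise displacement per volume.

    params: list of per-volume [tx, ty, tz, rx, ry, rz], with translations
    in mm and rotations in radians (assumed convention).
    radius_mm: assumed head radius used to convert rotations to arc length.
    """
    fd = [0.0]  # the first volume has no predecessor
    for prev, curr in zip(params, params[1:]):
        trans = sum(abs(c - p) for p, c in zip(prev[:3], curr[:3]))
        rot = sum(abs(c - p) for p, c in zip(prev[3:], curr[3:]))
        fd.append(trans + radius_mm * rot)
    return fd


# Hypothetical realignment parameters for three volumes.
motion = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.0],    # 0.1 mm translation in x
    [0.1, 0.0, 0.0, 0.002, 0.0, 0.0],  # 0.002 rad rotation about x
]
fd = framewise_displacement(motion)
```

      Different software packages report rotations in degrees or radians and order the parameters differently, so the conversion step would need to match the realignment output actually used.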

      Estimating noise ceilings based on repeated presentations of stimuli (as for example done in Allen et al. (2022)) requires multiple betas for each stimulus. All training stimuli were only presented once, so this could only be done for the test stimuli which were presented repeatedly. However, during our preprocessing procedure we directly calculated stimulus-specific betas based on data from all sessions using one single GLM, which means that we did not obtain separate betas for repeated presentations of the same stimulus. We will however share the raw data publicly, so that such noise ceilings can be calculated using an adapted preprocessing procedure if required.

      Allen, E. J., St-Yves, G., Wu, Y., Breedlove, J. L., Prince, J. S., Dowdle, L. T., Nau, M., Caron, B., Pestilli, F., Charest, I., Hutchinson, J. B., Naselaris, T., & Kay, K. (2022). A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature Neuroscience, 25(1), 116–126. https://doi.org/10.1038/s41593-021-00962-x

      (3) The exact methods and statistical analyses used are still opaque, making it hard for a reader to understand how the authors achieved their results. More detail in the manuscript would be helpful, specifically regarding the exact statistical procedures, what tests were performed across, or how data were pooled across participants.

      In the updated manuscript, we improved the level of detail for the descriptions of statistical analyses wherever possible (see also our response to your “Recommendations for the authors”, Point 6).

      Regarding data pooling across participants: 

      Figure 8 shows averaged results across all subjects (as indicated in the caption).

      Regarding data pooling for the estimation of the significance threshold of the searchlight analysis for modality-invariant regions: We updated the manuscript to clarify that we performed a permutation test, combined with a bootstrapping procedure to estimate a group-level null distribution: “For each subject, we evaluated the decoders 100 times with shuffled labels to create per-subject chance-level results. Then, we randomly selected one of the 100 chance-level results for each of the 6 subjects and calculated group-level statistics (TFCE values) the exact same way as described in the preceding paragraph. We repeated this procedure 10,000 times resulting in 10,000 permuted group-level results.”

      Additionally, we indicated that the same permutation testing methods were applied to assess the significance threshold for the imagery decoding searchlight maps (Figure 10). 

      (4) Many findings (e.g., Figure 6) are still qualitative but could be supported by quantitative measures.

      Figures 6 and 7 intentionally present qualitative results to support the quantitative decoding results presented in Figures 4 and 5 (see also Reviewer 2, Comment 2).

      Figures 4 and 5 show pairwise decoding accuracy as a quantitative measure for evaluation of the decoders. This is the main metric we used to compare different decoder types and features. Based on the finding that modality-agnostic decoders using ImageBind features achieve the best score on this metric, we performed the additional qualitative analysis presented in Figures 6 and 7. (Note that we expanded the candidate set for the qualitative analysis in order to have a larger and more diverse set of images.)
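      To make the metric concrete, the following is a minimal sketch of one common way to compute a pairwise accuracy. It assumes that a prediction counts as correct for a given (target, distractor) pair when its cosine similarity to the matching stimulus features exceeds its similarity to the distractor; the exact definition and distance measure used in the manuscript may differ, and the function name is hypothetical.

```python
import numpy as np

def pairwise_accuracy(predicted, targets):
    """Fraction of (target, distractor) pairs ranked correctly.

    predicted, targets: arrays of shape (n_stimuli, n_features), where row i
    of `predicted` is the decoder output for the stimulus in row i of `targets`.
    """
    pred = predicted / np.linalg.norm(predicted, axis=1, keepdims=True)
    targ = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    sim = pred @ targ.T                      # sim[i, j]: prediction i vs. candidate j
    # Prediction i is correct against distractor j if it is closer to target i.
    correct = sim.diagonal()[:, None] > sim
    off_diagonal = ~np.eye(sim.shape[0], dtype=bool)
    return correct[off_diagonal].mean()
```

      Under this definition, chance level is 0.5, since a random prediction is equally likely to be closer to the target or to the distractor.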

      (5) Results are significant in regions that typically lack responses to visual stimuli, indicating potential bias in the classifier. This is relevant for the interpretation of the findings. A classification approach less sensitive to outliers (e.g., 70-way classification) could avoid this issue. Given the extreme collinearity of the experimental design, regressors in close temporal proximity will be highly similar, which could lead to leakage effects.

      It is true that our searchlight analysis revealed significant activity in regions outside of the visual cortex. However, it is assumed that the processing of visual information does not stop at the border of the visual cortex. The integration of information such as the semantics of the image is progressively processed in other higher-level regions of the brain. Recent studies have shown that activity in large areas of the cortex (including many outside of the visual cortex) can be related to visual stimulation (Solomon et al. 2024; Raugel et al. 2025). Our work confirms this finding and we therefore do not see reason to believe that this is due to a bias in our decoders.

      Further, you are suggesting that we could replace our regression approach with a 70-way classification. However, this is difficult using our fMRI data as we do not see a straightforward way to assign the training and testing stimuli with class labels (the two datasets consist of non-overlapping sets of naturalistic images).

      To address your concerns regarding the collinearity of the experimental design and possible leakage effects, we trained and evaluated a decoder for one subject after running a “null-hypothesis” adapted preprocessing. More specifically, for all sessions, we shifted the functional data of all runs by one run (moving the data of the last run to the very front) while leaving the design matrices in place. Thereby, we destroyed the relationship between stimuli and brain activity but kept the original data and design with its collinearity (and possible biases). We preprocessed this adapted data for subject 1, ran a whole-brain decoding using ImageBind features, and verified that the decoding performance was at chance level: Pairwise accuracy (captions): 0.43 | Pairwise accuracy (images): 0.47 | Pairwise accuracy (imagery): 0.50. This result provides evidence against the notion that potential collinearity or biases in our experimental design or evaluation procedure could have led to inflated results.
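      The run shift described above can be sketched as follows. This is only an illustration with placeholder run labels; the function and variable names are hypothetical and the actual preprocessing pipeline is not shown here.

```python
def shift_runs_by_one(functional_runs):
    """Move the last run's data to the front; all other runs move back by one."""
    return functional_runs[-1:] + functional_runs[:-1]

# Placeholder labels standing in for the functional data and design of one session.
runs = ["run1_bold", "run2_bold", "run3_bold"]
designs = ["run1_design", "run2_design", "run3_design"]

# Design matrices stay in place, so every run is paired with a design that
# describes a different run's stimuli.
mismatched = list(zip(shift_runs_by_one(runs), designs))
```

      Because only the pairing changes, the temporal structure of both the data and the design is preserved while the stimulus-response relationship is broken.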

      Raugel, J., Szafraniec, M., Vo, H.V., Couprie, C., Labatut, P., Bojanowski, P., Wyart, V. and King, J.R. (2025). Disentangling the Factors of Convergence between Brains and Computer Vision Models. arXiv preprint arXiv:2508.18226.

      Solomon, S. H., Kay, K., & Schapiro, A. C. (2024). Semantic plasticity across timescales in the human brain. bioRxiv, 2024-02.

      (6) The manuscript currently lacks a limitations section, specifically regarding the design of the experiment. This involves the use of the overly homogenous dataset Coco, which invites overfitting, the mixing of sentence descriptions and visual images, which invites imagery of previously seen content, and the use of a 1-back task, which can lead to carry-over effects to the subsequent trial.

      Regarding the COCO dataset: We agree that COCO is somewhat homogeneous; it is, however, much more diverse and naturalistic than the smaller datasets used in previous fMRI experiments with multimodal stimuli. Additionally, COCO has been widely adopted as a benchmark dataset in the machine learning community and features rich annotations for each image (e.g. object labels, segmentations, additional captions, people’s keypoints), facilitating many more future analyses based on our data.

      Regarding the mixing of sentence descriptions and images: Subjects were not asked to visualize sentences, and they may have used different strategies for the one-back task. Generally, we do not see it as problematic if subjects are performing visual imagery to some degree while reading sentences, and this might even be the case during normal reading as well. A more targeted experiment comparing reading with and without interleaved visual stimulation in the form of images and a one-back task would be required to assess this, but this was not the focus of our study. For now, it is true that we cannot be sure that our results generalize to cases in which subjects are just reading and are less incentivized to perform mental imagery.

      Regarding the use of a 1-back task: It was necessary to make some design choices in order to realize this large-scale data collection with approximately 10 hours of recording per subject. Specifically, the 1-back task was included in the experimental setup in order to ensure continuous engagement of the participant during the rather long sessions of 1 hour. The subjects did indeed need to remember the previous stimulus to succeed at the 1-back task, which means that some brain activity during the presentation of a stimulus is likely to be related to the previous stimulus. We aimed to account for this confound during the preprocessing stage when fitting the GLM, which was fit to capture only the response to the presented image/caption, not the preceding one. Still, it might have picked up on some of the activity from preceding stimuli, causing some decrease in the final decoding performance.

      We added a limitations section to the updated manuscript to discuss these important issues.

      (7) I would urge the authors to clarify whether the primary aim is the introduction of a dataset and showing the use of it, or whether it is the set of results presented. This includes the title of this manuscript. While the decoding approach is very interesting and potentially very valuable, I believe that the results in the current form are rather descriptive, and I'm wondering what specifically they add beyond what is known from other related work. This includes imagery-related results. This is completely fine! It just highlights that a stronger framing as a dataset is probably advantageous for improving the significance of this work.

      Thanks a lot for pointing this out. Based on this comment and feedback from the other reviewers, we restructured the abstract, introduction and discussion section of the paper to better reflect the primary aim (cf. general response above).

      You further mention that it is not clear what our results add beyond what is known from related work. We list the main contributions here:

      A single modality-agnostic decoder can decode the semantics of visual and linguistic stimuli irrespective of the presentation modality with a performance that is not lagging behind modality-specific decoders.

      Modality-agnostic decoders outperform modality-specific decoders for decoding captions and mental imagery.

      Modality-invariant representations are widespread across the cortex (a range of previous work has suggested they were much more localized: Bright et al. 2004; Jung et al. 2018; Man et al. 2012; Simanova et al. 2014).

      Regions that are useful for imagery are largely overlapping with modality-invariant regions.

      Bright, P., Moss, H., & Tyler, L. K. (2004). Unitary vs multiple semantics: PET studies of word and picture processing. Brain and language, 89(3), 417-432.

      Jung, Y., Larsen, B., & Walther, D. B. (2018). Modality-Independent Coding of Scene Categories in Prefrontal Cortex. Journal of Neuroscience, 38(26), 5969–5981.

      Liuzzi, A. G., Bruffaerts, R., Peeters, R., Adamczuk, K., Keuleers, E., De Deyne, S., Storms, G., Dupont, P., & Vandenberghe, R. (2017). Cross-modal representation of spoken and written word meaning in left pars triangularis. NeuroImage, 150, 292–307. https://doi.org/10.1016/j.neuroimage.2017.02.032

      Man, K., Kaplan, J. T., Damasio, A., & Meyer, K. (2012). Sight and Sound Converge to Form Modality-Invariant Representations in Temporoparietal Cortex. Journal of Neuroscience, 32(47), 16629–16636.

      Simanova, I., Hagoort, P., Oostenveld, R., & van Gerven, M. A. J. (2014). Modality-Independent Decoding of Semantic Information from the Human Brain. Cerebral Cortex, 24(2), 426–434.

      Reviewer #2 (Public review):

      Summary:

      This study introduces SemReps-8K, a large multimodal fMRI dataset collected while subjects viewed natural images and matched captions, and performed mental imagery based on textual cues. The authors aim to train modality-agnostic decoders--models that can predict neural representations independently of the input modality - and use these models to identify brain regions containing modality-agnostic information. They find that such decoders perform comparably or better than modality-specific decoders and generalize to imagery trials.

      Strengths:

      (1) The dataset is a substantial and well-controlled contribution, with >8,000 image-caption trials per subject and careful matching of stimuli across modalities - an essential resource for testing theories of abstract and amodal representation.

      (2) The authors systematically compare unimodal, multimodal, and cross-modal decoders using a wide range of deep learning models, demonstrating thoughtful experimental design and thorough benchmarking.

      (3) Their decoding pipeline is rigorous, with informative performance metrics and whole-brain searchlight analyses, offering valuable insights into the cortical distribution of shared representations.

      (4) Extension to mental imagery decoding is a strong addition, aligning with theoretical predictions about the overlap between perception and imagery.

      Weaknesses:

      While the decoding results are robust, several critical limitations prevent the current findings from conclusively demonstrating truly modality-agnostic representations:

      (1) Shared decoding ≠ abstraction: Successful decoding across modalities does not necessarily imply abstraction or modality-agnostic coding. Participants may engage in modality-specific processes (e.g., visual imagery when reading, inner speech when viewing images) that produce overlapping neural patterns. The analyses do not clearly disambiguate shared representational structure from genuinely modality-independent representations. Furthermore, in Figure 5, the modality-agnostic encoder did not perform better than the modality-specific decoder trained on images (in decoding images), but outperformed the modality-specific decoder trained on captions (in decoding captions). This asymmetry contradicts the premise of a truly "modality-agnostic" encoder. Additionally, given the similar performance between modality-agnostic decoders based on multimodal versus unimodal features, it remains unclear why neural representations did not preferentially align with multimodal features if they were truly modality-independent.

      We agree that successful modality-agnostic and cross-modal decoding does not necessarily imply that abstract patterns were decoded. In the updated manuscript, we therefore refer to these representations as modality-invariant (see also the updated terminology explained in the general response above).

      If participants are performing mental imagery when reading, and this is allowing us to perform cross-decoding, then this means that modality-invariant representations are formed during this mental imagery process, i.e. that the representations formed during this form of mental imagery are compatible with representations during visual perception (or, in your words, produce overlapping neural patterns). While we can not know to what extent people were performing mental imagery while reading (or having inner speech while viewing images), our results demonstrate that their brain activity allows for decoding across modalities, which implies that modality-invariant representations are present.

      It is true that our current analyses can not disambiguate modality-invariant representations (or, in your words, shared representational structure) from abstract representations (in your words, genuinely modality-independent representations). As the main goal of the paper was to build modality-agnostic decoders, and these only require what we call “modality-invariant” representations (see our updated terminology in the general reviewer response above), we leave this question open for future work. We do however discuss this important limitation in the Discussion section of the updated manuscript.

      Regarding the asymmetry of decoding results when comparing modality-agnostic decoders with the two respective modality-specific decoders for captions and images: We do not believe that this asymmetry contradicts the premise of a modality-agnostic decoder. Multiple explanations for this result are possible: (1) The modality-specific decoder for images might benefit from the more readily decodable lower-level modality-dependent neural activity patterns in response to images, which are less useful for the modality-agnostic decoder because they are not useful for decoding caption trials. The modality-specific decoders for captions might not be able to pick up on low-level modality-dependent neural activity patterns as these might be less easily decodable. 

      (2) The signal-to-noise ratio for caption trials might be lower than for image trials (cf. generally lower caption decoding performance), therefore the addition of training data (even if it is from another modality) improves the decoding performance for captions, but not for images (which might be at ceiling already).

      Regarding the similar performance between modality-agnostic decoders based on multimodal versus unimodal features: Unimodal features are based on rather high-level features of the respective modality (e.g. last-layer features of a model trained for semantic image classification), which can be already modality-invariant to some degree. Additionally, as already mentioned before, in the updated manuscript we only require representations to be modality-invariant and not necessarily abstract.

      (2) The current analysis cannot definitively conclude that the decoder itself is modality-agnostic, making "Qualitative Decoding Results" difficult to interpret in this context. This section currently provides illustrative examples, but lacks systematic quantitative analyses.

      Figures 6 and 7 present illustrative qualitative examples that complement the quantitative results presented in Figures 4 and 5 (see also Reviewer 1, Comment 4).

      Figures 4 and 5 show pairwise decoding accuracy as a quantitative measure for evaluation of the decoders. This is the main metric we used to compare different decoder types and features. Based on the finding that modality-agnostic decoders using ImageBind features achieve the best score on this metric, we performed the additional qualitative analysis presented in Figures 6 and 7. (Note that we expanded the candidate set for the qualitative analysis in order to have a larger and more diverse set of images.)

      (3) The use of mental imagery as evidence for modality-agnostic decoding is problematic.

      Imagery involves subjective, variable experiences and likely draws on semantic and perceptual networks in flexible ways. Strong decoding in imagery trials could reflect semantic overlap or task strategies rather than evidence of abstraction.

      It is true that mental imagery does not necessarily rely on modality-agnostic representations. In the updated manuscript we revised our terminology and refer to the analyzed representations as modality-invariant, which we define as “representations that significantly overlap between modalities”. 

      The manuscript presents a methodologically sophisticated and timely investigation into shared neural representations across modalities. However, the current evidence does not clearly distinguish between shared semantics, overlapping unimodal processes, and true modality-independent representations. A more cautious interpretation is warranted.

      Nonetheless, the dataset and methodological framework represent a valuable resource for the field.

      We fully agree with these observations, and updated our terminology as outlined in the general response.

      Reviewer #3 (Public review):

      Summary:

      The authors recorded brain responses while participants viewed images and captions. The images and captions were taken from the COCO dataset, so each image has a corresponding caption, and each caption has a corresponding image. This enabled the authors to extract features from either the presented stimulus or the corresponding stimulus in the other modality.

      The authors trained linear decoders to take brain responses and predict stimulus features.

      "Modality-specific" decoders were trained on brain responses to either images or captions, while "modality-agnostic" decoders were trained on brain responses to both stimulus modalities. The decoders were evaluated on brain responses while the participants viewed and imagined new stimuli, and prediction performance was quantified using pairwise accuracy. The authors reported the following results:

      (1) Decoders trained on brain responses to both images and captions can predict new brain responses to either modality.

      (2) Decoders trained on brain responses to both images and captions outperform decoders trained on brain responses to a single modality.

      (3) Many cortical regions represent the same concepts in vision and language.

      (4) Decoders trained on brain responses to both images and captions can decode brain responses to imagined scenes.

      Strengths:

      This is an interesting study that addresses important questions about modality-agnostic representations. Previous work has shown that decoders trained on brain responses to one modality can be used to decode brain responses to another modality. The authors build on these findings by collecting a new multimodal dataset and training decoders on brain responses to both modalities.

      To my knowledge, SemReps-8K is the first dataset of brain responses to vision and language where each stimulus item has a corresponding stimulus item in the other modality. This means that brain responses to a stimulus item can be modeled using visual features of the image, linguistic features of the caption, or multimodal features derived from both the image and the caption. The authors also employed a multimodal one-back matching task, which forces the participants to activate modality-agnostic representations. Overall, SemReps-8K is a valuable resource that will help researchers answer more questions about modality-agnostic representations.

      The analyses are also very comprehensive. The authors trained decoders on brain responses to images, captions, and both modalities, and they tested the decoders on brain responses to images, captions, and imagined scenes. They extracted stimulus features using a range of visual, linguistic, and multimodal models. The modeling framework appears rigorous, and the results offer new insights into the relationship between vision, language, and imagery. In particular, the authors found that decoders trained on brain responses to both images and captions were more effective at decoding brain responses to imagined scenes than decoders trained on brain responses to either modality in isolation. The authors also found that imagined scenes can be decoded from a broad network of cortical regions.

      Weaknesses:

      The characterization of "modality-agnostic" and "modality-specific" decoders seems a bit contradictory. There are three major choices when fitting a decoder: the modality of the training stimuli, the modality of the testing stimuli, and the model used to extract stimulus features. However, the authors characterize their decoders based on only the first choice-"modality-specific" decoders were trained on brain responses to either images or captions, while "modality-agnostic" decoders were trained on brain responses to both stimulus modalities. I think that this leads to some instances where the conclusions are inconsistent with the methods and results.

      In our analysis setup, a decoder is entirely determined by two factors: (1) the modality of the stimuli that the subject was exposed to, and (2) the machine learning model used to extract stimulus features.

      The modality of the testing stimuli defines whether we are evaluating the decoder in a within-modality or cross-modality setting, but is not an inherent characteristic of a trained decoder.
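      The distinction between decoder types can be illustrated with a minimal sketch. It assumes ridge regression as the linear decoder and uses random synthetic arrays in place of brain responses and stimulus features; the regularization strength, array sizes, and all names are hypothetical and not taken from the manuscript.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression mapping brain responses X to stimulus features Y."""
    n_voxels = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_voxels), X.T @ Y)

rng = np.random.default_rng(0)
n_trials, n_voxels, n_features = 200, 30, 8
# Synthetic stand-ins for brain responses and stimulus features (not real data).
X_image, X_caption = rng.normal(size=(2, n_trials, n_voxels))
Y_image, Y_caption = rng.normal(size=(2, n_trials, n_features))

# The decoder type is fixed by the training data alone; the same feature
# model is used for all three decoders.
decoders = {
    "modality-specific (images)": fit_ridge(X_image, Y_image),
    "modality-specific (captions)": fit_ridge(X_caption, Y_caption),
    "modality-agnostic": fit_ridge(
        np.vstack([X_image, X_caption]), np.vstack([Y_image, Y_caption])
    ),
}
```

      Any of these weight matrices can then be applied to held-out image, caption, or imagery responses; which test set is chosen determines whether the evaluation is within-modality or cross-modality.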

      First, the authors suggest that "modality-specific decoders are not explicitly encouraged to pick up on modality-agnostic features during training" (line 137) while "modality-agnostic decoders may be more likely to leverage representations that are modality-agnostic" (line 140). However, whether a decoder is required to learn modality-agnostic representations depends on both the training responses and the stimulus features. Consider the case where the stimuli are represented using linguistic features of the captions. When you train a "modality-specific" decoder on image responses, the decoder is forced to rely on modality-agnostic information that is shared between the image responses and the caption features. On the other hand, when you train a "modality-agnostic" decoder on both image responses and caption responses, the decoder has access to the modality-specific information that is shared by the caption responses and the caption features, so it is not explicitly required to learn modality-agnostic features. As a result, while the authors show that "modality-agnostic" decoders outperform "modality-specific" decoders in most conditions, I am not convinced that this is because they are forced to learn more modality-agnostic features.

      It is true that, for example, a modality-specific decoder trained on fMRI data from images with stimulus features extracted from captions might also rely on modality-invariant features. We still call this decoder modality-specific, as it has been trained to decode brain activity recorded from a specific stimulus modality. In the updated manuscript we corrected the statement that “modality-specific decoders are not explicitly encouraged to pick up on modality-invariant features during training” to include the case of decoders trained on features from the other modality, which might also rely on modality-invariant features.

      It is true that a modality-agnostic decoder can also have access to modality-dependent information for captions and images. However, as it is trained jointly with both modalities and the modality-dependent features are not compatible, it is encouraged to rely on modality-invariant features. The result that modality-agnostic decoders outperform modality-specific decoders trained on captions for decoding captions confirms this, because if the decoder relied only on modality-dependent features, adding training data from another stimulus modality could not increase the performance. (Also, the lack of a performance drop compared to modality-specific decoders trained on images is only possible thanks to the reliance on modality-invariant features. If the decoder relied only on modality-dependent features, the addition of data from another modality would amount to adding noise to the training data, which must result in a performance drop at test time.) We cannot exclude the possibility that modality-agnostic decoders are also relying on modality-dependent features, but our results suggest that they are relying at least to some degree on modality-invariant features.

      Second, the authors claim that "modality-specific decoders can be applied only in the modality that they were trained on, while "modality-agnostic decoders can be applied to decode stimuli from multiple modalities, even without knowing a priori the modality the stimulus was presented in" (line 47). While "modality-agnostic" decoders do outperform "modality-specific" decoders in the cross-modality conditions, it is important to note that "modality-specific" decoders still perform better than expected by chance (figure 5). It is also important to note that knowing about the input modality still improves decoding performance even for "modality-agnostic" decoders, since it determines the optimal feature space-it is better to decode brain responses to images using decoders trained on image features, and it is better to decode brain responses to captions using decoders trained on caption features.

      Thanks for this important remark. We corrected this statement and now say that “modality-specific decoders that are trained to be applied only in the modality that they were trained on”, highlighting that their training process optimizes them for decoding in a specific modality. They can indeed be applied to the other modality at test time; this, however, results in a substantial performance drop.

      It is true that knowing the input modality can improve performance even for modality-agnostic decoders. This can most likely be explained by the fact that in that case the decoder can leverage both modality-invariant and modality-dependent features. We will not focus further on this result, however, as the main motivation to build modality-agnostic decoders is to be able to decode stimuli without knowing the stimulus modality a priori.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      I will list additional recommendations below in no specific order:

      (1) I find the term "modality agnostic" quite unusual, and I believe I haven't seen it used outside of the ML community. I would urge the authors to change the terminology to be more common, or at least very early explain why the term is much better suited than the range of existing terms. A modality agnostic representation implies that it is not committed to a specific modality, but it seems that a representation cannot be committed to something.

      In the updated manuscript we now refer to the identified brain patterns as modality-invariant, a term that has previously been used in the literature (Man et al. 2012; Devereux et al. 2013; Patterson & Lambon Ralph 2016; Deniz et al. 2019; Nakai et al. 2021) (see also the general response above and the Introduction and Related Work sections of the updated manuscript).

      We continue to refer to the decoders as modality-agnostic, as this is a new type of decoder, and describes the fact that they are trained in a way that abstracts away from the modality of the stimuli. We chose this term as we are not aware of any work in which brain decoders were trained jointly on multiple stimulus modalities and in order not to risk contradictions/confusions with other definitions.

      Deniz, F., Nunez-Elizalde, A. O., Huth, A. G., & Gallant, J. L. (2019). The Representation of Semantic Information Across Human Cerebral Cortex During Listening Versus Reading Is Invariant to Stimulus Modality. Journal of Neuroscience, 39(39), 7722–7736. https://doi.org/10.1523/JNEUROSCI.0675-19.2019

      Devereux, B. J., Clarke, A., Marouchos, A., & Tyler, L. K. (2013). Representational Similarity Analysis Reveals Commonalities and Differences in the Semantic Processing of Words and Objects. The Journal of Neuroscience, 33(48).

      Nakai, T., Yamaguchi, H. Q., & Nishimoto, S. (2021). Convergence of Modality Invariance and Attention Selectivity in the Cortical Semantic Circuit. Cerebral Cortex, 31(10), 4825–4839. https://doi.org/10.1093/cercor/bhab125

      Man, K., Kaplan, J. T., Damasio, A., & Meyer, K. (2012). Sight and Sound Converge to Form Modality-Invariant Representations in Temporoparietal Cortex. Journal of Neuroscience, 32(47), 16629–16636.

      Patterson, K., & Lambon Ralph, M. A. (2016). The Hub-and-Spoke Hypothesis of Semantic Memory. In Neurobiology of Language (pp. 765–775). Elsevier. https://doi.org/10.1016/B978-0-12-407794-2.00061-4

      (2) The table in Figure 1B would benefit from also highlighting the number of stimuli that have overlapping captions and images.

      The number of overlapping stimuli is rather small (153-211 stimuli depending on the subject). We added this information to Table 1B. 

      (3) The authors wrote that training stimuli were presented only once, yet they used a one-back task. Did the authors also exclude the first presentation of these stimuli?

      Thanks for pointing this out. It is indeed true that some training stimuli were presented more than once, but only for the case of one-back target trials. In these cases the second presentation of the stimulus was excluded, but not the first. As the subject can not be aware of the fact that the upcoming presentation is going to be a one-back target, the first presentation can not be affected by the presence of the subsequent repeated presentation. We updated the manuscript to clarify this issue.

      (4) Coco has roughly 80-90 categories, so many image captions will be extremely similar (e.g., "a giraffe walking", "a surfer on a wave", etc.). How can people keep these apart?

      It is true that some captions and images are highly similar even though they are not matched in the dataset. This might have resulted in several false button presses, where subjects identified an image-caption pair as matching although it was not a matching pair in the dataset. However, as no feedback was given on task performance, this issue should not have had a major influence on the brain activity of the participants.

      (5) Footnotes for statistics are quite unusual - could the authors integrate statistics into the text?

      Thanks for this remark. In the updated manuscript, all statistics are part of the main text.

      (6) It may be difficult to achieve the assumptions of a permutation test - exchangeability, which may bias statistical results. It is not uncommon for densely sampled datasets to use bootstrap sampling on the predictions of the test data to identify if a given percentile of that distribution crosses 0. The lowest p-value is given by the number of bootstrap samples (e.g., if all 10,000 bootstrap samples are above chance, then p < 0.0001). This may turn out to be more effective.

      Thanks for this comment. Our statistical procedure did in fact involve a bootstrapping procedure to generate a null distribution at the group level. We updated the manuscript to describe this method in more detail. Here is the updated paragraph: “To estimate the statistical significance of the resulting clusters we performed a permutation test, combined with a bootstrapping procedure to estimate a group-level null distribution (see also Stelzer et al., 2013). For each subject, we evaluated the decoders 100 times with shuffled labels to create per-subject chance-level results. Then, we randomly selected one of the 100 chance-level results for each of the 6 subjects and calculated group-level statistics (TFCE values) the exact same way as described in the preceding paragraph. We repeated this procedure 10,000 times resulting in 10,000 permuted group-level results. We ensured that every permutation was unique, i.e. no two permutations were based on the same combination of selected chance-level results. Based on this null distribution, we calculated p-values for each vertex by calculating the proportion of sampled permutations where the TFCE value was greater than the observed TFCE value. To control for multiple comparisons across space, we always considered the maximum TFCE score across vertices for each group-level permutation (Smith and Nichols, 2009).”
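
      The procedure quoted above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `group_level_pvalues`, the array layout, and the `group_stat` callable (a stand-in for the TFCE computation, which is not reproduced here) are all hypothetical.

```python
import numpy as np

def group_level_pvalues(observed, subject_nulls, group_stat, n_draws=10_000, seed=0):
    """Max-statistic p-values from a group-level null distribution.

    observed      : (n_vertices,) observed group-level statistic.
    subject_nulls : (n_subjects, n_perms, n_vertices) per-subject
                    chance-level maps (decoders run with shuffled labels).
    group_stat    : callable mapping (n_subjects, n_vertices) -> (n_vertices,);
                    hypothetical stand-in for the TFCE computation.
    """
    rng = np.random.default_rng(seed)
    n_subjects, n_perms, _ = subject_nulls.shape
    if n_draws > n_perms ** n_subjects:
        raise ValueError("not enough unique combinations for n_draws")

    seen = set()                      # combinations already used
    max_null = np.empty(n_draws)      # one maximum per group-level draw
    i = 0
    while i < n_draws:
        # Pick one chance-level result per subject.
        picks = rng.integers(0, n_perms, size=n_subjects).tolist()
        key = tuple(picks)
        if key in seen:               # enforce uniqueness of combinations
            continue
        seen.add(key)
        maps = subject_nulls[np.arange(n_subjects), picks]
        # Keep only the maximum across vertices (multiple-comparison control).
        max_null[i] = group_stat(maps).max()
        i += 1

    # Proportion of null maxima strictly greater than each observed value.
    return (max_null[None, :] > observed[:, None]).mean(axis=1)
```

      With 6 subjects and 100 shuffles each there are 100^6 possible combinations, so the uniqueness check rarely rejects a draw; it is kept here only because the quoted paragraph states that every permutation was unique.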

      (7) The authors present no statistical evidence for some of their claims (e.g., lines 335-337). It would be good if they could complement this in their description. Further, the visualization in Figure 4 is rather opaque. It would help if the authors could add a separate bar for the average modality-specific and modality-agnostic decoders or present results in a scatter plot, showing modality-specific on the x-axis and modality-agnostic on the y-axis and color-code the modality (i.e., making it two scatter colors, one for images, one for captions). All points will end up above the diagonal.

      We updated the manuscript and added statistical evidence for the claims made:

      We now report results for the claim that when considering the average decoding performance for images and captions, modality-agnostic decoders perform better than modality-specific decoders, irrespective of the features that the decoders were trained on.

      Additionally, we report the average modality-agnostic and modality-specific decoding accuracies corresponding to Figure 4. For modality-agnostic decoders the average value is 81.86%, for modality-specific decoders trained on images 78.15%, and for modality-specific decoders trained on captions 72.52%. We did not add a separate bar to Figure 4 as this would add additional information to a figure which is already very dense in its information content (cf. Reviewer 2’s recommendations for the authors). We therefore believe it is more useful to report the average values in the text and provide results for a statistical test comparing the decoder types. A scatter plot would make it difficult to include detailed information on the features, which we believe is crucial.

      We further provide statistical evidence for the observation regarding the directionality of cross-modal decoding.

      Reviewer #2 (Recommendations for the authors):

      For achieving more evidence to support modality-agnostic representations in the brain, I suggest more thorough analyses, for example:

      (1) Traditional searchlight RSA using different deep learning models. Through this approach, it might identify different brain areas that are sensitive to different formats of information (visual, text, multimodal); subsequently, compare the decoding performance using these ROIs.

      (2) Build more dissociable decoders for information of different modality formats, if possible. While I do not have a concrete proposal, more targeted decoder designs might better dissociate representational formats (i.e., unimodal vs. modality-agnostic).

      (3) A more detailed exploration of the "qualitative decoding results"--for example, quantitatively examining error types produced by modality-agnostic versus modality-specific decoders--would be informative for clarifying what specific content the decoder captures, potentially providing stronger evidence for modality-agnostic representations.

      Thanks for these suggestions. As the main goal of the paper is to introduce modality-agnostic decoders (which should be clearer from the updated manuscript, see also the general response to reviews), we did not include alternative methods for identifying modality-invariant regions. Nonetheless, we agree that obtaining more in-depth insight into the nature of the recorded representations will require analyses with additional methods such as RSA, comparisons with more targeted decoder designs in terms of their target features, and more in-depth error type analyses. We leave these analyses as promising directions for future work.

      The writing could be further improved in the introduction and, accordingly, the discussion. The authors listed a series of theories about conceptual representations; however, they did not systematically explain the relationships and controversies between them, and it seems that they did not aim to address the issues raised by these theories anyway. Thus, the extraction of core ideas is suggested. The difference between "modality-agnostic" and terms like "modality-independent," "modality-invariant," "abstract," "amodal," or "supramodal," and the necessity for a novel term should be articulated.

      The updated manuscript includes an improved introduction and discussion section that highlight the main focus and contributions of the study.

      We believe that a systematic comparison of theories on conceptual representations involving their relationships and controversies would require a dedicated review paper. Here, we focused on the aspects that are relevant for the study at hand (modality-invariant representations), for which we find that none of the considered theories can be rejected based on our results.

      Regarding the terminology (modality-agnostic vs. modality-invariant, etc.), please refer to the general response.

      The figures also have room to improve. For example, Figures 4 and 5 present dense bar plots comparing multiple decoding settings (e.g., modality-specific vs. modality-agnostic decoders, feature space, within-modal vs. cross-modal, etc.); while comprehensive, they would benefit from clearer labels or separated subplots to aid interpretation. All figures are recommended to be optimized for greater clarity and directness in future revisions.

      Thanks for this remark. We agree that the figures are quite dense in information. However, splitting them up into subplots (e.g. separate subplots for different decoder types) would make it much less straightforward to compare the accuracy scores between conditions. As the main goal of these figures is to compare features and decoder types, we believe that it is useful to keep all information in the same plot. 

      You also suggest improving the clarity of the labels. It is true that the top left legend of Figures 4 and 5 mixed information about decoder type and broad classes of features (vision/language/multimodal). To improve clarity, we updated the figures and clearly separated information on decoder type (the hue of different bars) and features (x-axis labels). The broad classes of features (vision/language/multimodal) are distinguished by alternating light gray background colors and additional labels at the very bottom of the plots.

      The new plots allow for easy performance comparison of the different decoder types and additionally provide information on confidence intervals for the performance of modality-specific decoders, which was not available in the previous figures.

      Reviewer #3 (Recommendations for the authors):

      (1) As discussed in the Public Review, I think the paper would greatly benefit from clearer terminology. Instead of describing the decoders as "modality-agnostic" and "modality-specific", perhaps the authors could describe the decoding conditions based on the train and test modalities (e.g., "image-to-image", "caption-to-image", "multimodal-to-image") or using the terminology from Figure 3 (e.g., "within-modality", "cross-modality", "modality-agnostic").

      We updated our terminology to be clearer and more accurate, as outlined in the general response. The terms modality-agnostic and modality-specific refer to the training conditions, and the test conditions are described in Figure 3 and are used throughout the paper.

      (2) Line 244: I think the multimodal one-back task is an important aspect of the dataset that is worth highlighting. It seems to be a relatively novel paradigm, and it might help ensure that the participants are activating modality-agnostic representations.

      It is true that the multimodal one-back task could play an important role for the activation of modality-invariant representations. Future work could investigate to what degree the presence of widespread modality-invariant representations is dependent on such a paradigm.

      (3) Line 253: Could the authors elaborate on why they chose a random set of training stimuli for each participant? Is it to make the searchlight analyses more robust?

      A random set of training stimuli was chosen in order to maximize the diversity of the training sets, i.e. to avoid bias based on a specific subsample of the COCO dataset. Between-subject comparisons can still be made based on the test set, which was shared across all subjects, with the limitation that performance differences due to individual differences cannot be disentangled from those due to the different training sets. However, the main goal of the data collection was not to make between-subject comparisons based on common training sets, but rather to make group-level analyses based on a large and maximally diverse dataset.

      (4) Figure 4: Could the authors comment more on the patterns of decoding performance in Figure 5? For instance, it is interesting that ResNet is a better target than ViT, and BERT-base is a better target than BERT-large.

      A multitude of factors influence the decoding performance, such as feature dimensionality, model architecture, training data, and training objective(s) (Conwell et al. 2023; Raugel et al. 2025). BERT-base might be better than BERT-large because the extracted features are of lower dimension. ResNet might be better than ViT because of its architecture (CNN vs. Transformer). Investigating these differences in more depth would require further controlled analyses, but this is not the focus of this paper. The main objective of the feature comparison was to provide a broad overview of visual/linguistic/multimodal feature spaces and to identify the most suitable features for modality-agnostic decoding.

      Conwell, C., Prince, J. S., Kay, K. N., Alvarez, G. A., & Konkle, T. (2023). What can 1.8 billion regressions tell us about the pressures shaping high-level visual representation in brains and machines? (p. 2022.03.28.485868). bioRxiv. https://doi.org/10.1101/2022.03.28.485868

      Raugel, J., Szafraniec, M., Vo, H.V., Couprie, C., Labatut, P., Bojanowski, P., Wyart, V. and King, J.R. (2025). Disentangling the Factors of Convergence between Brains and Computer Vision Models. arXiv preprint arXiv:2508.18226.

      (5) Figure 7: It is interesting that the modality-agnostic decoder predictions mostly appear traffic-related. Is there a possibility that the model always produces traffic-related predictions, making it trivially correct for the presented stimuli that are actually traffic-related? It could be helpful to include some examples where the decoder produces other types of predictions to dispel this concern.

      The presented qualitative examples were randomly selected. To make sure that the decoder is not always predicting traffic-related content, we included 5 additional randomly selected examples in Figures 6 and 7 of the updated manuscript. In only one of the 5 new examples did the decoder predict traffic-related content, and in this case the stimulus had actually been traffic-related (a bus).

    1. Author response:

      Reviewer #1:

      Comment 1: The authors use a confusing timeline for their behavioral experiments, i.e., day 1 is the first day of training in the MWM, and day 6 is the probe trial, but in reality, day 6 is the first day after the last training day. So this is really day 1 post-training, and day 20 is 14 days post-training.

      We thank this reviewer for pointing out the issue of the behavioral timeline. We will revise the behavioral timeline as suggested by this reviewer. Days 1–5 will be labeled as “Training phase day 1–5”. Day 6 will be labeled as “Day 1 post-training” and Day 20 will be labeled as “Day 14 post-training”.

      Comment 2: The authors inaccurately use memory as a term. During the training period in the MWM, the animals are learning, while memory is only probed on day 6 (after learning). Thus, day 6 reflects memory consolidation processes after learning has taken place.

      We will revise the manuscript to distinguish between "learning" and "memory." We will refer to the performance during the 5-day training period as "spatial learning" and restrict the term "memory" to the probe tests on Day 6, which reflect memory processes after learning has taken place.

      Comment 3: The NAT10 cKO mice are useful... but all the experiments used AAV-CRE injections in the dorsal hippocampus that showed somewhat modest decreases... For these experiments, it would be better to cross the NAT10 floxed animals to CRE lines where a better knockdown of NAT10 can be achieved, with less variability.

      We want to clarify the reason for using AAV-Cre injection rather than Cre lines. Indeed, we attempted to generate Nat10 conditional knockouts by crossing Nat10<sup>flox/flox</sup> mice with several CNS-specific Cre lines. Crossing with Nestin-Cre and Emx1-Cre resulted in embryonic and premature lethality, respectively, consistent with the essential housekeeping function of NAT10 during neurodevelopment. We are currently using the Camk2α-Cre line, which begins to express Cre after postnatal week 3 specifically in hippocampal pyramidal neurons (Tsien et al., 1996).

      Comment 4: Because knockdown is only modest (~50%), it is not clear if the remaining ac4c on mRNAs is due to remaining NAT10 protein or due to an alternative writer (as the authors pose).

      Our results suggest the existence of alternative writers. As shown in Figure 6D, we identified a population of "NAT10-independent" MISA mRNAs (present in MISA but not downregulated in NASA). Remarkably, these mRNAs possess a consensus motif (RGGGCACTAACY) that is fundamentally different from the canonical NAT10 motif (AGCAGCTG). This distinct motif usage suggests that the residual ac4C signals are not merely due to incomplete knockdown of NAT10, but reflect the activity of other, as-yet-unidentified ac4C writers. Nonetheless, we think that generating a Nat10 knockout line with complete loss of NAT10 protein would be useful to address this reviewer’s concern.
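
      To illustrate how the two consensus motifs differ, the sketch below expands the IUPAC degenerate codes (R = A or G, Y = C or T) into a regular expression and scans a sequence for each motif. It is only an illustration: the helper names and the toy sequence are hypothetical, and it is not the motif-discovery analysis behind Figure 6D. Sequences are assumed to be written in the DNA alphabet (T rather than U), as the motifs are in the text.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(consensus):
    """Compile an IUPAC consensus into a regex that finds overlapping hits."""
    body = "".join(IUPAC[base] for base in consensus.upper())
    # A zero-width lookahead lets finditer report overlapping matches too.
    return re.compile("(?=" + body + ")")

def find_motif(consensus, sequence):
    """Return 0-based start positions of every occurrence of the motif."""
    pattern = iupac_to_regex(consensus)
    return [m.start() for m in pattern.finditer(sequence.upper())]

# Hypothetical toy sequence containing one instance of each motif.
toy = "TTAGGGCACTAACCTTAGCAGCTGAA"
noncanonical_hits = find_motif("RGGGCACTAACY", toy)  # motif of NAT10-independent mRNAs
canonical_hits = find_motif("AGCAGCTG", toy)         # canonical NAT10 motif
```

      Because the two consensus sequences share no common core, a transcript matching one pattern does not match the other by construction, which is the distinction the response relies on.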

      Reviewer #2:

      Comment 1: It is known that synaptosomes are contaminated with glial tissue... So the candidate mRNAs identified by acRIP-seq might also be mixed with glial mRNAs. Are the GO BP terms shown in Figure 3A specifically chosen, or unbiasedly listed for all top ones?

      It is true that some ac4C-mRNAs identified by acRIP-seq from the synaptosomes are highly expressed in astrocytes, such as Aldh1l1, ApoE, Sox9 and Aqp4 (Table S3, Fig. S6H). In agreement, we found that NAT10 was also expressed in astrocytes in addition to neurons. We will show a representative image for the expression of NAT10-Cre in astrocytes in the revised MS. The BP items shown in Fig. 3A were chosen from the top 30 and are highly related to synaptic plasticity and memory. We will show the full list of significant BP items for MISA in the revised MS.

      Comment 2: Where does NAT10-mediated mRNA acetylation take place within cells generally? Is there evidence that NAT10 can catalyze mRNA acetylation in the cytoplasm?

      Previous studies in non-neuronal cells showed that NAT10 can catalyze mRNA acetylation in the cytoplasm and enhance translational efficiency (Arango et al., 2018; Arango et al., 2022). In this study, we showed that mRNA acetylation occurred both in the homogenates and synapses (see ac4C-mRNA lists in Tables S2 and S3). However, spatial memory upregulated mRNA acetylation mainly in the synapses rather than in the homogenates (Fig. 2 and Fig. S2).

      Comment 3: "The NAT10 proteins were significantly reduced in the cytoplasm (S2 fraction) but increased in the PSD fraction..." The small increase in synaptic NAT10 might not be enough to cause a decrease in soma NAT10 protein level.

      We showed that the NAT10 protein levels were increased by one-fold in the PSD fraction, but were reduced by about 50% in the cytoplasm after memory formation (Fig. 5J and K). The protein levels of NAT10 in the homogenates and nucleus were not altered after memory formation (Fig. 5F and I). Based on these findings, we hypothesized that NAT10 proteins may relocate from the cytoplasm to synapses after memory formation, which was also supported by the immunofluorescence results from cultured neurons (Fig. S4). However, we agree with this reviewer that drawing such a conclusion may require time-lapse imaging of NAT10 protein trafficking in living animals, which is technically challenging at this moment.

      Comment 4: It is difficult to separate the effect on mRNA acetylation and protein mRNA acetylation when doing the loss of function of NAT10.

      This is a good point. We agree with this reviewer that NAT10 may acetylate both mRNA and proteins. We examined the acetylation levels of α-tubulin and histone H3, two substrate proteins of NAT10, in the hippocampus of Nat10 cKO mice. As shown in Fig. S5C, E, and F, the acetylation levels of α-tubulin and histone H3 remained unchanged in the Nat10 cKO mice, likely due to compensation by other protein acetyltransferases. In contrast, mRNA ac4C levels were significantly decreased in the Nat10 cKO mice (Figure S5G–H). These results suggest that the memory deficits seen in Nat10 cKO mice may be largely due to impaired mRNA acetylation. Nonetheless, we believe that developing a new technology that enables selective erasure of mRNA acetylation would be helpful to address the function of mRNA acetylation. We discussed these points in the MS (lines 585–592).

      References

      Arango, D., Sturgill, D., Alhusaini, N., Dillman, A. A., Sweet, T. J., Hanson, G., Hosogane, M., Sinclair, W. R., Nanan, K. K., & Mandler, M. D. (2018). Acetylation of cytidine in mRNA promotes translation efficiency. Cell, 175(7), 1872-1886.e1824.

      Arango, D., Sturgill, D., Yang, R., Kanai, T., Bauer, P., Roy, J., Wang, Z., Hosogane, M., Schiffers, S., & Oberdoerffer, S. (2022). Direct epitranscriptomic regulation of mammalian translation initiation through N4-acetylcytidine. Molecular Cell, 82(15), 2797-2814.e2711.

      Tsien, J. Z., Chen, D. F., Gerber, D., Tom, C., Mercer, E. H., Anderson, D. J., Mayford, M., Kandel, E. R., & Tonegawa, S. (1996). Subregion-and cell type–restricted gene knockout in mouse brain. Cell, 87(7), 1317-1326.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      The paper by Chen et al describes the role of neuronal thermo-TRPV3 channels in the firing of cortical neurons at a fever temperature range. The authors began by demonstrating that exposure to infrared light increasing ambient temperature causes body temperature to rise to a fever level above 38°C. Subsequently, they showed that at the fever temperature of 39°C, the spike threshold (ST) increased in both populations (P12-14 and P7-8) of cortical excitatory pyramidal neurons (PNs). However, the spike number only decreased in P7-8 PNs, while it remained stable in P12-14 PNs at 39°C. In addition, the fever temperature also reduced the late peak postsynaptic potential (PSP) in P12-14 PNs. The authors further characterized the firing properties of cortical P12-14 PNs, identifying two types: STAY PNs that retained spiking at 30°C, 36°C, and 39°C, and STOP PNs that stopped spiking upon temperature change. They further extended their analysis and characterization to striatal medium spiny neurons (MSNs) and found that STAY MSNs and PNs shared the same ST temperature sensitivity. Using small molecule tools, they further identified that thermo-TRPV3 currents in cortical PNs increased in response to temperature elevation, but not TRPV4 currents. The authors concluded that during fever, neuronal firing stability is largely maintained by sensory STAY PNs and MSNs that express functional TRPV3 channels. Overall, this study is well designed and executed with substantial controls, some interesting findings, and quality of data. Here are some specific comments:

      (1) Could the authors discuss, or is there any evidence of, changes in TRPV3 expression levels in the brain during the postnatal 1-4 week age range in mice?

      This is an excellent question. To our knowledge, no published studies have documented changes in TRPV3 expression in the mouse brain during the first to fourth postnatal weeks. Research on TRPV3 expression has primarily relied on RT-PCR analysis of RNA from dissociated adult brain tissue (Jang et al., 2012; Kumar et al., 2018), largely due to the limited availability of effective antibodies for brain sections at the time. Furthermore, the Allen Brain Atlas does not provide data on TRPV3 expression in the developing or postnatal brain. To address this gap, we performed immunohistochemistry to examine TRPV3 expression at P7, P14, and P21 (Figure 7). To confirm specificity, the TRPV3 antibody was co-incubated with a TRPV3 blocker (Figure 7A, top row, right panel). While immunohistochemistry is semiquantitative, we observed a trend toward increased TRPV3 expression in the cortex, striatum, hippocampus, and thalamus from P7 to P14.

      (2) Are there any differential differences in TRPV3 expression patterns that could explain the different firing properties in response to fever temperature between the STAY- and STOP neurons?

      This is another excellent question, and we plan to explore it in the future by developing reporter mice for TRPV3 expression and viral tools that leverage endogenous TRPV3 promoters to drive a fluorescent protein, enabling monitoring of cells with native TRPV3 expression. To our knowledge, such tools do not currently exist. Creating them will be challenging, as it requires identifying promoters that accurately reflect endogenous TRPV3 expression.

      We have not yet quantified TRPV3 expression in STOP and STAY neurons. However, our analysis of evoked spiking at 30, 36, and 39 °C suggests that TRPV3 may mark a population of cortical pyramidal neurons that tend to remain active (“STAY”) as temperatures increase. While we have not directly compared TRPV3 expression between STAY and STOP neurons at fever-range temperatures, intracellular blockade of TRPV3 with forsythoside B (50 µM) significantly reduced the proportion of STAY neurons (Figure 9B). Consistently, spiking was also significantly reduced in Trpv3⁻/⁻ mice (Figure 10D).

      In our immunohistochemical analysis, TRPV3 was detected in L4 barrels and in L2/3, where we observed a patchy distribution with some regions showing more intense staining (Figure 7B). It is possible that cells with higher TRPV3 levels correspond to STAY neurons, while those with lower levels correspond to STOP neurons. As we develop tools to monitor activity based on endogenous TRPV3 levels, we anticipate gaining deeper insight into this relationship.

      (3) TRPV3 and TRPV4 can co-assemble to form heterotetrameric channels with distinct functional properties. Do STOP neurons exhibit any firing behaviors that could be attributed to the variable TRPV3/4 assembly ratio?

      There is some evidence that TRPV3 and TRPV4 proteins can physically associate in HEK293 cells and native skin tissues (Hu et al., 2022). TRPV3 and TRPV4 are both expressed in the cortex (Kumar et al., 2018), but it remains unclear whether they are co-expressed and co-assembled to form heteromeric channels in cortical excitatory pyramidal neurons. Examination of the I-V curve from HEK cells co-expressing TRPV3/4 heteromeric channels shows enhanced current at negative membrane potentials (Hu et al., 2022).

      Currently, we cannot characterize cells as STOP or STAY and measure TRPV3 or TRPV4 currents simultaneously, as this would require different experimental setups and internal solutions. Additionally, the protocol involves a sequence of recordings at 30, 36, and 39°C, followed by cooling back to 30°C and re-heating to each temperature. Cells undergoing such a protocol are unlikely to survive until its end.

      In our recordings of TRPV3 currents, which likely include both STOP and STAY cells, we do not observe a significant current at negative voltages, suggesting that TRPV3/4 heteromeric channels may either be absent or underrepresented, at least at a 1:1 ratio. However, the possibility that TRPV3/4 heteromeric channels could define the STOP cell population is intriguing and plausible.

      (4) In Figure 7, have the authors observed an increase of TRPV3 currents in MSNs in response to temperature elevation?

      We have not recorded TRPV3 currents in MSNs in response to elevated temperatures. Please note that the handling editor gave us the option to remove these data from the paper, and we elected to do so to develop them as a separate manuscript.

      (5) Is there any evidence of a relationship between TRPV3 expression levels in D2+ MSNs and degeneration of dopamine-producing neurons?

      This is an interesting question, though it falls outside our current research focus in the lab. A PubMed search yields no results connecting the terms TRPV3, MSNs, and degeneration. However, gain-of-function mutations in TRPV4 channel activity have been implicated in motor neuron degeneration (Sullivan et al., 2024) and axon degeneration (Woolums et al., 2020). Similarly, TRPV1 activation has been linked to developmental axon degeneration (Johnstone et al., 2019), while TRPV3 blockade has shown neuroprotective effects in models of cerebral ischemia/reperfusion injury in mice (Chen et al., 2022).

      The link between TRPV activation and cell degeneration, however, may not be straightforward. For instance, TRPV1 loss has been shown to accelerate stress-induced degradation of axonal transport from retinal ganglion cells to the superior colliculus and to cause degeneration of axons in the optic nerve (Ward et al., 2014). Meanwhile, TRPV1 activation by capsaicin preserves the survival and function of nigrostriatal dopamine neurons in the MPTP mouse model of Parkinson's disease (Chung et al., 2017).

      (6) Does fever range temperature alter the expressions of other neuronal Kv channels known to regulate the firing threshold?

      This is an active line of investigation in our lab. The results of ongoing experiments will provide further insight into this question.

      Reviewer #2 (Public review):

      Summary:

      The authors study the excitability of layer 2/3 pyramidal neurons in response to layer four stimulation at temperatures ranging from 30 to 39 Celsius in P7-8, P12-P14, and P22-P24 animals. They also measure brain temperature and spiking in vivo in response to externally applied heat. Some pyramidal neurons continue to fire action potentials in response to stimulation at 39 C and are called stay neurons. Stay neurons have unique properties aided by TRPV3 channel expression.

      Strengths:

      The authors use various techniques and assemble large amounts of data.

      Weaknesses:

      (1) No hyperthermia-induced seizures were recorded in the study.

      The goal of this manuscript is to uncover age-related physiological changes that enable the brain to maintain function at fever-range temperatures, typically 38–40°C. Febrile seizures in humans are also typically induced within this temperature range. Given this context, we initially did not examine hyperthermia-induced seizures. However, as requested, we assessed the effects of reduced Trpv3 expression on hyperthermia-induced seizures in WT (Trpv3<sup>+/+</sup>), heterozygous (Trpv3<sup>+/-</sup>), and homozygous knockout (Trpv3<sup>-/-</sup>) P12 pups. Please see Figure 10.

      While T<sub>b</sub> at seizure onset and the rate of T<sub>b</sub> increase leading to seizure were not significantly different among genotypes, the time to seizure from the point of loss of postural control (LPC), defined as collapse and failure to maintain upright posture, was significantly longer in Trpv3<sup>+/-</sup> and Trpv3<sup>-/-</sup> mice. Together, these results indicate that reduced TRPV3 function enhances resistance to seizure initiation and/or propagation under febrile conditions, likely by decreasing neuronal depolarization and excitability.

      (2) Febrile seizures in humans are age-specific, extending from 6 months to 6 years. While translating to rodents is challenging, according to published literature (see Baram), rodents aged P11-16 experience seizures upon exposure to hyperthermia. The rationale for publishing data on P7-8 and P22-24 animals, which are outside this age window, must be clearly explained to address a potential weakness in the study.

      As requested, we have added an explanation in the “Introduction” for our rationale in including age ranges that flank the period of susceptibility to hyperthermia-induced seizures (see lines 80–100). In summary, we emphasize that this design provides negative controls, allowing us to determine whether the changes observed in the P12–14 window are specific to this developmental period.

      (3) Authors evoked responses from layer 4 and recorded postsynaptic potentials, which then caused action potentials in layer 2/3 neurons in the current clamp. The post-synaptic potentials are exquisitely temperature-sensitive, as the authors demonstrate in Figures 3 B and 7D. Note markedly altered decay of synaptic potentials with rising temperature in these traces. The altered decays will likely change the activation and inactivation of voltage-gated ion channels, adjusting the action potential threshold.

      The activation and inactivation of voltage-gated ion channels can modulate action potential threshold. Indeed, we have identified channels that contribute to the temperature-induced increase in spike threshold, including BK channels and Scn2a. However, Figure 4B shows a cell with no inhibition at 39°C, which explains the observed loss of the late postsynaptic potential (PSP). This loss of inhibition is the primary contributor to the prolonged decay of the synaptic potentials. By contrast, cells in which inhibition is retained do not exhibit such extended decay when exposed to the same thermal protocol.

      (4) The data weakly supports the claim that the E-I balance is unchanged at higher temperatures. Synaptic transmission is exquisitely temperature-sensitive due to the many proteins and enzymes involved. A comprehensive analysis of spontaneous synaptic current amplitude, decay, and frequency is crucial to fully understand the effects of temperature on synaptic transmission.

      We did not intend to imply that E-I balance is generally unchanged at higher temperatures. Our statements specifically referred to observations in experiments conducted during the P20–26 age range in cortical pyramidal neurons. We are conducting a parallel line of investigation examining the differential susceptibility of E-I balance across age and temperature, and we have observed age- and temperature-dependent effects. Recognizing that our earlier wording may have been misleading, we have removed this statement from the manuscript.

      (5) It is unclear how the temperature sensitivity of medium spiny neurons is relevant to febrile seizures. Furthermore, the most relevant neurons are hippocampal neurons since the best evidence from human and rodent studies is that febrile seizures involve the hippocampus.

      Thank you for the opportunity to provide clarification. The goal of this manuscript is to uncover age-related physiological changes that enable the brain to maintain stable, non-excessive neuronal firing at fever-range temperatures (typically 38–40°C). We hypothesize that these changes are a normal part of brain development, potentially explaining why most children do not experience febrile seizures. By understanding these mechanisms, we may identify points in the process that are susceptible to dysfunction, due to genetic mutations, developmental delays, or environmental factors, which could provide insight into the rare cases when seizures occur between 2–5 years of age.

      Our aim was not to establish a link between medium spiny neuron (MSN) function and febrile seizures. MSNs were included in this study as a mechanistic comparison because they represent a non-pyramidal, non-excitatory neuronal subtype, allowing us to assess whether the physiological changes observed in L2/3 excitatory pyramidal neurons are unique to these cells. Please note that the handling editor gave us the option to remove these data from the manuscript, and we chose to do so, developing these findings into a separate manuscript.

      (6) TRPV3 data would be convincing if the knockout animals did not have febrile seizures.

      We find that approximately equal numbers of excitatory neurons either start or stop firing at fever-range temperatures (typically 38–40 °C). Neurons that continue to fire (“STAY” cells) thus play a key role in maintaining stable, non-excessive network activity. While future studies will examine the mechanisms driving some neurons to initiate spiking, our findings suggest that a reduction in the number of STAY cells could influence more subtle aspects of seizure dynamics, such as time to onset, by decreasing overall network excitability. We assessed the effects of reduced Trpv3 expression on hyperthermia-induced seizures in WT (Trpv3<sup>+/+</sup>), heterozygous (Trpv3<sup>+/-</sup>), and homozygous knockout (Trpv3<sup>-/-</sup>) P12 pups. As you stated, these mice have hyperthermic seizures. However, we noted that the time to seizure from the point of loss of postural control (LPC), defined as collapse and failure to maintain upright posture, was significantly longer in Trpv3<sup>+/-</sup> and Trpv3<sup>-/-</sup> mice. Normally, seizures happen shortly after this point, but notably, Trpv3<sup>-/-</sup> mice took twice as long to reach seizure onset compared with wildtype mice. In an epileptic patient, this increased time may be sufficient for a caretaker to move the patient to a safer location, reducing the risk of injury during the seizure.

      Consistent with findings that TRPV3 blockade using 50 µM forsythoside B reduces spiking in cortical L2/3 pyramidal neurons, we observed significantly reduced spiking in Trpv3<sup>-/-</sup> mice as well (Figure 10D). Analysis of postsynaptic potentials in these neurons showed that, in WT mice, PSP amplitude increased with temperature elevation into the febrile range, whereas this temperature-dependent depolarization was absent in Trpv3<sup>-/-</sup> mice (Figure 10E). Together, these results indicate that reduced TRPV3 function enhances resistance to seizure initiation and/or propagation under febrile conditions, likely by decreasing neuronal depolarization and excitability.

      Reviewer #3 (Public review):

      Summary:

      This important study combines in vitro and in vivo recording to determine how the firing of cortical and striatal neurons changes during a fever-range temperature rise (37–40 °C). The authors found that certain neurons will start, stop, or maintain firing during these body temperature changes. The authors further suggested that the TRPV3 channel plays a role in maintaining cortical activity during fever.

      Strengths:

      The topic of how the firing pattern of neurons changes during fever is unique and interesting. The authors carefully used in vitro electrophysiology assays to study this interesting topic.

      Weaknesses:

      (1) In vivo recording is a strength of this study. However, data from in vivo recording is only shown in Figures 5A,B. This reviewer suggests the authors further expand on the analysis of the in vivo Neuropixels recording. For example, to show single spike waveforms and raster plots to provide more information on the recording. The authors can also separate the recording based on brain regions (cortex vs striatum) using the depth of the probe as a landmark to study the specific firing of cortical neurons and striatal neurons. It is also possible to use published parameters to separate the recording based on spike waveform to identify regular principal neurons vs fast-spiking interneurons. Since the authors studied E/I balance in brain slices, it would be very interesting to see whether the "E/I balance" based on the firing of excitatory neurons vs fast-spiking interneurons might be changed or not in the in vivo condition.

      As requested, we have included additional analyses and figures related to the in vivo recording experiments in Figure 5. Specifically, we added examples of multiunit and single-spike waveforms, as well as autocorrelation histograms (ACHs). ACHs were used because raster plots of individual single units would not be very informative given the long recording period. Figure 5F is likewise intended to replace raster plots, as it helps to track changes in the firing rate of single neurons over time.

      Additionally, all recordings were conducted in the cortex at a depth of ~1 mm from the surface, and no recordings were performed in the striatum. Based on the reviewing editor’s suggestions, we decided to remove the striatal data from the manuscript and develop this aspect of the project for a separate publication.

      Lastly, we used published parameters to classify recordings based on spike waveform into putative regular principal neurons and interneurons. To clarify this point, the descriptions that were previously given only in the “Methods” section now also appear in the “Results” section.

      The paragraph below from the methods section describes this procedure.

      “Following manual curation, based on their spike waveform duration, the selected single units (n = 633) were separated into putative inhibitory interneurons and excitatory principal cells (Barthó et al., 2004). The spike duration was calculated as the time difference between the trough and the subsequent waveform peak of the mean filtered (300–6000 Hz bandpassed) spike waveform. Durations of extracellularly recorded spikes showed a bimodal distribution (Hartigan’s dip test; p < 0.001) characteristic of the neocortex with shorter durations corresponding to putative interneurons (narrow spikes) and longer durations to putative principal cells (wide spikes). Next, k-means clustering was used to separate the single units into these two groups, which resulted in 140 interneurons (spike duration < 0.6 ms) and 493 principal cells (spike duration > 0.6 ms), corresponding to a typical 22%–78% (interneuron–principal) cell ratio”.
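
      For illustration only (this is not the authors' analysis code), the waveform-based classification quoted above can be sketched in Python: measure the trough-to-peak duration of a mean waveform, then split the durations into two groups with a one-dimensional k-means. The function names, the 30 kHz sampling rate used in the usage note, and the synthetic duration distributions are all assumptions made for this sketch.

```python
import random
import statistics

def trough_to_peak_ms(waveform, sampling_rate_hz):
    """Time from the waveform trough to the subsequent peak, in milliseconds."""
    trough_idx = min(range(len(waveform)), key=lambda i: waveform[i])
    tail = waveform[trough_idx:]
    peak_offset = max(range(len(tail)), key=lambda i: tail[i])
    return 1000.0 * peak_offset / sampling_rate_hz

def kmeans_two_groups(values, max_iter=100):
    """1-D k-means with k=2. Label 0 is the cluster with the smaller center."""
    centers = [min(values), max(values)]  # deterministic initialisation
    labels = []
    for _ in range(max_iter):
        # Assign each value to the nearer of the two centers.
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        # Recompute each center as the mean of its members.
        new_centers = []
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            new_centers.append(statistics.fmean(members) if members else centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Synthetic spike durations (ms); the means and spreads are assumptions,
# chosen only to give a bimodal distribution with 140 narrow and 493 wide units.
rng = random.Random(0)
durations = ([rng.gauss(0.35, 0.05) for _ in range(140)] +
             [rng.gauss(0.90, 0.08) for _ in range(493)])
labels, centers = kmeans_two_groups(durations)
n_narrow = labels.count(0)
print(f"putative interneurons: {n_narrow} / {len(labels)}")
```

      With real data, `trough_to_peak_ms` would be applied to each unit's mean filtered waveform (for example at an assumed 30 kHz sampling rate) before clustering. A bimodality check such as Hartigan's dip test is not included here.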

      As suggested, we calculated the E/I balance using the average firing rates of excitatory and inhibitory neurons in the in vivo condition. Our analysis revealed that the E/I balance remained unchanged (see Author response image 1). Nonetheless, following the option provided by the reviewing editor, we have chosen to remove the statement referencing E/I balance from the manuscript.

      Author response image 1.

      (2) The author should propose a potential mechanism for how TRPV3 helps to maintain cortical activity during fever. Would calcium influx-mediated change of membrane potential be the possible reason? Making a summary figure to put all the findings into perspective and propose a possible mechanism would also be appreciated.

      Thank you for your helpful suggestion. In response, we have included a summary figure (Figure 11) illustrating the hypothesis described in the Discussion section. We agree with your assessment that Trpv3 most likely contributes to maintaining cortical activity during fever by promoting calcium influx and depolarizing the membrane potential.

      (3) The author studied P7-8, P12-14, and P20-26 mice. How do these ages correspond to the human ages? It would be nice to provide a comparison to help the reader understand the context better.

      Ideally, the mouse to human age comparison should depend on the specific process being studied. Per your suggestion, we have added additional references in the Introduction (Dobbing and Sands, 1973; Baram et al., 1997; Bender et al., 2004) to help readers better understand the correspondence between mouse and human ages.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (3) Perform I-F curves to study the intrinsic properties of layer 2/3 neurons without the confound of evoked responses.

      We performed F-I curve analyses (Figures 2H–I), as suggested by Reviewer 2, to study intrinsic properties of L2/3 neurons without evoked responses. Although rheobase increased at 39 °C compared to 30 °C, consistent with findings such as depolarized spike threshold and reduced input resistance, the mean number of spikes across current steps did not differ.
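
      As a hedged illustration of the two quantities compared above, the sketch below computes rheobase (taken here as the smallest current step that elicits at least one spike) and the mean spike count across steps. The function names and all numbers are made up for the example; they are not data from the manuscript.

```python
def rheobase(current_steps_pa, spike_counts):
    """Smallest injected current (pA) that elicits at least one spike, or None."""
    for current, n_spikes in sorted(zip(current_steps_pa, spike_counts)):
        if n_spikes > 0:
            return current
    return None

def mean_spikes(spike_counts):
    """Mean number of spikes across all current steps."""
    return sum(spike_counts) / len(spike_counts)

# Made-up illustrative numbers, not measurements from the study.
steps_pa = [0, 50, 100, 150, 200, 250]
counts_30c = [0, 1, 3, 5, 7, 8]
counts_39c = [0, 0, 2, 6, 8, 8]

print(rheobase(steps_pa, counts_30c), mean_spikes(counts_30c))
print(rheobase(steps_pa, counts_39c), mean_spikes(counts_39c))
```

      The example is constructed so that rheobase shifts to a higher current step while the mean spike count stays the same, which is the kind of dissociation the response describes.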

      Reviewer #3 (Recommendations for the authors):

      Some statistical descriptions are not clearly stated. For example, what statistical methods were used in Fig 2E? The effect size in Fig 2D seems to be quite small. The authors are advised to consider "nested analysis" to further increase the rigor of the analysis. Does each dot mean one neuron? Some of the data points might not be totally independent. The author should carefully check all figures to make sure the stats methods are provided for each panel.

      We apologize for not including statistical details in Figure 2E. We have now added this information and verified that statistical descriptions are provided in all figure legends. In Figure 2D, each dot represents a cell, with measurements taken from the same cell at 30°C, 36°C, and 39°C. Given this design, the appropriate test is a one-way repeated-measures ANOVA.
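
      For readers unfamiliar with the test, a minimal sketch of the one-way repeated-measures ANOVA F statistic is given below, assuming one row per cell and one column per temperature. It is not the authors' code, the function name is hypothetical, and the example values are invented. It does not test sphericity, and a p-value would additionally require the F distribution with the returned degrees of freedom (for instance via SciPy's `scipy.stats.f.sf`).

```python
def rm_anova_oneway(data):
    """One-way repeated-measures ANOVA.

    data: list of rows, one row per cell, one column per condition.
    Returns (F, df_condition, df_error).
    """
    n = len(data)        # number of cells (subjects)
    k = len(data[0])     # number of conditions (e.g. temperatures)
    grand = sum(sum(row) for row in data) / (n * k)
    cond_means = [sum(row[j] for row in data) / n for j in range(k)]
    subj_means = [sum(row) / k for row in data]
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    # Between-cell variability is removed from the error term.
    ss_error = ss_total - ss_cond - ss_subj
    df_cond, df_error = k - 1, (k - 1) * (n - 1)
    f_stat = (ss_cond / df_cond) / (ss_error / df_error)
    return f_stat, df_cond, df_error

# Invented values: three cells, each measured under three conditions.
example = [[1, 2, 3],
           [2, 3, 4],
           [3, 4, 6]]
f_stat, df_cond, df_error = rm_anova_oneway(example)
print(f_stat, df_cond, df_error)
```

      Because each cell contributes a measurement at every temperature, subtracting the between-cell sum of squares is what distinguishes this from an ordinary one-way ANOVA on independent groups.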

    1. Author response:

      A major point all three reviewers raise is that the ‘human-AI collaboration’ in our experiment may not be true collaboration (as the AI does not classify images per se), but is only implied. The reviewers pointed out that whether participants were genuinely engaged in our experimental task is currently not sufficiently addressed. We plan to address this issue in the revised manuscript by including results from a brief interview we conducted with each participant after the experiment, which asked about the participant’s experience and decision-making processes while performing the task. We also measured the participants’ propensity to trust in AI via a questionnaire before and after the experiment. The questionnaire and interview results will allow us to describe the involvement of our participants in the task more accurately. In addition, we will conduct further analyses of the behavioural data (e.g., response times) to show that participants genuinely completed the experimental task. Finally, we will work to sharpen our language and conclusions in the revised manuscript, following the reviewers’ recommendations.

      Reviewer #1:

      Summary:

      In the study by Roeder and colleagues, the authors aim to identify the psychophysiological markers of trust during the evaluation of matching or mismatching AI decision-making. Specifically, they aim to characterize through brain activity how the decision made by an AI can be monitored throughout time in a two-step decision-making task. The objective of this study is to unfold, through continuous brain activity recording, the general information processing sequence while interacting with an artificial agent, and how internal as well as external information interact and modify this processing. Additionally, the authors provide a subset of factors affecting this information processing for both decisions.

      Strengths:

      The study addresses a wide and important topic of the value attributed to AI decisions and their impact on our own confidence in decision-making. It especially questions some of the factors modulating the dynamical adaptation of trust in AI decisions. Factors such as perceived reliability, type of image, mismatch, or participants' bias toward one response or the other are very relevant to the question in human-AI interactions.

      Interestingly, the authors also question the processing of more ambiguous stimuli, with no real ground truth. This gets closer to everyday life situations where people have to make decisions in uncertain environments. Having a better understanding of how those decisions are made is very relevant in many domains.

      Also, the method for processing behavioural and especially EEG data is overall very robust and is what is currently recommended for statistical analyses for group studies. Additionally, authors provide complete figures with all robustness evaluation information. The results and statistics are very detailed. This promotes confidence, but also replicability of results.

      An additional interesting method aspect is that it is addressing a large window of analysis and the interaction between three timeframes (evidence accumulation pre-decision, decision-making, post-AI decision processing) within the same trials. This type of analysis is quite innovative in the sense that it is not yet a standard in complex experimental designs. It moves forward from classical short-time windows and baseline ERP analysis.

      We appreciate the constructive appraisal of our work.

      Weaknesses:

      R1.1. This manuscript raises several conceptual and theoretical considerations that are not necessarily answered by the methods (especially the task) used. Even though the authors propose to assess trust dynamics and violations in cooperative human-AI teaming decision-making, I don't believe their task resolves such a question. Indeed, there is no direct link between the human decision and the AI decision. They do not cooperate per se, and the AI decision doesn't seem, from what I understood to have an impact on the participants' decision making. The authors make several assumptions regarding trust, feedback, response expectation, and "classification" (i.e., match vs. mismatch) which seem far stretched when considering the scientific literature on these topics.

      This issue is raised by the other reviewers as well. The reviewer is correct that the AI does not classify images; rather, the AI response depends on the participant’s choice (it agrees in 75% of trials and disagrees in 25% of trials). Importantly, though, participants were briefed before and during the experiment that the AI is doing its own independent image classification and that human input is needed to assess how well the AI image classification works. That is, participants were led to believe in a genuine, independent AI image classifier in this experiment.
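
      To make the yoked design concrete, here is a minimal simulation sketch. It assumes that agreement is drawn independently on each trial with probability 0.75 and that the participant responds at chance; the function names and the "real"/"fake" labels are hypothetical and not taken from the experiment code.

```python
import random

def yoked_ai_response(participant_choice, rng, p_agree=0.75):
    """Return the AI response: a copy of the participant's choice with
    probability p_agree, otherwise the opposite label."""
    if rng.random() < p_agree:
        return participant_choice
    return "fake" if participant_choice == "real" else "real"

def simulate(n_trials=10000, seed=0):
    """Return (agreement rate, AI accuracy against ground truth)."""
    rng = random.Random(seed)
    n_agree = 0
    n_ai_correct = 0
    for _ in range(n_trials):
        truth = rng.choice(["real", "fake"])
        choice = rng.choice(["real", "fake"])  # chance-level participant (assumption)
        ai = yoked_ai_response(choice, rng)
        n_agree += ai == choice
        n_ai_correct += ai == truth
    return n_agree / n_trials, n_ai_correct / n_trials

agree_rate, ai_accuracy = simulate()
print(f"agreement: {agree_rate:.3f}, AI accuracy vs. truth: {ai_accuracy:.3f}")
```

      Under these assumptions the agreement rate sits near 0.75 by construction, while the AI response is correct about half the time, so a disagreement carries no information about whether the image is real or fake.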

      Moreover, the images we presented in the experiment were taken from previous work by Nightingale & Farid (2022). This image dataset includes ‘fake’ (AI generated) images that are indistinguishable from real images.

      What matters most for our work is that the participants were truly engaging in the experimental task; that is, they were genuinely judging face images, and they were genuinely evaluating the AI feedback. There is strong indication that this was indeed the case. We conducted and recorded brief interviews after the experiment, asking our participants about their experience and decision-making processes. The questions are as follows:

      (1) How did you make the judgements about the images?

      (2) How confident were you about your judgement?

      (3) What did you feel when you saw the AI response?

      (4) Did that change during the trials?

      (5) Who do you think was correct?

      (6) Did you feel surprised at any of the AI responses?

      (7) How did you judge what to put for the reliability sliders?

      In our revised manuscript we will conduct additional analyses to provide detail on participants’ engagement in the task; both in the judging of the AI faces, as well as in considering the AI feedback. In addition, we will investigate the EEG signal and response time to check for effects that carry over between trials. We will also frame our findings more carefully taking scientific literature into account.

      Nightingale SJ, and Farid H. "AI-synthesized faces are indistinguishable from real faces and more trustworthy." Proceedings of the National Academy of Sciences 119.8 (2022): e2120481119.

      R1.2. Unlike what is done for the data processing, the authors have not managed to take the big picture of the theoretical implications of their results. A big part of this study's interpretation aims to have their results fit into the theoretical box of the neural markers of performance monitoring.

      We indeed used primarily the theoretical box of performance monitoring and predictive coding, since the make-up of our task is similar to a more classical EEG oddball paradigm. In our revised manuscript, we will re-frame and address the link of our findings with the theoretical framework of evidence accumulation and decision confidence.

      R1.3. Overall, the analysis method was very robust and well-managed, but the experimental task they have set up does not allow to support their claim. Here, they seem to be assessing the impact of a mismatch between two independent decisions.

      Although the human and AI decisions are independent in the current experiment, the EEG results still shed light on the participant’s neural processes, as long as the participant considers the AI’s decision and believes it to be genuine. An experiment in which both decisions carry effective consequences for the task and the human-AI cooperation would be an interesting follow-up study.

      Nevertheless, this type of work is very important to various communities. First, it addresses topical concerns associated with the introduction of AI in our daily life and decisions, but it also addresses methodological difficulties that the EEG community has been having to move slowly away from the static event-based short-timeframe analyses onto a more dynamic evaluation of the unfolding of cognitive processes and their interactions. The topic of trust toward AI in cooperative decision making has also been raised by many communities, and understanding the dynamics of trust, as well as the factors modulating it, is of concern to many high-risk environments, or even everyday life contexts. Policy makers are especially interested in this kind of research output.

      Reviewer #2:

      Summary:

      The authors investigated how "AI-agent" feedback is perceived in an ambiguous classification task, and categorised the neural responses to this. They asked participants to classify real or fake faces, and presented an AI-agent's feedback afterwards, where the AI-feedback disagreed with the participants' response on a random 25% of trials (called mismatches). Pre-response ERP was sensitive to participants' classification as real or fake, while ERPs after the AI-feedback were sensitive to AI-mismatches, with stronger N2 and P3a&b components. There was an interaction of these effects, with mismatches after a "Fake" response affecting the N2 and those after "Real" responses affecting P3a&b. The ERPs were also sensitive to the participants' response biases, and their subjective ratings of the AI agent's reliability.

      Strengths:

      The researchers address an interesting question, and extend the AI-feedback paradigm to ambiguous tasks without veridical feedback, which is closer to many real-world tasks. The in-depth analysis of ERPs provides a detailed categorisation of several ERPs, as well as whole-brain responses, to AI-feedback, and how this interacts with internal beliefs, response biases, and trust in the AI-agent.

      We thank the reviewer for their time in reading and reviewing our manuscript.

      Weaknesses:

      R2.1. There is little discussion of how the poor performance (close to 50% chance) may have affected performance on the task, such as by leading to entirely random guessing or overreliance on response biases. This can change how error-monitoring signals presented, as they are affected by participants' accuracy, as well as affecting how the AI feedback is perceived.

      The images were chosen from a previous study (Nightingale & Farid, 2022, PNAS) that looked specifically at performance accuracy and also found levels around 50%. Hence, ‘fake’ and ‘real’ images are indistinguishable in this image dataset. Our findings agree with the original study.

      Judging from the brief interviews after the experiment (see answer to R1.1), all participants were actively and genuinely engaged in the task; hence, it is unlikely that they pressed buttons at random. As mentioned above, we will include a formal analysis of the interviews in the revised manuscript.

      The response bias might indeed play a role in how participants responded, and this might be related to their initial propensity to trust in AI. We have questionnaire data available that might shed light on this issue: before and after the experiment, all participants answered the following questions with a 5-point Likert scale ranging from ‘Not True’ to ‘Completely True’:

      (1) Generally, I trust AI.

      (2) AI helps me solve many problems.

      (3) I think it's a good idea to rely on AI for help.

      (4) I don't trust the information I get from AI.

      (5) AI is reliable.

      (6) I rely on AI.

      The propensity to trust questionnaire is adapted from Jessup SA, Schneider T R, Alarcon GM, Ryan TJ, & Capiola A. (2019). The measurement of the propensity to trust automation. International Conference on Human-Computer Interaction.

      Our initial analyses did not find a strong link between the initial (before the experiment) responses to these questions, and how images were rated during the experiment. We will re-visit this analysis and add the results to the revised manuscript.

      Regarding how error-monitoring (or the equivalent thereof in our experiment) is perceived, we will analyse interview questions 3 (“What did you feel when you saw the AI response”) and 6 (“Did you feel surprised at any of the AI responses”) and add results to the revised manuscript.

      The task design and performance make it hard to assess how much it was truly measuring "trust" in an AI agent's feedback. The AI-feedback is yoked to the participants' performance, agreeing on 75% of trials and disagreeing on 25% (randomly), which is an important difference from the framing provided of human-AI partnerships, where AI-agents usually act independently from the humans and thus disagreements offer information about the human's own performance. In this task, disagreements are uninformative, and coupled with the at-chance performance on an ambiguous task, it is not clear how participants should be interpreting disagreements, and whether they treat it like receiving feedback about the accuracy of their choices, or whether they realise it is uninformative. Much greater discussion and justification are needed about the behaviour in the task, how participants did/should treat the feedback, and how these affect the trust/reliability ratings, as these are all central to the claims of the paper.

      In our experiment, the AI disagreements are indeed uninformative for the purpose of making a correct judgment (that is, correctly classifying images as real or fake). However, given that the AI-generated faces are so realistic and indistinguishable from the real faces, the correctness of the judgement is not the main experimental factor in this study. We argue that, provided participants were genuinely engaged in the task, their judgment accuracy is less important than their internal experience when the goal is to examine processes occurring within the participants themselves. We briefed our participants as follows before the experiment:

      “Technology can now create hyper-realistic images of people that do not exist. We are interested in your view on how well our AI system performs at identifying whether images of people’s faces are real or fake (computer-generated). Human input is needed to determine when a face looks real or fake. You will be asked to rate images as real or fake. The AI system will also independently rate the images. You will rate how reliable the AI is several times throughout the experiment.”

      We plan to expand on the behavioural aspect and our participants’ experience more fully in the revised manuscript by reporting the brief post-experiment interview (R1.1), the propensity to trust questionnaire (R2.1), and additional analyses of the response times.

      There are a lot of EEG results presented here, including whole-brain and window-free analyses, so greater clarity on which results were a priori hypothesised should be given, along with details on how electrodes were selected for ERPs and follow-up tests.

      We selected the electrodes primarily to maintain consistency across our findings and figures, and focused on central electrodes (Pz and Fz), provided they fell within the reported cluster. In the revised manuscript, we will also report the electrodes showing the maximal statistical effects to give a more complete and descriptive overview. Additionally, we will report where we expected specific ERP components to appear. In brief, we expected to see a P3 component post AI feedback, and a pre-response signal corresponding to the CPP. Beyond these expectations, the remaining analyses were more exploratory. Although we tentatively expected bias to relate to the CPP and reliability ratings to the P3, our results showed the opposite pattern. We will clarify this in the revised version of the manuscript.

      Reviewer #3:

      The current paper investigates neural correlates of trust development in human-AI interaction, looking at EEG signatures locked to the moment that AI advice is presented. The key finding is that both human-response-locked EEG signatures (the CPP) and post-AI-advice signatures (N2, P3) are modulated by trust ratings. The study is interesting, however, it does have some clear and sometimes problematic weaknesses:

      (1) The authors did not include "AI-advice". Instead, a manikin turned green or blue, which was framed as AI advice. It is unclear whether participants viewed this as actual AI advice.

      This point has been raised by the other reviewers as well, and we refer to the answers under R1.1 and R2.1. We will address this concern by analysing the post-experiment interviews. In particular, questions 3 (“What did you feel when you saw the AI response”), 4 (“Did that change during the trials?”) and 6 (“Did you feel surprised at any of the AI responses”) will give critical insight. As stated above, our general impression from conducting the interviews is that all participants considered the robot icon to be a decision from an independent AI agent.

      (2) The authors did not include a "non-AI" control condition in their experiment, such that we cannot know how specific all of these effects are to AI, or just generic uncertain feedback processing.

      In the conceptualization phase of this study, we indeed considered different control conditions for our experiment to contrast different kinds of feedback. However, previous EEG studies on performance monitoring ERPs have reported similar results for human and machine supervision (Somon et al., 2019; de Visser et al., 2018). We therefore decided to focus on one aspect (the judgement of observation of an AI classification), also to prevent the experiment from taking too long and risking that participants would lose concentration and motivation to complete the experiment. Comparing AI vs non-AI feedback is still interesting and would be a valuable follow-up study.

      Somon B, et al. "Human or not human? Performance monitoring ERPs during human agent and machine supervision." NeuroImage 186 (2019): 266-277.

      De Visser EJ, et al. "Learning from the slips of others: Neural correlates of trust in automated agents." Frontiers in human neuroscience 12 (2018): 309.

      (3) Participants perform the task at chance level. This makes it unclear to what extent they even tried to perform the task or just randomly pressed buttons. These situations likely differ substantially from a real-life scenario where humans perform an actual task (which is not impossible) and receive actual AI advice.

      This concern was also raised by the other two reviewers. As already stated in our responses above, we will add results from the post-experiment interviews with the participants, the propensity to trust questionnaire, and additional behavioural analyses in our revised manuscript.

      Reviewer 1 (R1.3) also brought up the situation where decisions by the participant and the AI have a more direct link which carries consequences. This will be valuable follow-up research. In the revised manuscript, we will more carefully frame our approach.

      (4) Many of the conclusions in the paper are overstated or very generic.

      In the revised manuscript, we will re-phrase our discussion and conclusions to address the points raised in the reviewer’s recommendations to authors.

    1. Author response:

      We appreciate the thorough and highly valuable feedback from the reviewers. We will take their suggestions on board and prepare a revised manuscript focusing on the following points:

      (1) As reviewers pointed out, we did not evaluate horizontal transfer events of env-containing Ty3/gypsy elements. We consistently observed that elements found in the same phylum/class/superfamily cluster together in the POL phylogenetic tree, suggesting an ancient acquisition of env by the Ty3/gypsy elements: the separation would not be as clear as we observed had they been frequently gained from animals across different phyla, classes, or superfamilies. However, this does not exclude more recent horizontal transfer events that may occur between closely related species. We will perform gene-tree species-tree reconciliation analyses in clades that have enough elements and represented species to estimate the frequency of horizontal transfer events.

      (2) We did not find env-containing Ty3/gypsy elements in some animal phyla such as Echinodermata and Porifera, but this could be due to the quality or number of available genome assemblies, as reviewers suggested. To address this, we will mine GAG-POL gypsy elements in the genomes that were devoid of GAG-POL-ENV elements and compare their abundance with other genomes that carry GAG-POL-ENV elements. If GAG-POL gypsy elements are identified at similar abundance, that would indicate that the observed absence of GAG-POL-ENV elements is not due to poor quality of genome assemblies.

      (3) We will include F-type and HSV-gB type ENV proteins from known viruses in the phylogenetic analysis to investigate their ancestry and potential recombination events with env-containing Ty3/gypsy elements.

      (4) Wherever relevant, we will clarify the terms used in the manuscript, provide the rationale for our selection of POL domains used for structural and phylogenetic analyses, improve accessibility of figures, touch on gypsy elements in vertebrates, and make sure all concepts covered in the results are sufficiently introduced in the introduction.

  3. Nov 2025
    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      From a forward genetic mosaic mutant screen using EMS, the authors identify mutations in glucosylceramide synthase (GlcT), a rate-limiting enzyme for glycosphingolipid (GSL) production, that result in EE tumors. Multiple genetic experiments strongly support the model that the mutant phenotype caused by GlcT loss is due to failure of conversion of ceramide into glucosylceramide. Further genetic evidence suggests that Notch signaling is compromised in the ISC lineage and may affect the endocytosis of Delta. Loss of GlcT does not affect wing development or oogenesis, suggesting tissue-specific roles for GlcT. Finally, an increase in goblet cells in UGCG knockout mice, not previously reported, suggests a conserved role for GlcT in Notch signaling in intestinal cell lineage specification.

      Strengths:

      Overall, this is a well-written paper with multiple well-designed and executed genetic experiments that support a role for GlcT in Notch signaling in the fly and mammalian intestine. I do, however, have a few comments below.

      Weaknesses:

      (1) The authors bring up the intriguing idea that GlcT could be a way to link diet to cell fate choice. Unfortunately, there are no experiments to test this hypothesis.

      We indeed attempted to establish an assay to investigate the impact of various diets (such as high-fat, high-sugar, or high-protein diets) on the fate choice of ISCs. Subsequently, we intended to examine the potential involvement of GlcT in this process. However, we observed that the number or percentage of EEs varies significantly among individuals, even among flies with identical phenotypes subjected to the same nutritional regimen. We suspect that the proliferative status of ISCs and the turnover rate of EEs may significantly influence the number of EEs present in the intestinal epithelium, complicating the interpretation of our results. Consequently, we are unable to conduct this experiment at this time. The hypothesis suggesting that GlcT may link diet to cell fate choice remains an avenue for future experimental exploration.

      (2) Why do the authors think that UGCG knockout results in goblet cell excess and not in the other secretory cell types?

      This is indeed an interesting point. In the mouse intestine, it is well-documented that the knockout of Notch receptors or Delta-like ligands results in a classic phenotype characterized by goblet cell hyperplasia, with little impact on the other secretory cell types. This finding aligns very well with our experimental results, as we noted that the numbers of Paneth cells and enteroendocrine cells appear to be largely normal in UGCG knockout mice. By contrast, increases in other secretory cell types are typically observed under conditions of pharmacological inhibition of the Notch pathway.

      (3) The authors should cite other EMS mutagenesis screens done in the fly intestine.

      To our knowledge, the EMS screen on chromosome 2L conducted in Allison Bardin’s lab is the only one prior to this work, which led to two publications (Perdigoto et al., 2011; Gervais et al., 2019). We have now included citations for both papers in the revised manuscript.

      (4) The absence of a phenotype using NRE-Gal4 is not convincing. This is because the delay in its expression could be after the requirement for the affected gene in the process being studied. In other words, sufficient knockdown of GlcT by RNAi would not be achieved until after the relevant signaling between the EB and the ISC occurred. Dl-Gal4 is problematic as an ISC driver because Dl is expressed in the EEP.

      This is an excellent point, and we agree that the lack of an observable phenotype using NRE-Gal4 could be due to delayed expression, which may result in missing the critical window required for effective GlcT knockdown. Consequently, we cannot rule out the possibility that GlcT also plays a role in early EBs or EEPs. We have revised the manuscript to soften this conclusion and to include this alternative explanation for the experiment.

      (5) The difference in Rab5 between control and GlcT-IR was not that significant. Furthermore, any changes could be secondary to increases in proliferation.

      We agree that it is possible that the observed increase in proliferation could influence the number of Rab5+ endosomes, and we will temper our conclusions on this aspect accordingly. However, it is important to note that, although the difference in Rab5+ endosomes between the control and GlcT-IR conditions appeared mild, it was statistically significant and reproducible. In our revised experiments, we have not only added statistical data and immunofluorescence images for Rab11 but also unified the approaches used for detecting Rab-associated proteins (in the previous figures, Rab5 was shown using U-Rab5-GFP, whereas Rab7 was detected by direct antibody staining). Based on this unified strategy, we optimized the quantification of Dl-GFP colocalization with early, late, and recycling endosomes, and the results are consistent with our previous observations (see the updated Fig. 5).

      Reviewer #2 (Public review):

      Summary:

      This study genetically identifies two key enzymes involved in the biosynthesis of glycosphingolipids, GlcT and Egh, which act as tumor suppressors in the adult fly gut. Detailed genetic analysis indicates that a deficiency in Mactosyl-ceramide (Mac-Cer) is causing tumor formation. Analysis of a Notch transcriptional reporter further indicates that the lack of Mac-Cer is associated with reduced Notch activity in the gut, but not in other tissues.

      Addressing how a change in the lipid composition of the membranes might lead to defective Notch receptor activation, the authors studied the endocytic trafficking of Delta and claimed that internalized Delta appeared to accumulate faster into endosomes in the absence of Mac-Cer. Further analysis of Delta steady-state accumulation in fixed samples suggested a delay in the endosomal trafficking of Delta from Rab5+ to Rab7+ endosomes, which was interpreted to suggest that the inefficient, or delayed, recycling of Delta might cause a loss in Notch receptor activation.

      Finally, the histological analysis of mouse guts following the conditional knock-out of the GlcT gene suggested that Mac-Cer might also be important for proper Notch signaling activity in that context.

      Strengths:

      The genetic analysis is of high quality. The finding that a Mac-Cer deficiency results in reduced Notch activity in the fly gut is important and fully convincing.

      The mouse data, although preliminary, raised the possibility that the role of this specific lipid may be conserved across species.

      Weaknesses:

      This study is not, however, without caveats and several specific conclusions are not fully convincing.

      First, the conclusion that GlcT is specifically required in Intestinal Stem Cells (ISCs) is not fully convincing for technical reasons: NRE-Gal4 may be less active in GlcT mutant cells, and the knock-down of GlcT using Dl-Gal4ts may not be restricted to ISCs given the perdurance of Gal4 and of its downstream RNAi.

      As previously mentioned, we acknowledge that a role for GlcT in early EBs or EEPs cannot be completely ruled out. We have revised our manuscript to present a more cautious conclusion and explicitly described this possibility in the updated version.

      Second, the results from the antibody uptake assays are not clear: i) the levels of internalized Delta were not quantified in these experiments; ii) additionally, live guts were incubated with anti-Delta for 3hr. This long period of incubation indicated that the observed results may not necessarily reflect the dynamics of endocytosis of antibody-bound Delta, but might also inform about the distribution of intracellular Delta following the internalization of unbound anti-Delta. It would thus be interesting to examine the level of internalized Delta in experiments with shorter incubation time.

      We thank the reviewer for these excellent questions. In our antibody uptake experiments, we noted that Dl reached its peak accumulation after a 3-hour incubation period. We recognize that quantifying internalized Dl would enhance our analysis, and we will include the corresponding statistical graphs in the revised version of the manuscript. In addition, we agree that during the 3-hour incubation, the potential internalization of unbound anti-Dl cannot be ruled out, as it may influence the observed distribution of intracellular Dl. We therefore attempted to supplement our findings with live imaging experiments to investigate the dynamics of Dl/Notch endocytosis in both normal and GlcT mutant ISCs. However, we found that the GFP expression level of Dl-GFP (either in the knock-in or transgenic line) was too low to be reliably tracked. During the three-hour observation period, the weak GFP signal remained largely unchanged regardless of the GlcT mutation status, and the signal resolution under the microscope was insufficient to clearly distinguish membrane-associated from intracellular Dl. Therefore, we were unable to obtain a dynamic view of Dl trafficking through live imaging. Nevertheless, our Dl antibody uptake and endosomal retention analyses collectively support the notion that MacCer influences Notch signaling by regulating Dl endocytosis.

      Overall, the proposed working model needs to be solidified as important questions remain open, including: is the endo-lysosomal system, i.e. steady-state distribution of endo-lysosomal markers, affected by the Mac-Cer deficiency? Is the trafficking of Notch also affected by the Mac-Cer deficiency? is the rate of Delta endocytosis also affected by the Mac-Cer deficiency? are the levels of cell-surface Delta reduced upon the loss of Mac-Cer?

      Regarding the impact on the endo-lysosomal system, this is indeed an important aspect to explore. While we did not conduct experiments specifically designed to evaluate the steady-state distribution of endo-lysosomal markers, our analyses utilizing Rab5-GFP overexpression and Rab7 staining did not indicate any significant differences in endosome distribution in MacCer deficient conditions. Moreover, we still observed high expression of the NRE-LacZ reporter specifically at the boundaries of clones in GlcT mutant cells (Fig. 4A), indicating that GlcT mutant EBs remain responsive to Dl produced by normal ISCs located right at the clone boundary. Therefore, we propose that MacCer deficiency may specifically affect Dl trafficking without impacting Notch trafficking.

      In our 3-hour antibody uptake experiments, we observed a notable decrease in cell-surface Dl, which was accompanied by an increase in intracellular accumulation. These findings collectively suggest that Dl may be unstable on the cell surface, leading to its accumulation in early endosomes.

      Third, while the mouse results are potentially interesting, they seem to be relatively preliminary, and future studies are needed to test whether the level of Notch receptor activation is reduced in this model.

      In the mouse small intestine, Olfm4 is a well-established target gene of the Notch signaling pathway, and its staining provides a reliable indication of Notch pathway activation. While we attempted to evaluate Notch activation using additional markers, such as Hes1 and NICD, we encountered difficulties, as the corresponding antibody reagents did not perform well in our hands. Despite these challenges, we believe that our findings with Olfm4 provide an important starting point for further investigation in the future.

      Reviewer #3 (Public review):

      Summary:

      In this paper, Tang et al report the discovery of a Glucosylceramide synthase gene, GlcT, which they found in a genetic screen for mutations that generate tumorous growth of stem cells in the gut of Drosophila. The screen was expertly done using a classic mutagenesis/mosaic method. Their initial characterization of the GlcT alleles, which generate endocrine tumors much like mutations in the Notch signaling pathway, is also very nice. Tang et al checked other enzymes in the glycosylceramide pathway and found that the loss of one gene just downstream of GlcT (Egh) gives similar phenotypes to GlcT, whereas three genes further downstream do not replicate the phenotype. Remarkably, dietary supplementation with a predicted GlcT/Egh product, Lactosyl-ceramide, was able to substantially rescue the GlcT mutant phenotype. Based on the phenotypic similarity of the GlcT and Notch phenotypes, the authors show that activated Notch is epistatic to GlcT mutations, suppressing the endocrine tumor phenotype and that GlcT mutant clones have reduced Notch signaling activity. Up to this point, the results are all clear, interesting, and significant. Tang et al then go on to investigate how GlcT mutations might affect Notch signaling, and present results suggesting that GlcT mutation might impair the normal endocytic trafficking of Delta, the Notch ligand. These results (Fig X-XX), unfortunately, are less than convincing; either more conclusive data should be brought to support the Delta trafficking model, or the authors should limit their conclusions regarding how GlcT loss impairs Notch signaling. Given the results shown, it's clear that GlcT affects EE cell differentiation, but whether this is via directly altering Dl/N signaling is not so clear, and other mechanisms could be involved. Overall the paper is an interesting, novel study, but it lacks somewhat in providing mechanistic insight. With conscientious revisions, this could be addressed. We list below specific points that Tang et al should consider as they revise their paper.

      Strengths:

      The genetic screen is excellent.

      The basic characterization of GlcT phenotypes is excellent, as is the downstream pathway analysis.

      Weaknesses:

      (1) Lines 147-149, Figure 2E: here, the study would benefit from quantitations of the effects of loss of brn, B4GalNAcTA, and a4GT1, even though they appear negative.

      We have incorporated the quantifications for the effects of the loss of brn, B4GalNAcTA, and a4GT1 in the updated Figure 2.

      (2) In Figure 3, it would be useful to quantify the effects of LacCer on proliferation. The suppression result is very nice, but only effects on Pros+ cell numbers are shown.

      We have now added quantifications of the number of EEs per clone to the updated Figure 3.

      (3) In Figure 4A/B we see less NRE-LacZ in GlcT mutant clones. Are the data points in Figure 4B per cell or per clone? Please note. Also, there are clearly a few NRE-LacZ+ cells in the mutant clone. How does this happen if GlcT is required for Dl/N signaling?

      In Figure 4B, the data points represent the fluorescence intensity per single cell within each clone. It is true that a few NRE-LacZ+ cells can still be observed within the mutant clone; however, this does not contradict our conclusion. As noted, high expression of the NRE-LacZ reporter was specifically observed around the clone boundaries in MacCer deficient cells (Fig. 4A), indicating that the mutant EBs can normally receive Dl signal from the normal ISCs located at the clone boundary and activate the Notch signaling pathway. Therefore, we believe that, although affecting Dl trafficking, MacCer deficiency does not significantly affect Notch trafficking.

      (4) Lines 222-225, Figure 5AB: The authors use the NRE-Gal4ts driver to show that GlcT depletion in EBs has no effect. However, this driver is not activated until well into the process of EB commitment, and RNAi's take several days to work, and so the authors' conclusion that GlcT is "specifically required in ISCs" and not at all in EBs may be erroneous.

      As previously mentioned, we acknowledge that a role for GlcT in early EBs or EEPs cannot be completely ruled out. We have revised our manuscript to present a more cautious conclusion and described this possibility in the updated version.

      (5) Figure 5C-F: These results relating to Delta endocytosis are not convincing. The data in Fig 5C are not clear and not quantitated, and the data in Figure 5F are so widely scattered that it seems these co-localizations are difficult to measure. The authors should either remove these data, improve them, or soften the conclusions taken from them. Moreover, it is unclear how the experiments tracing Delta internalization (Fig 5C) could actually work. This is because for this method to work, the anti-Dl antibody would have to pass through the visceral muscle before binding Dl on the ISC cell surface. To my knowledge, antibody transcytosis is not a common phenomenon.

      We thank the reviewer for these insightful comments and suggestions. In our in vivo experiments, we observed increased co-localization of Rab5 and Dl in GlcT mutant ISCs, indicating that Dl trafficking is delayed at the transition to Rab7⁺ late endosomes, a finding that is further supported by our antibody uptake experiments. We acknowledge that the data presented in Fig. 5C are not fully quantified and that the co-localization data in Fig. 5F may appear somewhat scattered; therefore, we have included additional quantification and enhanced the data presentation in the revised manuscript.

      Regarding the concern about antibody internalization, we appreciate this point. We currently do not know if the antibody reaches the cell surface of ISCs by passing through the visceral muscle or via other routes. Given that the experiment was conducted with fragmented gut, it is possible that the antibody may penetrate into the tissue through mechanisms independent of transcytosis.

      As mentioned earlier, we attempted to supplement our findings with live imaging experiments to investigate the dynamics of Dl/Notch endocytosis in both normal and GlcT mutant ISCs. However, we found that the GFP expression level of Dl-GFP (either in the knock-in or transgenic line) was too low to be reliably tracked. During the three-hour observation period, the weak GFP signal remained largely unchanged regardless of the GlcT mutation status, and the signal resolution under the microscope was insufficient to clearly distinguish membrane-associated from intracellular Dl. Therefore, we were unable to obtain a dynamic view of Dl trafficking through live imaging. Nevertheless, our Dl antibody uptake and endosomal retention analyses collectively support the notion that MacCer influences Notch signaling by regulating Dl endocytosis.

      (6) It is unclear whether MacCer regulates Dl-Notch signaling by modifying Dl directly or by influencing the general endocytic recycling pathway. The authors say they observe increased Dl accumulation in Rab5+ early endosomes but not in Rab7+ late endosomes upon GlcT depletion, suggesting that the recycling endosome pathway, which retrieves Dl back to the cell surface, may be impaired by GlcT loss. To test this, the authors could examine whether recycling endosomes (marked by Rab4 and Rab11) are disrupted in GlcT mutants. Rab11 has been shown to be essential for recycling endosome function in fly ISCs.

      We agree that assessing the state of recycling endosomes, especially by using markers such as Rab11, would be valuable in determining whether MacCer regulates Dl-Notch signaling by directly modifying Dl or by influencing the broader endocytic recycling pathway. In the newly added experiments, we found that in GlcT-IR flies, Dl still exhibits partial colocalization with Rab11, and the overall expression pattern of Rab11 is not affected by GlcT knockdown (Fig. 5E-F). These observations suggest that MacCer specifically regulates Dl trafficking rather than broadly affecting the recycling pathway.

      (7) It remains unclear whether Dl undergoes post-translational modification by MacCer in the fly gut. At a minimum, the authors should provide biochemical evidence (e.g., Western blot) to determine whether GlcT depletion alters the protein size of Dl.

      While we propose that MacCer may function as a component of lipid rafts, facilitating Dl membrane anchorage and endocytosis, we also acknowledge the possibility that MacCer could serve as a substrate for protein modifications of Dl necessary for its proper function. Conducting biochemical analyses to investigate potential post-translational modifications of Dl by MacCer would indeed provide valuable insights. We have performed Western blot analysis to test whether GlcT depletion affects the protein size of Dl. As shown below, we did not detect any apparent changes in the molecular weight of the Dl protein. Therefore, it is unlikely that MacCer regulates post-translational modifications of Dl.

      Author response image 1.

      Western blot analysis to investigate whether MacCer modifies Dl. (A) Four lanes were loaded: the first two contained 20 μL of membrane extract (lane 1: GlcT-IR; lane 2: control), while the last two contained 10 μL of membrane extract. (B) Full blot images are shown under both long- and short-exposure conditions.

      (8) It is unfortunate that GlcT doesn't affect Notch signaling in other organs of the fly. This brings into question the Delta trafficking model and the authors should note this. Also, the clonal marker in Figure 6C is not clear.

      In the revised working model, we have explicitly described that the events occur in intestinal stem cells. Regarding Figure 6C, we have delineated the clone with a white dashed line to enhance its clarity and visual comprehension.

      (9) The authors state that loss of UGCG in the mouse small intestine results in a reduced ISC count. However, in Supplementary Figure C3, Ki67, a marker of ISC proliferation, is significantly increased in UGCG-CKO mice. This contradiction should be clarified. The authors might repeat this experiment using an alternative ISC marker, such as Lgr5.

      Previous studies have indicated that dysregulation of the Notch signaling pathway can result in a reduction in the number of ISCs. While we did not perform a direct quantification of ISC numbers in our experiments, our Olfm4 staining—which serves as a reliable marker for ISCs—demonstrates a clear reduction in the number of positive cells in UGCG-CKO mice.

      The increased Ki67 signal we observed reflects enhanced proliferation in the transit-amplifying region, and it does not directly indicate an increase in ISC number. Therefore, in UGCG-CKO mice, we observe a decrease in the number of ISCs, while there is an increase in transit-amplifying (TA) cells (progenitor cells). This increase in TA cells is probably a secondary consequence of the loss of barrier function associated with the UGCG knockout.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      The study analyzes the gastric fluid DNA content identified as a potential biomarker for human gastric cancer. However, the study lacks overall logicality, and several key issues require improvement and clarification. In the opinion of this reviewer, some major revisions are needed:

      (1) This manuscript lacks a comparison of gastric cancer patients' stages with PN and N+PD patients, especially T0-T2 patients.

      We are grateful for this astute remark. A comparison of gfDNA concentration among the diagnostic groups indicates a trend of increasing values as the diagnosis progresses toward malignancy. The observed values for the diagnostic groups are as follows:

      Author response table 1.

      The chart below presents the statistical analyses of the same diagnostic/tumor-stage groups (one-way ANOVA followed by Tukey’s multiple comparison tests). It shows that gfDNA concentrations gradually increase with malignant progression. We observed that the initial tumor stages (T0 to T2) exhibit intermediate gfDNA levels, which are significantly lower than in advanced disease (p = 0.0036), but not statistically different from non-neoplastic disease (p = 0.74).

      Author response image 1.
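
      For readers unfamiliar with the test named above, the following is a minimal sketch of the F statistic behind a one-way ANOVA, written with the Python standard library only. The function name `one_way_anova_f` and the toy numbers are illustrative assumptions; they are not the study's code or measurements. Obtaining a p-value and the Tukey post hoc comparisons additionally requires the F and studentized range distributions, which statistical packages provide (for example, SciPy's `scipy.stats.f_oneway` and `scipy.stats.tukey_hsd`).

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of measurements per group.
    Hypothetical helper for illustration; not taken from the study.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy values only (NOT the study's gfDNA measurements): three made-up groups.
toy_groups = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0],
]
f_stat, df_b, df_w = one_way_anova_f(toy_groups)
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

      A large F indicates that the group means differ by more than the within-group scatter would suggest; the pairwise Tukey comparisons then identify which groups differ.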

      (2) The comparison between gastric cancer stages seems only to reveal the difference between T3 patients and early-stage gastric cancer patients, which raises doubts about the authenticity of the previous differences between gastric cancer patients and normal patients, whether it is only due to the higher number of T3 patients.

      We appreciate the attention to detail regarding the numbers analyzed in the manuscript. Importantly, the results are meaningful because the number of subjects in each group is comparable (T0-T2, N = 65; T3, N = 91; T4, N = 63). The mean gastric fluid gfDNA values (ng/µL) increase with disease stage (T0-T2: 15.12; T3-T4: 30.75), and both are higher than the mean gfDNA values observed in non-neoplastic disease (10.81 ng/µL for N+PD and 10.10 ng/µL for PN). These subject numbers in each diagnostic group accurately reflect real-world data from a tertiary cancer center.

      (3) The prognosis evaluation is too simplistic, only considering staging factors, without taking into account other factors such as tumor pathology and the time from onset to tumor detection.

      Histopathological analyses were performed throughout the study not only for the initial diagnosis of tissue biopsies, but also for the classification of Lauren’s subtypes, tumor staging, and the assessment of the presence and extent of immune cell infiltrates. Regarding the time of disease onset, this variable is inherently unknown, by definition, at the time of a diagnostic EGD. While the prognosis definition is indeed straightforward, we believe that a simple, cost-effective, and practical approach is advantageous for patients across diverse clinical settings and is more likely to be effectively integrated into routine EGD practice.

      (4) The comparison between gfDNA and conventional pathological examination methods should be mentioned, reflecting advantages such as accuracy and patient comfort.

      We wish to emphasize that EGD, along with conventional histopathology, remains the gold standard for gastric cancer evaluation. EGD under sedation is routinely performed for diagnosis, and the collection of gastric fluids for gfDNA evaluation does not affect patient comfort. Thus, while gfDNA analysis is not intended as a replacement for diagnostic EGD and biopsy, it may add prognostic value to this exam.

      (5) There are many questions in the figures and tables. Please match the Title, Figure legends, Footnote, Alphabetic order, etc.

      We are grateful for these comments and apologize for the clerical oversight. All figures, tables, titles and figure legends have now been double-checked.

      (6) The overall logicality of the manuscript is not rigorous enough, with few discussion factors, and cannot represent the conclusions drawn.

      We assume that the remark regarding “overall logicality” pertains to the rationale and reasoning of this investigational study. Our working hypothesis was that during neoplastic disease progression, tumor cells continuously proliferate and, depending on various factors, attract immune cell infiltrates. Consequently, both tumor cells and immune cells (as well as tumor-derived DNA) are released into the fluids surrounding the tumor at its various locations, including blood, urine, saliva, gastric fluids, and others. Thus, increases in DNA levels within some of these fluids have been documented and are clinically meaningful. The concurrent observation of elevated gfDNA levels and immune cell infiltration supports the hypothesis that increased gfDNA, which may originate not only from tumor cells but also from immune cells, could be associated with better prognosis, as suggested by this study of a large real-world patient cohort.

      In summary, we thank Reviewer #1 for their time and effort in a constructive critique of our work.

      Reviewer #2 (Public review):

      Summary:

      The authors investigated whether the total DNA concentration in gastric fluid (gfDNA), collected via routine esophagogastroduodenoscopy (EGD), could serve as a diagnostic and prognostic biomarker for gastric cancer. In a large patient cohort (initial n=1,056; analyzed n=941), they found that gfDNA levels were significantly higher in gastric cancer patients compared to non-cancer, gastritis, and precancerous lesion groups. Unexpectedly, higher gfDNA concentrations were also significantly associated with better survival prognosis and positively correlated with immune cell infiltration. The authors proposed that gfDNA may reflect both tumor burden and immune activity, potentially serving as a cost-effective and convenient liquid biopsy tool to assist in gastric cancer diagnosis, staging, and follow-up.

      Strengths:

      This study is supported by a robust sample size (n=941) with clear patient classification, enabling reliable statistical analysis. It employs a simple, low-threshold method for measuring total gfDNA, making it suitable for large-scale clinical use. Clinical confounders, including age, sex, BMI, gastric fluid pH, and PPI use, were systematically controlled. The findings demonstrate both diagnostic and prognostic value of gfDNA, as its concentration can help distinguish gastric cancer patients and correlates with tumor progression and survival. Additionally, preliminary mechanistic data reveal a significant association between elevated gfDNA levels and increased immune cell infiltration in tumors (p=0.001).

      Reviewer #2 has conceptually grasped the overall rationale of the study quite well, and we are grateful for their assessment and comprehensive summary of our findings.

      Weaknesses:

      (1) The study has several notable weaknesses. The association between high gfDNA levels and better survival contradicts conventional expectations and raises concerns about the biological interpretation of the findings.

      We agree that this would be the case if the gfDNA were derived solely from tumor cells. However, the findings presented here suggest that a fraction of this DNA is derived from infiltrating immune cells. The precise origin of this increased gfDNA remains to be determined in follow-up studies, which we plan to conduct soon using DNA- and RNA-sequencing methodologies and deconvolution analyses.

      (2) The diagnostic performance of gfDNA alone was only moderate, and the study did not explore potential improvements through combination with established biomarkers. Methodological limitations include a lack of control for pre-analytical variables, the absence of longitudinal data, and imbalanced group sizes, which may affect the robustness and generalizability of the results.

      Reviewer #2 is correct that this investigational study was not designed to assess the diagnostic potential of gfDNA. Instead, its primary contribution is to provide useful prognostic information. In this regard, we have not yet explored combining gfDNA with other clinically well-established diagnostic biomarkers. We do acknowledge this current limitation as a logical follow-up that must be investigated in the near future.

      Moreover, we collected a substantial number of pre-analytical variables within the limitations of a study involving over 1,000 subjects. Longitudinal samples and data were not analyzed here, as our aim was to evaluate prognostic value at diagnosis. Although the groups are imbalanced, this accurately reflects the real-world population of a large endoscopy center within a dedicated cancer facility. Subjects were invited to participate and enter the study before sedation for the diagnostic EGD procedure; thus, samples were collected prospectively from all consenting individuals.

      Finally, to maintain a large, unbiased cohort, we did not attempt to balance the groups, allowing analysis of samples and data from all patients with compatible diagnoses (please see Results: Patient groups and diagnoses).

      (3) Additionally, key methodological details were insufficiently reported, and the ROC analysis lacked comprehensive performance metrics, limiting the study's clinical applicability.

      We are grateful for this useful suggestion. In the current version, each ROC curve (Supplementary Figures 1A and 1B) now includes the top 10 gfDNA thresholds, along with their corresponding sensitivity and specificity values (please see Suppl. Table 1). The thresholds are ordered from best to worst based on the classic Youden’s J statistic, as follows:

      Youden Index = specificity + sensitivity – 1 [Youden WJ. Index for rating diagnostic tests. Cancer 3:32-35, 1950. PMID: 15405679]. We have made an effort to provide all the key methodological details requested, but we would be glad to add further information upon specific request.
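
      As an illustration of the ranking described above, the following minimal sketch scores every candidate threshold by Youden's J and returns the best ones. It is not the authors' code: the function name `youden_ranking`, the input arrays, and the convention that a sample is called positive when its value is at or above the threshold are all assumptions made for this example.

```python
import numpy as np

def youden_ranking(values, labels, top_k=10):
    """Rank candidate thresholds by Youden's J = sensitivity + specificity - 1.

    values: biomarker measurements (hypothetical, e.g. gfDNA concentrations).
    labels: 1 for cases, 0 for controls.
    Assumption: a sample is called positive when value >= threshold.
    Returns a list of (threshold, sensitivity, specificity, J), best first.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels, dtype=int)
    is_case = labels == 1
    is_control = labels == 0

    rows = []
    for threshold in np.unique(values):
        called_positive = values >= threshold
        sensitivity = np.sum(called_positive & is_case) / np.sum(is_case)
        specificity = np.sum(~called_positive & is_control) / np.sum(is_control)
        j = sensitivity + specificity - 1.0
        rows.append((float(threshold), float(sensitivity), float(specificity), float(j)))

    # Sort from best to worst J and keep the top_k thresholds.
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows[:top_k]
```

      Every observed value is used as a candidate threshold here; a real analysis might restrict or interpolate the candidates differently.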

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #3 (Recommendations for the authors):

      The authors have done an excellent job of addressing most comments, but my concerns about Figure 5 remain. I appreciate the authors' efforts to address the problem involving Rs being part of the computation on both the x and y axes of Figure 5, but addressing this via simulation addresses statistical significance but overlooks effect size. I think the authors may have misunderstood my original suggestion, so I will attempt to explain it better here. Since "Rs" is an average across all trials, the trials could be subdivided in two halves to compute two separate averages - for example, an average of the even numbered trials and an average of the odd numbered trials. Then you would use the "Rs" from the even numbered trials for one axis and the "Rs" from the odd numbered trials for the other. You would then plot R-Rs_even vs Rf-Rs_odd. This would remove the confound from this figure, and allow the text/interpretation to be largely unchanged (assuming the results continue to look as they do).

      We have added a description and the result of the new analysis (line #321 to #332), and a supplementary figure (Suppl. Fig. 1) (line #1464 to #1477). 

      “We calculated 𝑅<sub>𝑠</sub> in the ordinate and abscissa of Figure 5A-E using responses averaged across different subsets of trials, such that 𝑅<sub>𝑠</sub> was no longer a common term in the ordinate and abscissa. For each neuron, we determined 𝑅<sub>𝑠1</sub> by averaging the firing rates of 𝑅<sub>𝑠</sub> across half of the recorded trials, selected randomly. We also determined 𝑅<sub>𝑠2</sub> by averaging the firing rates of 𝑅<sub>𝑠</sub> across the rest of the trials. We regressed (𝑅 − 𝑅<sub>𝑠1</sub>) on (𝑅<sub>𝑓</sub> − 𝑅<sub>𝑠2</sub>), as well as (𝑅 − 𝑅<sub>𝑠2</sub>) on (𝑅<sub>𝑓</sub> − 𝑅<sub>𝑠1</sub>), and repeated the procedure 50 times. The averaged slopes obtained with 𝑅<sub>𝑠</sub> from the split trials showed the same pattern as those using 𝑅<sub>𝑠</sub> from all trials (Table 1 and Supplementary Fig. 1), although the coefficient of determination was slightly reduced (Table 1). For ×4 speed separation, the slopes were nearly identical to those shown in Figure 5F1. For ×2 speed separation, the slopes were slightly smaller than those in Figure 5F2, but followed the same pattern (Supplementary Fig. 1). Together, these analysis results confirmed the faster-speed bias at the slow stimulus speeds, and the change of the response weights as stimulus speeds increased.”
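
      The split-trial procedure quoted above can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' analysis code: the function name `split_half_slope`, the array layout, and the use of ordinary least squares with an intercept (`np.polyfit`) are assumptions.

```python
import numpy as np

def split_half_slope(r_bi, r_fast, r_slow_trials, n_repeats=50, seed=0):
    """Average regression slope with R_s estimated from disjoint trial halves.

    r_bi:          (n_neurons,) trial-averaged response R to the bi-speed stimulus.
    r_fast:        (n_neurons,) trial-averaged response R_f to the faster component.
    r_slow_trials: (n_neurons, n_trials) single-trial responses to the slower component.
    Assumption: ordinary least squares with an intercept; the source only says "regressed".
    """
    rng = np.random.default_rng(seed)
    r_bi = np.asarray(r_bi, dtype=float)
    r_fast = np.asarray(r_fast, dtype=float)
    r_slow_trials = np.asarray(r_slow_trials, dtype=float)
    half = r_slow_trials.shape[1] // 2

    slopes = []
    for _ in range(n_repeats):
        # Shuffle trials independently for each neuron, then split in two halves.
        shuffled = rng.permuted(r_slow_trials, axis=1)
        rs1 = shuffled[:, :half].mean(axis=1)
        rs2 = shuffled[:, half:].mean(axis=1)
        # Regress (R - Rs1) on (Rf - Rs2), and (R - Rs2) on (Rf - Rs1),
        # so that no R_s estimate appears on both axes.
        for rs_y, rs_x in ((rs1, rs2), (rs2, rs1)):
            slope, _intercept = np.polyfit(r_fast - rs_x, r_bi - rs_y, 1)
            slopes.append(slope)
    return float(np.mean(slopes))
```

      Because the two halves share no trials, noise in the 𝑅<sub>𝑠</sub> estimate no longer enters the ordinate and abscissa as a common term, which is the confound the reviewer raised.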

      An additional remaining item concerns the terminology weighted sum, in the context of the constraint that wf and ws must sum to one. My opinion is that it is non-standard to use weighted sum when the computation is a weighted average, but as long as the authors make their meaning clear, the reader will be able to follow. I suggest adding some phrasing to explain to the reader the shift in interpretation from the more general weighted sum to the more constrained weighted average. Specifically, "weighted sum" first appears on line 268, and then the additional constraint of ws + wf =1 is introduced on line 278. Somewhere around line 278, it would be useful to include a sentence stating that this constraint means the weighted sum is constrained to be a weighted average.

      Thanks for the suggestion. We have modified the text as follows. Since we made other modifications in the text, the line numbers are slightly different from the last version. 

      Line #274 to 275: 

      “Since it is not possible to solve for both variables, 𝑤<sub>𝑠</sub> and 𝑤<sub>𝑓</sub>, from a single equation (Eq. 5) with three data points, we introduced an additional constraint: 𝑤<sub>𝑠</sub> + 𝑤<sub>𝑓</sub> =1. With this constraint, the weighted sum becomes a weighted average.”

      Also on line #309:

      “First, at each speed pair and for each of the 100 neurons in the data sample shown in Figure 5, we simulated the response to the bi-speed stimuli (𝑅<sub>𝑒</sub>) as a randomly weighted average of 𝑅<sub>𝑓</sub> and 𝑅<sub>𝑠</sub> of the same neuron:

      𝑅<sub>𝑒</sub> = 𝑎𝑅<sub>𝑓</sub> + (1 − 𝑎)𝑅<sub>𝑠</sub>

      in which 𝑎 was a randomly generated weight (between 0 and 1) for 𝑅<sub>𝑓</sub>, and the weights for 𝑅<sub>𝑓</sub> and 𝑅<sub>𝑠</sub> summed to one.”
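
      A minimal sketch of the simulation described in the quoted passage, assuming the weight is drawn uniformly on [0, 1] and independently for each neuron (the source only says "randomly generated"). The function name `simulate_weighted_average` and the example firing rates are hypothetical.

```python
import numpy as np

def simulate_weighted_average(r_fast, r_slow, seed=0):
    """Simulate bi-speed responses as a randomly weighted average of R_f and R_s.

    Assumption: the weight a for R_f is uniform on [0, 1], one draw per neuron;
    the weight for R_s is 1 - a, so the two weights sum to one.
    Returns (simulated responses, weights a).
    """
    rng = np.random.default_rng(seed)
    r_fast = np.asarray(r_fast, dtype=float)
    r_slow = np.asarray(r_slow, dtype=float)
    a = rng.uniform(0.0, 1.0, size=r_fast.shape)
    r_sim = a * r_fast + (1.0 - a) * r_slow
    return r_sim, a

# Hypothetical firing rates (spikes/s) for three neurons.
r_sim, a = simulate_weighted_average([40.0, 50.0, 30.0], [10.0, 20.0, 30.0], seed=1)
```

      Since the weights sum to one, each simulated response lies between the neuron's two component responses.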

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      The authors focus on the molecular mechanisms by which EMT cells confer resistance to cancer cells. The authors use a wide range of methods to reveal that overexpression of Snail in EMT cells induces cholesterol/sphingomyelin imbalance via transcriptional repression of biosynthetic enzymes involved in sphingomyelin synthesis. The study also revealed that ABCA1 is important for cholesterol efflux and thus for counterbalancing the excess of intracellular free cholesterol in these snail-EMT cells. Inhibition of ACAT, an enzyme catalyzing cholesterol esterification, also seems essential to inhibit the growth of snail-expressing cancer cells.

      However, it seems important to analyze the localization of ABCA1, as it is possible that in the event of cholesterol/sphingomyelin imbalance, for example, the intracellular trafficking of the pump may be altered.

      The authors should also analyze ACAT levels and/or activity in snail-EMT cells that should be increased. Overall, the provided data are important to better understand cancer biology.

      We thank the reviewer for recognizing the significance of our study. Consistent with the hypothesis that ABCA1 contributes to chemoresistance in hybrid E/M cells, we agree that demonstrating the localization of ABCA1 at the plasma membrane is important, and we have included additional experiments to address this point.

      We also examined the expression of the major ACAT isoform in the kidney, SOAT1, across RCC cell lines. However, its expression did not correlate with that of Snail (Figure 4B), suggesting that SOAT1 is constitutively expressed at a certain level regardless of Snail expression. The details of these additional experiments are provided in the point-by-point responses below.

      Reviewer #2 (Public review):

      Summary:

      In this study, the authors discovered that the chemoresistance in RCC cell lines correlates with the expression levels of the drug transporter ABCA1 and the EMT-related transcription factor Snail. They demonstrate that Snail induces ABCA1 expression and chemoresistance, and that ABCA1 inhibitors can counteract this resistance. The study also suggests that Snail disrupts the cholesterol-sphingomyelin (Chol/SM) balance by repressing the expression of enzymes involved in very long-chain fatty acid-sphingomyelin synthesis, leading to excess free cholesterol. This imbalance activates the cholesterol-LXR pathway, inducing ABCA1 expression. Moreover, inhibiting cholesterol esterification suppresses Snail-positive cancer cell growth, providing potential lipid-targeting strategies for invasive cancer therapy.

      Strengths:

      This research presents a novel mechanism by which the EMT-related transcription factor Snail confers drug resistance by altering the Chol/SM balance, introducing a previously unrecognized role of lipid metabolism in the chemoresistance of cancer cells. The focus on lipid balance, rather than individual lipid levels, is a particularly insightful approach. The potential for targeting cholesterol detoxification pathways in Snail-positive cancer cells is also a significant therapeutic implication.

      Weaknesses:

      The study's claim that Snail-induced ABCA1 is crucial for chemoresistance relies only on pharmacological inhibition of ABCA1, lacking additional validation. The causal relationship between the disrupted Chol/SM balance and ABCA1 expression or chemoresistance is not directly supported by data. Some data lack quantitative analysis.

      We thank the reviewer for his/her insightful and constructive comments. In response, we have performed additional experiments using complementary approaches to further substantiate the contribution of Snail-induced ABCA1 expression to chemoresistance. Furthermore, to clarify the causal relationship between reduced sphingomyelin biosynthesis and ABCA1 expression, we conducted new experiments showing that supplementation with sphingolipids attenuates ABCA1 upregulation (Figure 3H). The details of these additional experiments are described in the point-by-point responses below.

      Reviewer #1 (Recommendations for the authors):

      In this paper, the authors reveal that snail expression in EMT-cells leads to an imbalance between cholesterol and sphingomyelin via a transcriptional repression of enzymes involved in the biosynthesis of sphingomyelin.

      This paper is interesting and highlights how the imbalance of lipids would impact chemotherapy resistance. However, I have a few comments.

      In Figure 2, filipin staining appears exclusively at the plasma membrane in EpH4 cells, whereas in EpH4-Snail cells it is also intracellular. It seems plausible that not all filipin-positive intracellular staining is in LDs; the authors should therefore try to colocalize filipin with other intracellular markers. To this end, the authors might want to use a TopFluor-cholesterol probe, for instance.

      We examined the distribution of TopFluor-cholesterol in hybrid E/M cells (Figure 2H) and found that TopFluor-cholesterol colocalizes with lipid droplets. In addition, we analyzed the colocalization between intracellular filipin signals and organelle-specific proteins, ADRP (lipid droplets) and LAMP1 (lysosomes) (Figure 2I). Since filipin binds exclusively to unesterified cholesterol, filipin signals did not colocalize with ADRP. Instead, we observed colocalization of filipin with LAMP1, suggesting that cholesterol accumulates in hybrid E/M cells in both esterified and unesterified forms.

      In Figure 3, the authors reveal that exogenous expression of Snail alters the ratio of cholesterol to sphingomyelin. The authors should show where intracellular cholesterol and intracellular sphingomyelin are located within these EpH4-Snail cells.

      To investigate the lipid composition of the plasma membrane, we utilized lipid-binding protein probes, D4 (for cholesterol) and lysenin (for sphingomyelin) (Figures 2L and 2M). We found that the plasma membrane cholesterol content was not affected by EMT, whereas sphingomyelin levels were markedly decreased. In addition, intracellular cholesterol was visualized (Comment 1-1; Figures 2E–2K). On the other hand, because visualization of intracellular sphingomyelin is technically challenging, we were unable to include this analysis in the present study. We consider this an important direction for future investigation.

      Regarding the model described in panel K of Figure 3: I would expect the changes in lipid-membrane organization depicted in panel K to affect, for instance, the pattern of GM1 toxin or the motility of raft-associated proteins. The authors could perform these experiments to support the proposed change in plasma membrane lipid organization.

      We attempted staining with FITC–cholera toxin to visualize GM1, but both EpH4 and EpH4–Snail cells exhibited very low levels of GM1, resulting in minimal or no detectable staining (data not shown). Instead, to assess the impact of decreased sphingomyelin on the overall biophysical properties of the plasma membrane, we used a plasma membrane–specific lipid-order probe, FπCM–SO₃ (Figures 2N–2P and Figure 2—figure supplement 3). We found that the plasma membrane of EpH4–Snail cells was more disordered (fluidized), suggesting that the overall properties of the plasma membrane are altered by ectopic expression of Snail.

      Another issue is the intracellular localization of ABCA1 in EpH4-Snail cells. Knowing that a change in the cholesterol/sphingomyelin ratio can also modify intracellular protein trafficking, it seems important to analyze the intracellular localization of ABCA1 in EpH4-Snail cells.

      We performed immunofluorescence microscopy for ABCA1 and found that ABCA1 was mainly localized at the plasma membrane in EpH4–Snail cells (Figure 1M).

      As for the data on ACAT inhibition, we expect an increase in ACAT activity and protein levels in EMT cells overexpressing Snail. The authors should also investigate this point.

      As noted in our response to the public review, we examined the expression of the major ACAT isoform in the kidney, SOAT1, across RCC cell lines. However, its expression did not correlate with Snail (Figure 4B), suggesting that SOAT1 is expressed at sufficient levels even in cells with low Snail expression. We agree that measuring ACAT activity would be important, as ACATs are regulated at multiple levels. However, we consider this to be beyond the scope of the present study and plan to address it in future work.

      Minor comments

      I do not understand why in the text, Figure S1 appears after Figure S2. The authors might want to change the numbering of these two figures.

      We thank the reviewer for pointing this out. We have corrected the numbering of the supplementary figures so that Figure S1 now appears before Figure S2 in both the text and the revised figure legends.

      Page 5, line 20: Figure 1I instead of 1H.

      Page 6, line 2: Figure 1J instead of 1I; and line 9: Figure 1H instead of 1I.

      We thank the reviewer for carefully checking the figure references. We have corrected the figure numbering errors in the text as suggested.

      Reviewer #2 (Recommendations for the authors):

      For Figures 1B, 1H, 1J, 2B, 2C, 3G, S3A, and S3B, to enhance data reliability, it is necessary to conduct a quantitative analysis of the Western blot data. The average values from at least three biological replicates should be calculated, with statistical significance assessed.

      We have conducted quantitative analyses of the Western blot data for Figures 1B, 1H, 1J, 2B, 2C, 3G, S3A, and S3B. Band intensities from at least three independent biological replicates were quantified, and the mean values with statistical significance are now presented in the revised figures.

      For Figures 1D, 2A, 2D, and S2, the images of cells or tissues should not rely solely on selected fields. Quantitative analysis is required, and the mean values from at least three biological replicates should be provided with statistical significance testing.

      We have performed quantitative analyses for Figures 1D, 2A, 2D, and S2. The quantification was based on data from at least three independent biological replicates, and the mean values with statistical significance are now included in the revised figures.

      For Figures 1A, 1G, 4, and S5, evaluating ABCA1's involvement in drug resistance based solely on CsA treatment is insufficient. Demonstrating the loss of drug resistance through ABCA1 knockdown or knockout is necessary.

      We generated ABCA1 knockout EpH4–Snail cells and examined their resistance to nitidine chloride. However, knockout of ABCA1 alone did not affect resistance to the compound (Figure 2 - figure supplement 2). This may be due to secondary metabolic alterations induced by ABCA1 loss or compensatory upregulation of other LXR-induced cholesterol efflux transporters. Instead, we demonstrated that treatment with the LXR inhibitor GSK2033 reduced the nitidine chloride resistance of EpH4–Snail cells (Figure 2C), supporting the idea that enhanced efflux of antitumor agents through the LXR–ABCA1–mediated cholesterol efflux pathway contributes to nitidine chloride resistance.

      For Figure 3, to establish a causal relationship between changes in the Chol/SM balance and ABCA1 expression, it is important to test whether modifying cholesterol and SM levels to disrupt this balance affects ABCA1 expression.

      Regarding causality, as shown in Figure 2, we have already demonstrated that reducing cholesterol levels in EpH4–Snail cells decreases ABCA1 expression. To further explore this relationship, we examined whether increasing sphingomyelin levels by adding ceramide to the culture medium—thereby restoring the sphingomyelin-to-cholesterol ratio—would reduce ABCA1 expression (Figure 3H). Indeed, supplementation with C22:0 ceramide decreased ABCA1 expression, suggesting that downregulation of the VLCFA-sphingomyelin biosynthetic pathway triggers ABCA1 upregulation. Collectively, these findings support a causal relationship between the Chol/SM balance and ABCA1 expression.

      In Figure 3, if there is any information on differences in cholesterol affinity between LCFA-SM and VLCFA-SM, it would be beneficial to include it in the manuscript.

      Differences in cholesterol affinity between LCFA-SM and VLCFA-SM in cellular membranes remain controversial and have yet to be fully elucidated. The decrease in cell surface sphingomyelin content, evaluated by lysenin staining (Figure 2L), was more pronounced than that of total sphingomyelin (Figure 3A). Given that VLCFA-SMs have been suggested to undergo distinct trafficking during recycling from endosomes to the plasma membrane (Koivusalo et al. Mol Biol Cell 2007), their reduction may lead to decreased plasma membrane sphingomyelin content by altering its intracellular distribution. We have added this discussion to the revised manuscript.

      In Figure 3F, it is recommended to assess housekeeping gene expression as a control. Quantitative real-time PCR should be performed, and the average values from at least three biological replicates should be presented.

      We have performed quantitative RT-PCR analysis. The average values from at least three independent biological replicates are presented in Figure 3G.

      For Figure 3F, to show whether the reduction of CERS3 or ELOVL7 affects the Chol/SM balance and ABCA1 expression, it is necessary to investigate the phenotypes following the knockdown or knockout of these enzymes.

      We fully agree that phenotypic analyses of epithelial cells lacking CerS3 or ELOVL7 would provide valuable insights. However, we consider such investigations to be beyond the scope of the present study and plan to pursue them in future work.

      Clarifying whether similar phenotypes are induced by other EMT-related transcription factors, or if they are specific to Snail, would be beneficial.

      We agree that examining whether similar phenotypes are induced by other EMT-related transcription factors would be highly valuable for understanding the broader EMT network. However, as the focus of the present study is on lipid metabolic alterations associated with EMT—particularly the imbalance between sphingomyelin and cholesterol—we consider this investigation to be beyond the scope of the current work and plan to address it in future studies.

      There are errors in figure citations within the text that need correction:

      p.9 l.18 Fig. 3D → Fig. 3G

      p.9 l.22 Fig. 3I → Fig. 3H

      p.9 l.23 Fig. S2 → Fig. S4

      p.10 l.6 Fig. 3J → Fig. 1J

      p.10 l.8 Fig. 3J → Fig. 1J

      p.10 l.9 Fig. 3K → Fig. 3I

      p.10 l.12 Fig. 3H → Fig. 3J

      p.10 l.14 Fig. 2D and Fig. S4 → Fig. 2G and Fig. S4D

      We thank the reviewer for carefully pointing out these citation errors. We have corrected all figure references in the text as suggested.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review):

      Summary: 

      This study builds off prior work that focused on the molecule AA147 and its role as an activator of the ATF6 arm of the unfolded protein response. In prior manuscripts, AA147 was shown to enter the ER, covalently modify a subset of protein disulfide isomerases (PDIs), and improve ER quality control for the disease-associated mutants of AAT and GABAA. Unsuccessful attempts to improve the potency of AA147 have led the authors to characterize a second hit from the screen in this study: the phenylhydrazone compound AA263. The focus of this study on enhancing the biological activity of the AA147 molecule is compelling, and overcomes a hurdle of the prior AA147 drug that proved difficult to modify. The study successfully identifies PDIs as a shared cellular target of AA263 and its analogs. The authors infer, based on the similar target hits previously characterized for AA147, that PDI modification accounts for a mechanism of action for AA263. 

      Strengths: 

      The authors are able to establish that, like AA147, AA263 covalently targets ER PDIs. The work establishes the ability to modify the AA263 molecule to create analogs with more potency and efficacy for ATF6 activation. The "next generation" analogs are able to enhance the levels of functional AAT and GABAA receptors in cellular models expressing the Z-variant of AAT or an epilepsy-associated variant of the GABAA receptor, outlining the therapeutic potential for this molecule and laying the foundation for future organism-based studies. 

      We thank the reviewer for the positive comments on our manuscript. We address the reviewer's remaining comments on our work, as described below.

      Weaknesses: 

      Arguably, the work does not fully support the statement provided in the abstract that the study "reveals a molecular mechanism for the activation of ATF6". The identification of targets of AA263 and its analogs is clear. However, it is a presumption that the overlap in PDIs as targets of both AA263 and AA147 means that AA263 works through the PDIs. While a likely mechanism, this conclusion would be bolstered by establishing that knockdown of the PDIs lessens drug impact with respect to ATF6 activation. 

      We thank the reviewer for this comment. We previously showed that genetic depletion of different PDIs modestly impacts ATF6 activation afforded by ATF6-activating compounds such as AA147 (see Paxman et al (2018) ELIFE). However, as discussed in this manuscript, the ability of AA147 and AA263 to activate ATF6 signaling is mediated through polypharmacologic targeting of multiple different PDIs involved in regulating the redox state of ATF6. Thus, individual knockdowns are predicted to only minimally impact the ability of AA263 and its analogs to activate ATF6 signaling.

      To address this comment, we have tempered our language regarding the mechanism of AA263-dependent ATF6 activation through PDI targeting described herein to better reflect the fact that we have not explicitly proven that PDI targeting is responsible for this activity, as highlighted below:

      Page 7, Line 158: “Intriguingly, 12 proteins were shared between these two conditions, including 7 different ER-localized PDIs (Fig. 1H). This includes PDIs previously shown to regulate ATF6 activation including TXNDC12/ERP18.[45,46] These results are similar to those observed when comparing proteins modified by the selective ATF6 activating compound AA147<sup>yne</sup> and AA132<sup>yne</sup>.[38] Further, we found that the extent of labeling for PDIs including PDIA1, PDIA4, PDIA6, and TMX1, but not TXNDC12, showed greater modification by AA132<sup>yne</sup>, as compared to AA263<sup>yne</sup> (Fig. 1I). Similar results were observed for AA147<sup>yne</sup>.[38] This suggests that, like AA147, the selective activation of ATF6 afforded by AA263 is likely attributed to the modifications of a subset of multiple different ER-localized PDIs by this compound.”

      Alternatively, it has previously been suggested that the cell-type dependent activity of AA263 may be traced to the presence of cell-type specific P450s that allow for the metabolic activation of AA263 or cell-type specific PDIs (Plate et al 2016; Paxman et al 2018). If the PDI target profile is distinct in different cell types, and these target differences correlate with ATF6-induced activity by AA263, that would also bolster the authors' conclusion.

      As highlighted by the reviewer, different ER oxidases (e.g., P450s) could differentially influence activation of compounds such as AA263 to promote PDI modification and subsequent ATF6 activation. The specific ER oxidases responsible for AA263 activation are currently unknown; however, we anticipate that multiple different enzymes can promote this activity, making it difficult to discern the specific contribution of any one oxidase. We have made this point clearer in the revised submission, as below:

      Page 7, Line 169: “This specificity for ER proteins instead suggests the localized generation of AA263 quinone methides at the ER membrane, likely through metabolic activation by different ER localized oxidases, which has previously been shown to contribute to the selective modification of ER proteins afforded by other compounds such as AA147 [49].”

      Reviewer #2 (Public review):

      Modulating the UPR by pharmacological targeting of its sensors (or regulators) provides mostly uncharted opportunities in diseases associated with protein misfolding in the secretory pathway. Spearheaded by the Kelly and Wiseman labs, ATF6 modulators were developed in previous years that act on ER PDIs as regulators of ATF6. However, hurdles in their medicinal chemistry have hampered further development. In this study, the authors provide evidence that the small molecule AA263 also targets and covalently modifies ER PDIs, with the effect of activating ATF6. Importantly, AA263 turned out to be amenable to chemical optimization while maintaining its desired activity. Building on this, the authors show that AA263 derivatives can improve the aggregation, trafficking, and function of two disease-associated mutants of secretory pathway proteins. Together, this study provides compelling evidence for AA263 (and its derivatives) being interesting modulators of ER proteostasis. Mechanistic details of its mode of action will need more attention in future studies that can now build on this.

      We thank the reviewer for their positive comments on our manuscript. We address the reviewer’s specific queries on our work, as outlined below. 

      In detail, the authors provide strong evidence that AA263 covalently binds to ER PDIs, which will inhibit the protein disulfide isomerase activity. ER PDIs regulate ATF6, and thus their finding provides a mechanistic interpretation of AA263 activating the UPR. It should be noted, however, that AA263 shows broad protein labeling (Figure 1G), which may suggest additional targets, beyond the ones defined as MS hits in this study. 

      This is true. We do show broad proteome-wide labeling with AA263<sup>yne</sup>, which is largely reflected in the hits identified by MS beyond PDI family members. It is possible that other engaged targets, in addition to PDIs, may contribute to the activation of ATF6 signaling. Regardless, our MS analysis clearly shows that the proteins modified by AA263 are enriched for PDIs, further supporting our model whereby AA263-dependent PDI modification is likely responsible for ATF6 activation.

      Also, a further direct analysis of the IRE1 and PERK pathways (activated or not by AA263) would have been a benefit, as e.g., PDIA1, a target of AA263, directly regulates IRE1 (Yu et al., EMBOJ, 2020), and other PDIs also act on PERK and IRE1. The authors interpret modest activation of IRE1/PERK target genes (Figure 2C) as an effect on target gene overlap, indeed the most likely explanation based on their selective analyses on IRE1 (ERdj4) and PERK (CHOP) downstream genes, but direct activation due to the targeting of their PDI regulators is also a possible explanation. 

      While we do observe mild increases in IRE1/XBP1s target genes, we do not observe significant increases in PERK/ISR target genes in cells treated with optimized AA263 analogs (see Fig. 2C). We previously showed that genetic ATF6 activation leads to a modest increase in IRE1/XBP1s target genes, reflecting the overlap in target genes of the IRE1/XBP1s and ATF6 pathways (see Shoulders et al (2013) Cell Reports). However, with our data, we cannot explicitly rule out the possibility that the mild increase in IRE1/XBP1s target genes reflects direct IRE1/XBP1s activation, as suggested by the reviewer. To address this, we have adapted the text to highlight this point, now specifically referring to preferential ATF6 activation afforded by these compounds, as below:

      Page 5, Line 100: “In addition to finding AA147, our original high-throughput screen also identified the phenylhydrazone compound AA263 as a compound that preferentially activates the ATF6 arm of the UPR [26]”  

      Further key findings of this paper are the observed improvement of AAT behavior and GABAA trafficking and function. Further strength to the mechanistic conclusion that ATF6 activation causes this could be obtained by using ATF6 inhibitors/knockouts in the presence of AA263 (as the target PDIs may directly modulate the behavior of AAT and/or GABAA). 

      AA263 and related compounds could influence ER proteostasis of destabilized proteins through multiple mechanisms including ATF6 activation or direct modification of a subset of PDIs. We previously showed that AA263-dependent enhancement of A1AT-Z secretion and activity can be largely attributed to ATF6 activation (see Sun et al (2023) Cell Chem Biol). In the revised submission, we now show that increased levels of g2(R177G) afforded by treatment with AA263<sup>yne</sup> are partially blocked by co-treatment with the ATF6 inhibitor Ceapin-A7 (CP7), highlighting the contributions of ATF6 activation for this phenotype (Fig. S5B,C). Intriguingly, this result also demonstrates the benefit for targeting ER proteostasis using compounds such as our optimized AA263 analogs, as this approach allows us to enhance ER proteostasis of destabilized proteins through multiple mechanisms. We further expand on this specific point in the revised manuscript as below:

      Page 14, Line 375: “AA263 and its related analogs can influence ER proteostasis in these models through different mechanisms including ATF6-dependent remodeling of ER proteostasis and direct alterations to the activity of specific PDIs.(*) Consistent with this, we show that pharmacologic inhibition of ATF6 only partially blocks increases of g2(R177G) afforded by treatment with AA263<sup>yne</sup>, highlighting the benefit for targeting multiple aspects of ER proteostasis to enhance ER proteostasis of this disease-relevant GABA<sub>A</sub> variant. While additional studies are required to further deconvolute the relative contributions of these two mechanisms on the protection afforded by our optimized compounds, our results demonstrate the potential for these compounds to enhance ER proteostasis in the context of different protein misfolding diseases.”  

      Along the same line, it also warrants further investigation why the different compounds, even if all were used at concentrations above their EC50, had different rescuing capacities on the clients.

      This is an interesting question that we are continuing to study. While we generally observe a fairly good correlation between ATF6 activation and correction of diseases of ER proteostasis linked to proteins such as A1AT-Z or GABA<sub>A</sub> receptors, we do find, as the reviewer points out, that some compounds are more efficient at correcting proteostasis than others that activate ATF6 to similar levels. We attribute this to differences in either the labeling efficiency of PDIs or the differential regulation of various ER proteostasis factors, although this remains to be further defined. As we continue working with these (and other) compounds, we will focus on defining the molecular basis for these findings. 

      Together, the study now provides a strong basis for such in-depth mechanistic analyses.

      We agree and we are continuing to pursue the mechanistic basis of ER proteostasis remodeling afforded by these and related compounds. 

      Reviewer #3 (Public review):

      Summary: 

      This study aims to develop and characterize phenylhydrazone-based small molecules that selectively activate the ATF6 arm of the unfolded protein response by covalently modifying a subset of ER-resident PDIs. The authors identify AA263 as a lead scaffold and optimize its structure to generate analogs with improved potency and ATF6 selectivity, notably AA263-20. These compounds are shown to restore proteostasis and functional expression of disease-associated misfolded proteins in cellular models involving both secretory (AAT-Z) and membrane (GABAA receptor) proteins. The findings provide valuable chemical tools for modulating ER proteostasis and may serve as promising leads for therapeutic development targeting protein misfolding diseases.

      Strengths: 

      (1) The study presents a well-defined chemical biology framework integrating proteomics, transcriptomics, and disease-relevant functional assays. 

      (2) Identification and optimization of a new electrophilic scaffold (AA263) that selectively activates ATF6 represents a valuable advance in UPR-targeted pharmacology.

      (3) SAR studies are comprehensive and logically drive the development of more potent and selective analogs such as AA263-20.

      (4) Functional rescue is demonstrated in two mechanistically distinct disease models of protein misfolding, one involving a secretory protein and the other a membrane protein, underscoring the translational relevance of the approach. 

      We thank the reviewer for their positive comments related to our work. We address specific weaknesses highlighted by the reviewer, as outlined below. 

      Weaknesses: 

      (1) ATF6 activation is primarily inferred from reporter assays and transcriptional profiling; however, direct evidence of ATF6 cleavage is lacking.

      While ATF6 trafficking and processing can be visualized in cell culture models following severe ER insults (e.g., Tg, Tm), we showed previously that the more modest activation afforded by pharmacologic activators such as AA147 and AA263 cannot be easily visualized by monitoring ATF6 processing (see Plate et al (2016) ELIFE). As we have shown in numerous other manuscripts, we have established a transcriptional profiling approach that accurately defines ATF6 activation. We use that approach to confirm preferential ATF6 activation in this manuscript. We feel that this is sufficient for confirming ATF6 activation. However, we also now include data showing that co-treatment with ATF6 inhibitors (e.g., CP7) blocks increased expression of ATF6 target genes induced by our prioritized compound AA263<sup>yne</sup> (Fig. S1B). This further supports our assertion that this compound activates ATF6 signaling.  

      (2) While the mechanism involving PDI modification and ATF6 activation is plausible, it remains incompletely characterized. 

      We thank the reviewer for this comment. We previously showed that genetic depletion of different PDIs modestly impacts ATF6 activation afforded by ATF6 activating compounds such as AA147. However, as discussed in this manuscript, the ability of AA147 and AA263 to activate ATF6 signaling is mediated through polypharmacologic targeting of multiple different PDIs involved in regulating ATF6 redox state. Thus, individual knockdowns are predicted to only minimally impact the ability of AA263 and its analogs to activate ATF6 signaling. 

      To address this comment, we have tempered our language regarding the mechanism of AA263-dependent ATF6 activation through PDI targeting described herein to better reflect the fact that we have not explicitly proven that PDI targeting is responsible for this activity, as highlighted below:

      Page 7, Line 158: “Intriguingly, 12 proteins were shared between these two conditions, including 7 different ER-localized PDIs (Fig. 1H). This includes PDIs previously shown to regulate ATF6 activation including TXNDC12/ERP18.[45,46] These results are similar to those observed when comparing proteins modified by the selective ATF6 activating compound AA147<sup>yne</sup> and AA132<sup>yne</sup>.[38] Further, we found that the extent of labeling for PDIs including PDIA1, PDIA4, PDIA6, and TMX1, but not TXNDC12, showed greater modification by AA132<sup>yne</sup>, as compared to AA263<sup>yne</sup> (Fig. 1I). Similar results were observed for AA147<sup>yne</sup>.[38] This suggests that, like AA147, the selective activation of ATF6 afforded by AA263 is likely attributed to the modifications of a subset of multiple different ER-localized PDIs by this compound.”

      (3) No in vivo data are provided, leaving the pharmacological feasibility and bioavailability of these compounds in physiological systems unaddressed.

      We are continuing to test the in vivo activity of these compounds in work outside the scope of this initial study. 

      Reviewer #1 (Recommendations for the authors): 

      (1) First page of the discussion, last sentence. "We previously showed the relatively labeling of PDI modification directly impacts..." should be reworded.

      Thank you. We have corrected this in the revised manuscript. 

      (2) What is the rationale for measuring ERSE-Fluc activity at 18 h but RNAseq at 6 h? What is known about the timing of action for AA263?

      Compound-dependent activation of luciferase reporters requires the translation and accumulation of the luciferase protein for sufficient signal, while qPCR does not. We normally use longer incubations for reporter assays to ensure that we have sufficient quantity of reporter protein to accurately monitor activation. We have found that AA263 can rapidly increase ATF6 activity, with gene expression increases being observed after only a few hours of treatment. This is consistent with the proposed mechanism of ATF6 activation discussed herein involving metabolic activation and subsequent PDI modification.   

      (3) Figure 1 panel E and Figure S2 panel B. Are these the same data for AA263 and AA263yne, with AA263-5 added to the plot for Figure S2? If so, it would be nice to note that panel B represents data from 3 of the replicates that are shown in Figure 1 (n=6).

      Yes. The AA263 and AA263<sup>yne</sup> data shown in Fig. 1E and Fig. S2B are the same data, as these experiments were performed at the same time. We apologize for this oversight, which has now been corrected in the revised version. Note that there were n=3 replicates for the dose response shown in Fig. 1E, which we corrected in the figure legend as below:

      Fig. S2B Figure Legend: “B. Activation of the ERSE-FLuc ATF6 reporter in HEK293T cells treated for 18 h with the indicated concentration of AA263, AA263<sup>yne</sup>, or AA263-5. Error bars show SEM for n= 3 replicates. The data for AA263 and AA263<sup>yne</sup> are the same as those shown in Fig. 1E and are shown for comparison.” 

      (4) Figure S3. The legend notes 5 µM AA263-yne and 20 µM analog, whereas the figure itself outlines the same ratio but different concentrations: 10 µM and 40 µM.

      We apologize for this mistake in the legend, which has been corrected. The information in the figure is correct. 

      Reviewer #2 (Recommendations for the authors): 

      (1) The activation mechanism of ATF6 is still debated (really trafficking as a monomer?); the authors may want to word more carefully here. 

      We agree. We have corrected this in the revised manuscript to indicate that increased populations of reduced ATF6 traffic for proteolytic processing. 

      (2) In Figure 1B, below the figure, mM is written for BME, but micromolar is meant.

      Thank you. This has been corrected in the revised manuscript. 

      (3) The authors may want to make clearer, why BME does not completely inhibit AA263 and does not cause ER stress itself under the conditions tested.

      The addition of BME in our experiments is designed to shift the redox potential of the cell to increase intracellular thiol reagents, such as glutathione, that can quench ‘activated’ AA263 and its analogs. However, BME is actively being oxidized upon addition and the intracellular redox environment can rapidly equilibrate following BME addition. Thus, we do not expect that AA263 or other metabolically activated compounds will be fully quenched using this approach, as is observed. This is consistent with other experiments where we show that the use of these types of reducing agents does not fully suppress the activity of reactive molecules, instead shifting their dose-dependent activation of specific pathways.  

      (4) The data in Figure 4C seems to disagree with the other data on the tested compounds; this should be clarified. 

      It is unclear to what the reviewer is referring. The data in 4C shows that treatment with our optimized AA263 analogs improved elastase inhibition afforded by secreted A1AT, as would be predicted. 

      (5) PDIs that have been shown to regulate ATF6 should be discussed in more detail in the light of the presented data/interactome (e.g., ERp18).

      Thank you for the suggestion. We now explicitly note that AA263<sup>yne</sup> covalently modifies TXNDC12/ERP18 in our proteomic dataset. However, we also note that there is no difference in labeling of this specific PDI between AA263<sup>yne</sup> and AA132<sup>yne</sup>. This may indicate that the targeting of this protein is responsible for the larger levels of ATF6 activation afforded by both these compounds relative to AA147, with the activation of other UPR pathways afforded by AA132 resulting from increased labeling of other PDIs. We are now exploring this possibility in work outside the scope of this current manuscript. 

      Page 7, Line 158: “Intriguingly, 12 proteins were shared between these two conditions, including 7 different ER-localized PDIs (Fig. 1H). This includes PDIs previously shown to regulate ATF6 activation including TXNDC12/ERP18.[45,46] These results are similar to those observed when comparing proteins modified by the selective ATF6 activating compound AA147<sup>yne</sup> and AA132<sup>yne</sup>.[38] Further, we found that the extent of labeling for PDIs including PDIA1, PDIA4, PDIA6, and TMX1, but not TXNDC12, showed greater modification by AA132<sup>yne</sup>, as compared to AA263<sup>yne</sup> (Fig. 1I). Similar results were observed for AA147<sup>yne</sup>.[38] This suggests that, like AA147, the selective activation of ATF6 afforded by AA263 is likely attributed to the modifications of a subset of multiple different ER-localized PDIs by this compound.”

      Reviewer #3 (Recommendations for the authors):

      (1) Please consider adding detection of ATF6 cleavage by Western blot as direct evidence of AA263-induced ATF6 activation, to substantiate the central mechanistic claim.

      While ATF6 trafficking and processing can be visualized in cell culture models following severe ER insults (e.g., Tg, Tm), we showed previously that the more modest activation afforded by pharmacologic activators such as AA147 and AA263 cannot be easily visualized through monitoring ATF6 proteolytic processing by western blotting (see Plate et al (2016) ELIFE). As we have shown in numerous other manuscripts, we have established a transcriptional profiling approach that accurately defines ATF6 activation. We use that approach to confirm preferential ATF6 activation in this manuscript. We feel that this is sufficient for confirming ATF6 activation. However, we also now include qPCR data showing that co-treatment with ATF6 inhibitors (e.g., CP7) blocks increased expression of ATF6 target genes induced by our prioritized compounds. 

      (2) To strengthen causal inference, loss-of-function experiments such as PDI knockdown, cysteine mutant inactivation, or reconstitution studies may be informative.

      We thank the reviewer for this comment. We previously showed that genetic depletion of different PDIs modestly impacts ATF6 activation afforded by ATF6 activating compounds such as AA147. However, as discussed in this manuscript, the ability of AA147 and AA263 to activate ATF6 signaling is mediated through polypharmacologic targeting of multiple different PDIs involved in regulating ATF6 redox state rather than a single PDI family member. Thus, individual knockdowns are predicted to only minimally impact the ability of AA263 and its analogs to activate ATF6 signaling. 

      To address this comment, we have tempered our language regarding the mechanism of AA263-dependent ATF6 activation through PDI targeting described herein to better reflect the fact that we have not explicitly proven that PDI targeting is responsible for this activity.

      (3) Since β-mercaptoethanol inhibits ATF6 activation, it would be helpful to examine whether DTT also suppresses the activity of AA263 or its analogs, to clarify the redox sensitivity of the mechanism.

      The use of reducing agents stronger than BME, such as DTT, globally activates the UPR, including the ATF6 arm of the UPR. Thus, we are unable to perform the requested experiments. We specifically use BME because it is a sufficiently mild reducing agent that can quench reactive metabolites (e.g., activated AA263 analogs) through alterations in cellular glutathione levels without globally activating the UPR.  

      (4) Given the electrophilic nature of AA263, which may allow it to react with endogenous thiols (e.g., glutathione or cysteine), a brief discussion or experimental validation of this potential liability would enhance the interpretation of in vivo applicability.

      Metabolically activated AA263, like AA147, can be quenched by endogenous thiols such as glutathione. However, treatment with our metabolically activatable electrophiles AA147 and AA263, either in vitro or in vivo, does not seem to induce activation of the NRF2-regulated oxidative stress response (OSR) in the cell lines used in this manuscript (e.g., Fig. S2C). This suggests that treatment with these compounds does not globally disrupt the intracellular redox state, at least in the tested cell lines. While AA147 has been shown to activate NRF2 in specific neuronal cell lines and in primary neurons, AA147 does not activate NRF2 signaling in other non-neuronal cell lines or other tissues (see Rosarda et al (2021) ACS Chem Bio). We are currently testing the potential for AA263 to similarly activate adaptive NRF2 signaling in neuronal cells. Regardless, AA147, which functions through a similar mechanism to that proposed for AA263, has been shown to be beneficial in multiple models of disease both in vitro and in vivo. This indicates that this mechanism of action is suitable for continued translational development to mitigate pathologic ER proteostasis disruption observed in diverse types of human disease.  

      (5) Evaluation of in vivo activity, such as BiP induction in the liver following intraperitoneal administration of AA263-20 or related analogs, could substantially increase the translational impact of the work.

      We are continuing to probe the activity of our optimized AA263 analogs in vivo in work outside the scope of this current manuscript. We thank the reviewer for this suggestion. 

      (6) The degree of BiP induction may also be contextualized by comparison with known ER stress inducers such as thapsigargin or tunicamycin, ideally by providing relative dose-equivalent responses.

      We are not sure to what the reviewer is referring. We show comparative activation of ATF6 in cells treated with the ER stressor Tg and our compounds by both reporter assay (e.g., Fig. 2B) and qPCR of the ATF6 target gene BiP (HSPA5) (Fig. S2A). We feel that this provides context for the more physiologic levels of ATF6 activation afforded by these compounds.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      This paper presents a computational model of the evolution of two different kinds of helping ("work," presumably denoting provisioning, and defense tasks) in a model inspired by cooperatively breeding vertebrates. The helpers in this model are a mix of previous offspring of the breeder and floaters that might have joined the group, and can either transition between the tasks as they age or not. The two types of help have differential costs: "work" reduces "dominance value," (DV), a measure of competitiveness for breeding spots, which otherwise goes up linearly with age, but defense reduces survival probability. Both eventually might preclude the helper from becoming a breeder and reproducing. How much the helpers help, and which tasks (and whether they transition or not), as well as their propensity to disperse, are all evolving quantities. The authors consider three main scenarios: one where relatedness emerges from the model, but there is no benefit to living in groups, one where there is no relatedness, but living in larger groups gives a survival benefit (group augmentation, GA), and one where both effects operate. The main claim is that evolving defensive help or division of labor requires the group augmentation; it doesn't evolve through kin selection alone in the authors' simulations.

      This is an interesting model, and there is much to like about the complexity that is built in. Individual-based simulations like this can be a valuable tool to explore the complex interaction of life history and social traits. Yet, models like this also have to take care of both being very clear on their construction and exploring how some of the ancillary but potentially consequential assumptions affect the results, including robust exploration of the parameter space. I think the current manuscript falls short in these areas, and therefore, I am not yet convinced of the results. In this round, the authors provided some clarity, but some questions still remain, and I remain unconvinced by a main assumption that was not addressed.

      Based on the authors' response, if I understand the life history correctly, dispersers either immediately join another group (with probability 1 minus the probability of dispersing), or remain floaters until they successfully compete for a breeder spot or die? Is that correct? I honestly cannot decide because this seems implicit in the first response but the response to my second point raises the possibility of not working while floating but can work if they later join a group as a subordinate. If it is the case that floaters can have multiple opportunities to join groups as subordinates (not as breeders; I assume that this is the case for breeding competition), this should be stated, with more details about how. So there is still some clarification to be done, and more to the point, the clarification that happened only happened in the response. The authors should add these details to the main text. Currently, the main text only says vaguely that joining a group after dispersing "is also controlled by the same genetic dispersal predisposition" without saying how.

      In each breeding cycle, individuals have the opportunity to become a breeder, a helper, or a floater. Social role is really just a state, and that state can change in each breeding cycle (see Figure 1). Therefore, floaters may join a group as subordinates at any point in time depending on their dispersal propensity, and subordinates may also disperse from their natal group at any given time. In the “Dominance-dependent dispersal propensities” section in the SI, this dispersal or philopatric tendency varies with dominance rank.

      We have added: “In each breeding cycle” (L415) to clarify this further.
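
      The per-cycle change of state described above can be illustrated with a minimal sketch. It is not the simulation code used in the manuscript. It assumes, purely for illustration, that a helper leaves its group with probability `dispersal` and that a floater joins a group as a subordinate with probability 1 - `dispersal`; competition for breeding positions, mortality, and rank-dependent dispersal are left out, and the names `next_role` and `simulate` are hypothetical.

```python
import random


def next_role(role: str, dispersal: float, rng: random.Random) -> str:
    """Return the role of a non-breeder in the next breeding cycle.

    Illustrative assumption: a helper disperses with probability `dispersal`;
    a floater joins a group as a subordinate with probability 1 - `dispersal`.
    """
    if role == "helper":
        return "floater" if rng.random() < dispersal else "helper"
    if role == "floater":
        return "helper" if rng.random() >= dispersal else "floater"
    raise ValueError(f"unexpected role: {role}")


def simulate(role: str, dispersal: float, cycles: int, seed: int = 0) -> list:
    """Track one individual's role over several breeding cycles."""
    rng = random.Random(seed)
    history = [role]
    for _ in range(cycles):
        role = next_role(role, dispersal, rng)
        history.append(role)
    return history


if __name__ == "__main__":
    # With an intermediate propensity the role can switch in either direction
    # from one cycle to the next, so floating is not a lifetime state.
    print(simulate("helper", dispersal=0.5, cycles=10))
```

      Under these assumptions, a role is fixed for life only at the extremes (`dispersal` = 0 or 1); any intermediate value allows an individual to alternate between helping and floating across breeding cycles.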

      In response to my query about the reasonableness of the assumption that floaters are in better condition (in the KS treatment) because they don't do any work, the authors have done some additional modeling but I fail to see how that addresses my point. The additional simulations do not touch the feature I was commenting on, and arguably make it stronger (since assuming a positive beta_r, which btw is listed as 0 in Table 1, would make floaters on average even stronger than subordinates). It also again confuses me with regard to the previous point, since it implies that now dispersal is also potentially a lifetime event. Is that true?

      We are not quite sure where the reviewer gets this idea because we have never assumed a competitive advantage of floaters versus helpers. As stated in the previous revision, floaters can potentially outcompete subordinates of the same age if they attempt to breed without first queuing as a subordinate (step 5 in Figure 1) if subordinates are engaged in work tasks. However, floaters also have higher mortality rates than group members, which lowers their average age. In addition, helpers have the advantage of always competing for an open breeding position in the group, while floaters do not have this preferential access (in Figure S2 we reduce even further the likelihood of a floater to try to compete for a breeding position).

      Moreover, in the previous revision (section: “Dominance-dependent dispersal propensities” in the SI) we specifically addressed this concern by adding the possibility that individuals, either floaters or subordinate group members, react to their rank or dominance value to decide whether to disperse (if subordinate) or join a group (if floater). Hence, individuals may choose to disperse when low ranked and then remain on the territory they dispersed to as helpers, OR they may remain as helpers in their natal territory as low ranked individuals and then disperse later when they attain a higher dominance value. The new implementation, therefore, allows individuals to choose when to become floaters or helpers depending on their dominance value. This change to the model affects the relative competitiveness between floaters and helpers, which avoids the assumption that either low- or high-quality individuals are the dispersing phenotype and, instead, allows rank-based dispersal as an emergent trait. As shown in Figure S5, this change had no qualitative impact on the results.

      To make this all clearer, we have now added to all of the relevant SI tables a new row with the relative rank of helpers vs floaters. As shown, floaters do not consistently outrank helpers. Rather, which role is most dominant depends on the environment and fitness trade-offs that shape their dispersing and helping decisions.

      Some further clarifications: beta_r is a gene that may evolve to either positive or negative values depending on evolutionary trade-offs; 0 (no reaction norm of dispersal to dominance rank) is only the initial value in the simulations before evolution takes place. Also, as clarified in the previous comment, the decision to disperse or not occurs at each breeding cycle, so becoming a floater, for example, is not a lifetime event unless individuals evolve a fixed strategy (dispersal = 0 or 1). 

      Meanwhile, the simplest and most convincing robustness check, which I had suggested last round, is not done: simply reduce the increase in the R of the floater by age relative to subordinates. I suspect this will actually change the results. It seems fairly transparent to me that an average floater in the KS scenario will have R about 15-20% higher than the subordinates (given no defense evolves, y_h=0.1 and H_work evolves to be around 5, and the average lifespan for both floaters and subordinates are in the range of 3.7-2.5 roughly, depending on m). That could be a substantial advantage in competition for breeding spots, depending on how that scramble competition actually works. I asked about this function in the last round (how non-linear is it?) but the authors seem to have neglected to answer.

      As we mentioned in the previous comment above, we have now added the relative rank between helpers and floaters to all the relevant SI tables, to provide a better idea of the relative competitiveness of residents versus dispersers for each parameter combination. As seen in Table S1, the difference in rank is only marginally in favor of floaters in the “Only kin selection” implementation. This advantage only becomes more pronounced when individuals can choose whether to disperse or remain philopatric depending on their rank. In this case, the difference in rank between helpers and floaters is driven by the high levels of dispersal, with only a few newborns (low rank) remaining briefly in the natal territory (Table S6). Instead, the high dispersal rates observed under the “Only kin selection” scenario appear to result from the low incentives to remain in the group when direct fitness benefits are absent, unless indirect fitness benefits are substantially increased. This effect is reinforced by the need for task partitioning to occur in an all-or-nothing manner (see the new implementation added to the “Kin selection and the evolution of division of labor” in the Supplementary materials; more details in following comments).

      In addition, we specifically chose not to impose this constraint of forcing floaters to be lower rank than helpers because doing so would require strong assumptions on how the floaters' rank is determined. These assumptions are unlikely to be universally valid across natural populations (and probably not commonly met in most species) and could vary considerably among species. Therefore, it would add complexity to the model while reducing generalizability.

      As stated in the previous revision, no scramble competition takes place. This was an implementation, not included in the final version of the manuscript, in which age did not influence dominance. Results were equivalent, and we decided to remove it for simplicity prior to the original submission, as the model is already very complex at its current stage; we simply forgot to remove it from Table 1, as we explained in the previous round of revisions.

      More generally, I find that the assumption (and it is an assumption) floaters are better off than subordinates in a territory to be still questionable. There is no attempt to justify this with any data, and any data I can find points the other way (though typically they compare breeders and floaters, e.g.: https://bioone.org/journals/ardeola/volume-63/issue-1/arla.63.1.2016.rp3/The-Unknown-Life-of-Floaters--The-Hidden-Face-of/10.13157/arla.63.1.2016.rp3.full concludes "the current preliminary consensus is that floaters are 'making the best of a bad job'."). I think if the authors really want to assume that floaters have higher dominance than subordinates, they should justify it. This is driving at least one and possibly most of the key results, since it affects the reproductive value of subordinates (and therefore the costs of helping).

      We explicitly addressed this in the previous revision in a long response about resource holding potential (RHP). Once again, we do NOT assume that dispersers are at a competitive advantage to anyone else. Floaters lack access to a territory unless they either disperse into an established group or colonize an unoccupied territory. Therefore, floaters endure higher mortalities due to the lack of access to territories and group living benefits in the model, and are not always able to try to compete for a breeding position.

      The literature reports mixed evidence regarding the quality of dispersing individuals, with some studies identifying them as low-quality and others as high-quality, attributing this to them experiencing fewer constraints when dispersing than their counterparts (e.g. Stiver et al. 2007 Molecular Ecology; Torrents‐Ticó et al. 2018 Journal of Zoology). Additionally, dispersal can provide end-of-queue individuals in their natal group an opportunity to join a queue elsewhere that offers better prospects, outcompeting current group members (Nelson‐Flower et al. 2018 Journal of Animal Ecology). Moreover, in our model floaters do not consistently have lower dominance values or ranks than helpers, and dominance value is often only marginally different.

      In short, we previously addressed the concern regarding the relative competitiveness of floaters compared to subordinate group members. To further clarify this point here, we have now included additional data on relative rank in all of the relevant SI tables. We hope that these additions will help alleviate any remaining concerns on this matter.

      Regarding division of labor, I think I was not clear so I will try again. The authors assume that the group reproduction is 1+H_total/(1+H_total), where H_total is the sum of all the defense and work help, but with the proviso that if one of the totals is higher than "H_max", the average of the two totals (plus k_m, but that's set to a low value, so we can ignore it), it is replaced by that. That means, for example, if total "work" help is 10 and "defense" help is 0, total help is given by 5 (well, 5.1 but will ignore k_m). That's what I meant by "marginal benefit of help is only reduced by a half" last round, since in this scenario, adding 1 to work help would make total help go to 5.5 vs. adding 1 to defense help which would make it go to 6. That is a pretty weak form of modeling "both types of tasks are necessary to successfully produce offspring" as the newly added passage says (which I agree with), since if you were getting no defense but a lot of food, adding more food should plausibly have no effect on your production whatsoever (not just half of adding a little defense). This probably explains why often the "division of labor" condition isn't that different than the no DoL condition.

      The model incorporates division of labor as the optimal strategy for maximizing breeder productivity, while penalizing helping efforts that are limited to either work or defense alone. Because the model does not intend to force the evolution of help as an obligatory trait (breeders may still reproduce in the absence of help; k<sub>0</sub> ≠ 0), we assume that the performance of both types of task by the helpers is a non-obligatory trait that complements parental care.

      That said, we recognize the reviewer’s concern that the selective forces modeled for division of labor might not be sufficient in the current simulations. To address this, we have now introduced a new implementation, as discussed in the “Kin selection and the evolution of division of labor” section in the SI. In this implementation, division of labor becomes obligatory for breeders to gain a productivity boost from the help of subordinate group members. The new implementation tests whether division of labor can arise solely from kin selection benefits. Under these premises, philopatry and division of labor do emerge through kin selection, but only when there is a tenfold increase in productivity per unit of help compared to the default implementation. Thus, even if such increases are biologically plausible, they are more likely to reflect the magnitudes characteristic of eusocial insects rather than of cooperatively breeding vertebrates (the primary focus of this model). Such extreme requirements for productivity gains and need for coordination further suggest that group augmentation, and not kin selection, is probably the primary driving force particularly in harsh environments. This is now discussed in L210-213.

      Reviewer #2 (Public review):

      Summary:

      This paper formulates an individual-based model to understand the evolution of division of labor in vertebrates. The model considers a population subdivided in groups, each group has a single asexually-reproducing breeder, other group members (subordinates) can perform two types of tasks called "work" or "defense", individuals have different ages, individuals can disperse between groups, each individual has a dominance rank that increases with age, and upon death of the breeder a new breeder is chosen among group members depending on their dominance. "Workers" pay a reproduction cost by having their dominance decreased, and "defenders" pay a survival cost. Every group member receives a survival benefit with increasing group size. There are 6 genetic traits, each controlled by a single locus, that control propensities to help and disperse, and how task choice and dispersal relate to dominance. To study the effect of group augmentation without kin selection, the authors cross-foster individuals to eliminate relatedness. The paper allows for the evolution of the 6 genetic traits under some different parameter values to study the conditions under which division of labour evolves, defined as the occurrence of different subordinates performing "work" and "defense" tasks. The authors envision the model as one of vertebrate division of labor.

      The main conclusion of the paper is that group augmentation is the primary factor causing the evolution of vertebrate division of labor, rather than kin selection. This conclusion is drawn because, for the parameter values considered, when the benefit of group augmentation is set to zero, no division of labor evolves and all subordinates perform "work" tasks but no "defense" tasks.

      Strengths:

      The model incorporates various biologically realistic details, including the possibility to evolve age polyethism where individuals switch from "work" to "defence" tasks as they age or vice versa, as well as the possibility of comparing the action of group augmentation alone with that of kin selection alone.

      Weaknesses:

      The model and its analysis are limited, which makes the results insufficient to reach the main conclusion that group augmentation and not kin selection is the primary cause of the evolution of vertebrate division of labor. There are several reasons.

      First, the model strongly restricts the possibility that kin selection is relevant. The two tasks considered essentially differ only by whether they are costly for reproduction or survival. "Work" tasks are those costly for reproduction and "defense" tasks are those costly for survival. The two tasks provide the same benefits for reproduction (eqs. 4, 5) and survival (through group augmentation, eq. 3.1). So, whether one, the other, or both tasks evolve presumably only depends on which task is less costly, not really on which benefits it provides. As the two tasks give the same benefits, there is no possibility that the two tasks act synergistically, where performing one task increases a benefit (e.g., increasing someone's survival) that is going to be compounded by someone else performing the other task (e.g., increasing that someone's reproduction). So, there is very little scope for kin selection to cause the evolution of labour in this model. Note synergy between tasks is not something unusual in division of labour models, but is in fact a basic element in them, so excluding it from the start in the model and then making general claims about division of labour is unwarranted. I made this same point in my first review, although phrased differently, but it was left unaddressed.

      The scope of this paper was to study division of labor in cooperatively breeding species with fertile workers, in which help is exclusively directed towards breeders to enhance offspring production (i.e., alloparental care), as we stated in the previous review. Therefore, in this context, helpers may only obtain fitness benefits directly or indirectly by increasing the productivity of the breeders. This benefit is maximized when division of labor occurs between group members as there is a higher return for the least amount of effort per capita. Our focus is in line with previous work in most other social animals, including eusocial insects and humans, which emphasizes how division of labor maximizes group productivity. This is not to suggest that the model does not favor synergy, as engaging in two distinct tasks enhances the breeders' productivity more than if group members were to perform only one type of alloparental care task. We have expanded on the need for division of labor by making the performance of each type of task a requirement to boost the breeders' productivity; see our response to a later comment for more details.

      Second, the parameter space is very little explored. This is generally an issue when trying to make general claims from an individual-based model where only a very narrow parameter region has been explored of a necessarily particular model. However, in this paper, the issue is more evident. As in this model the two tasks ultimately only differ by their costs, the parameter values specifying their costs should be varied to determine their effects. Instead, the model sets a very low survival cost for work (yh=0.1) and a very high survival cost for defense (xh=3), the latter of which can be compensated by the benefit of group augmentation (xn=3). Some very limited variation of xh and xn is explored, always for very high values, effectively making defense unevolvable except if there is group augmentation. Hence, as I stated in my previous review, a more extensive parameter exploration addressing this should be included, but this has not been done. Consequently, the main conclusion that "division of labor" needs group augmentation is essentially enforced by the limited parameter exploration, in addition to the first reason above.

      We systematically explored the parameter landscape and report in the body of the paper only those ranges that lead to changes in the reaction norms of interest (other ranges are explored in the SI). When looking into the relative magnitude of cost of work and defense tasks, it is important to note that cost values are not directly comparable because they affect different traits. However, the ranges of values capture changes in the reaction norms that lead to rank-dependent task specialization.

      To illustrate this more clearly, we have added a new section in the SI (Variation in the cost of work tasks instead of defense tasks section) showing variation in y<sub>h</sub>, which highlights how individuals trade off the relative costs of different tasks. As shown, the results remain consistent with everything we showed previously: a higher cost of work (high y<sub>h</sub>) shifts investment toward defense tasks, while a higher cost of defense (high x<sub>h</sub>) shifts investment toward work tasks.

      Importantly, additional parameter values were already included in the SI of the previous revision, specifically to favor the evolution of division of labor under only kin selection. Basically, division of labor under only kin selection does happen, but only under conditions that are very restrictive, as discussed in the “Kin selection and the evolution of division of labor” section in the SI. We have tried to make this point clearer now (see comments to previous reviewer above, and to this reviewer right below).

      Third, what is called "division of labor" here is an overinterpretation. When the two tasks evolve, what exists in the model is some individuals that do reproduction-costly tasks (so-called "work") and survival-costly tasks (so-called "defense"). However, there are really no two tasks that are being completed, in the sense that completing both tasks (e.g., work and defense) is not necessary to achieve a goal (e.g., reproduction). In this model there is only one task (reproduction, equation 4,5) to which both "tasks" contribute equally and so one task doesn't need to be completed if the other task compensates for it. So, this model does not actually consider division of labor.

      Although it is true that we did not make the evolution of help obligatory and, therefore, did not impose division of labor by definition, the assumptions of the model nonetheless create conditions that favor the emergence of division of labor. This is evident when comparing the equilibria between scenarios where division of labor was favored versus not favored (Figure 2 triangles vs circles).

      That said, we acknowledge the reviewer’s concern that the selective forces modeled in our simulations may not, on their own, be sufficient to drive the evolution of division of labor under only kin selection. Therefore, we have now added a section where we restrict the evolution of help to instances in which division of labor is necessary to have an impact on the dominant breeder's productivity. Under this scenario, we do find division of labor (as well as philopatry) evolving under only kin selection. However, this behavior only evolves when help highly increases the breeders’ productivity (by a factor of 10 relative to what is needed for the evolution of division of labor under group augmentation). Therefore, group augmentation still appears to be the primary driver of division of labor, while kin selection facilitates it and may, under certain restrictive circumstances, also promote division of labor independently (discussed in L210-213).

      Reviewer #1 (Recommendations for the authors):

      I really think you should do the simulations where floaters do not come out ahead by floating. That will likely change the result, but if it doesn't, you will have a more robust finding. If it does, then you will have understood the problem better.

      As we outlined in the previous round of revisions, implementing this change would be challenging without substantially increasing model complexity and reducing its general applicability, as it would require strong assumptions that could heavily influence dispersal decisions. For instance, by how much should helpers outcompete floaters? Would a floater be less competitive than a helper regardless of age, or only if age is equal? If competitiveness depends on equal age, what is the impact of performing work tasks given that workers always outcompete immigrants? Conversely, if floaters are less competitive regardless of age, is it realistic that a young individual would outcompete all immigrants? If a disperser finds a group immediately after dispersal versus floating for a while, is the dominance value reduced less (as would happen to individuals that prospect before dispersal)?

      Clearly it is not as simple as the referee suggests because there are many scenarios that would need to be considered and many assumptions made in doing this. As we explained in response to the points above, we think our treatment of floaters is consistent with the definition of floaters in the literature, and our model takes a general approach without making too many assumptions.

      Reviewer #2 (Recommendations for the authors):

      The paper's presentation is still unclear. A few instances include the following. It is unclear what is plotted in the vertical axes of Figure 2, which is T but T is a function of age t, so this T is presumably being plotted at a specific t, but which one is not stated.

      The values graphed are the averages of the phenotypically expressed tasks, not the reaction norms per se. We have now rewritten the axis label to “Expressed task allocation T (0 = work, 1 = defense)” to increase clarity across the manuscript.

      The section titled "The need for division of labor" in the methods is still very unclear.

      We have rephrased this whole section to improve clarity.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Nielsen et al. have identified a new disease mechanism underlying hypoplastic left heart syndrome due to variants in ribosomal protein genes that lead to impaired cardiomyocyte proliferation. This detailed study starts with an elegant screen in stem cell-derived cardiomyocytes and whole genome sequencing of human patients and extends to careful functional analysis of RP gene variants in fly and fish models. Striking phenotypic rescue is seen by modulating known regulators of proliferation, including the p53 and Hippo pathways. Additional experiments suggest that the cell type specificity of the variants in these ubiquitously expressed genes may result from genetic interactions with cardiac transcription factors. This work positions RPs as important regulators of cardiomyocyte proliferation and differentiation involved in the etiology of HLHS, although the downstream mechanisms are unclear.

      We thank Reviewer 1 for the thoughtful assessment of our manuscript. Our point-by-point responses to the recommendations are provided (Reviewer 1, “Recommendations for the authors”).

      Reviewer #2 (Public review):

      Tanja Nielsen et al. present a novel strategy for the identification of candidate genes in Congenital Heart Disease (CHD). Their methodology, which is based on comprehensive experiments across cell models, Drosophila and zebrafish models, represents an innovative, refreshing and very useful set of tools for the identification of disease genes, in a field that is struggling with exactly this problem. The authors have applied their methodology to investigate the pathomechanisms of Hypoplastic Left Heart Syndrome (HLHS) - a severe and rare subphenotype in the large spectrum of CHD malformations. Their data convincingly implicates ribosomal proteins (RPs) in growth and proliferation defects of cardiomyocytes, a mechanism which is suspected to be associated with HLHS.

      By whole genome sequencing analysis of a small cohort of trios (25 HLHS patients and their parents), the authors investigated a possible association between RP encoding genes and HLHS. Although the possible association between defective RPs and HLHS needs to be verified, the results suggest a novel disease mechanism in HLHS, which is a potentially substantial advance in our understanding of HLHS and CHD. The conclusions of the paper are based on solid experimental evidence from appropriate high- to medium-throughput models, while additional genetic results from an independent patient cohort are needed to verify an association between RP encoding genes and HLHS in patients.

      We thank Reviewer 2 for the thoughtful assessment of our manuscript. Our point-by-point responses to the recommendations are provided (Reviewer 2, “Recommendations for the authors”).

      Reviewer #1 (Recommendations for the authors): 

      (1) Despite an interesting surveillance model, the disease-causing mechanisms directly downstream of the RP variants remain unclear. Can the authors provide any evidence for abnormal ribosomes or defects in translation in cells harboring such variants? The possibility that reduced translation of cardiac transcription factors such as TBX5 and NKX2-5 may contribute to the functional interactions observed should be considered. How do the authors consider that the RP variants are affecting transcript levels as observed in the study?

      Our model implies that cell cycle arrest does not require abnormal ribosomes or translational defects but instead relies on the sensing of RP levels or mutations as a fitness-sensing mechanism that activates TP53/CDKN1A-dependent arrest. Supporting this framework, we observed no significant changes in TBX5 or NKX2-5 expression (data not shown), but rather an upregulation of CDKN1A levels upon RP KD.

      (2) The authors suggest that a nucleolar stress program is activated in cells harboring RP gene variants. Can they provide additional evidence for this beyond p53 activation? 

      We added additional data to support nucleolar stress (Suppl. Fig. 6) and text (lines 526-35):

      To determine whether cardiac KD of RpS15Aa causes nucleolar stress in the Drosophila heart, we stained larval hearts for Fibrillarin, a marker for nucleoli and nucleolar integrity. We found that RpS15Aa KD causes expansion of nucleolar Fibrillarin staining in cardiomyocytes, which is a hallmark of nucleolar stress (Suppl. Fig. 6A-C). As a control, we also performed cardiac KD of Nopp140, which is known to cause nucleolar stress upon loss-of-function. We found a similar expansion of Fibrillarin staining in larval cardiomyocyte nuclei (Suppl. Fig. 6C,D). This suggests that RpS15Aa KD indeed causes nucleolar stress in the Drosophila heart, which likely contributes to the dramatic heart loss in adults.

      Other recommendations: 

      (3) Concerning the cell type specificity, in the proliferation screen, were similar effects seen on the actinin-negative as on the actinin-positive EdU+ cells? It would be helpful to refer to the fibroblast result shown in Supplementary Figure 1C in the results section.

      As suggested by reviewer #1, we have added a reference to Supplementary Fig. 1C, D and noted that RP knockdown exerts a non–CM-specific effect on proliferation.

      (4) The authors refer to HLHS patients with atrial septal defects and reduced right ventricular ejection fraction. Please clarify the specificity of the new findings to HLHS versus other forms of CHD, as implied in several places in the manuscript, including the abstract.

      This study focused on a cohort of 25 HLHS proband-parent trios selected for poor clinical outcome, including restrictive atrial septal defect and reduced right ventricular ejection fraction. We have revised the following sentence in response to the Reviewer’s comment (lines 567-571): “While our study highlights the potential of this approach for gene prioritization, additional research is needed to directly demonstrate the functional consequence of the identified genetic variants, verify an association between RP encoding genes and HLHS in other patient cohorts with and without poor outcome, and determine if RP variants have a broader role in CHD susceptibility.”

      (5) The multi-model approach taken by the authors is clearly a good system for characterizing disease-causing variants. Did the authors score for cardiomyocyte proliferation or the time of phenotypic onset in the zebrafish model? 

      We used an antibody against phosphohistone 3 to identify proliferating cells and DAPI to identify all cardiac cells in control-injected embryos, rps15a morphants, and rps15a crispants. We found that cell numbers and proliferating cells were significantly reduced at 24 and 48 hpf. By 72 hpf, cardiac cell proliferation is greatly diminished even in controls, where proliferation typically declines.

      Reduced ventricular cardiomyocyte numbers could potentially result from impaired addition of LTBP3-expressing progenitors. In experiments where altered cardiac rhythm is observed, please comment on the possible links to proliferation.

      Heart function data showed that heart period (R-R interval) was unaffected in morphants and crispants at 72 hpf where we also observed significant reductions in cell numbers. This suggests that the bradycardia observed in the rps15a + nkx2.5 or tbx5a double KD (Sup. Fig. 5D & E) was not due to the reduction in cell numbers alone. 

      Author response image 1.

      Finally, the use of the mouse to model HLHS in potential follow-up studies should be discussed. 

      We have added a mouse model comment to the discussion (lines 571-74): “In conclusion, we propose that the approach outlined in this study provides a novel framework for rapidly prioritizing candidate genes and systematically testing them, individually or in combination, using a CRISPR/Cas9 genome-editing strategy in mouse embryos (PMID: 28794185)”.

      (6) When the authors scored proliferation in cells from the proband in family 75H, did they validate that RPS15A expression is reduced, consistent with a regulatory region defect? 

      Good point. We examined RPS15A expression in these cells and found no significant reduction in gene expression in day 25 cardiomyocytes (data not shown). One possible explanation is that this variant may regulate RPS15A expression in a stage-specific manner during differentiation or under additional stress conditions.

      (7) Minor point. Typo on line 494: comma should be placed after KD, not before.

      Thank you, this has now been corrected (new line 490).

      Reviewer #2 (Recommendations for the authors):  

      (1) The authors are invited to revise the part of the manuscript that describes the genetic analysis and provide a more balanced discussion of the WGS data, with a conclusion that aligns with the strength of the human genetic data. 

      We disagree with reviewer #2’s assessment. The goal of our study is not to apply a classical genetic approach to establish variant pathogenicity, but rather to employ a multidisciplinary framework to prioritize candidate genes and variants and to examine their roles in heart development using model systems. In this context, genetic analysis serves primarily as a filtering tool rather than as a means of definitively establishing causality.

      (2) The genetic analysis of patients does not appear to provide strong evidence for an association between RP gene variants and HLHS. More information regarding methodology and the identified variants is needed. 

      HLHS is widely recognized as an oligogenic and heterogeneous genetic disease in which traditional genetic analyses have consistently failed to prioritize any specific gene class, as Reviewer #2 points out. Therefore, relying solely on genetic analysis is unlikely to yield strong evidence for association with a given gene class. This limitation provides the rationale for our multidisciplinary gene prioritization strategy, which leverages model systems to interrogate candidate gene function. Ultimately, definitive validation of this approach will require studies in relevant in vivo models to establish causality within the context of a four-chambered heart (see also Discussion).

      In Table S2, it would be appropriate to provide information on sequence, MAF, and CADD. Please note the source of MAF% (GnomAD version?, which population?).  

      As summarized in Figure 2A, the 292 genes from the families of the 25 probands with poor outcome, displayed in Supplemental Table 2, fulfilled a comprehensive candidate gene prioritization algorithm based on the variant, gene, inheritance, and enrichment, which required all of the following: 1) variants identified by whole genome sequencing with minor allele frequency <1%; 2) missense, loss-of-function, canonical splice, or promoter variants; 3) upper quartile fetal heart expression; and 4) de novo or recessive inheritance. Unbiased network analysis of these 292 genes, which are displayed in Supplemental Table 2 for completeness, identified statistically significant enrichment of ribosomal proteins. The details about MAF, CADD score, and sequence highlighted by the Reviewer are provided for the RP genes in Table 1, which are central to the focus and findings of the manuscript.

      It would also be helpful for the reader if genome coordinates (e.g., 16-11851493-G-A for RSL1D1 p.A7V) were provided for each variant in both Table 1 and S2.

      Genome coordinates have been added to Table 1.

      (3) The dataset from the hPSC-CM screen could be of high value for the community. It would be appropriate if the complete dataset were made available in a usable format. 

      The dataset from the hPSC-CM screen has been added to the manuscript as Supp Table 1.

      (4) The "rare predicted-damaging promoter variant in RPS15A" (c.-95G>A) does not appear so rare. Considering the MAF of 0.00662, the frequency of heterozygous carriers of this variant is 1 out of 76 individuals in the general population. Thus, considering the frequency of HLHS in the population (2-3 out of 10,000) and the small size of family 75H, the data do not appear to indicate any association between this particular variant and HLHS. The variants in Table 1 also appear to have relatively mild effects on the gene product, judging from the MAF and CADD scores. The authors are invited to discuss why they find these variants disease-causing in HLHS.
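
      The reviewer's "1 out of 76" figure can be reproduced with a short calculation. The sketch below assumes Hardy-Weinberg proportions, so that the expected heterozygote frequency is 2p(1-p); the reviewer does not state which formula was used, and the function name is illustrative only.

```python
def carrier_frequency(maf: float) -> float:
    """Expected heterozygous carrier frequency, 2p(1-p).

    Assumes Hardy-Weinberg proportions (an assumption of this sketch,
    not something stated in the review).
    """
    if not 0.0 <= maf <= 1.0:
        raise ValueError("maf must lie in [0, 1]")
    return 2.0 * maf * (1.0 - maf)


maf = 0.00662                 # minor allele frequency quoted by the reviewer
het = carrier_frequency(maf)  # expected fraction of heterozygous carriers
one_in = 1.0 / het            # "1 in N" form of the same number

print(f"heterozygous carriers: {het:.4f} (about 1 in {one_in:.0f})")
```

      Under this assumption the result rounds to roughly 1 carrier in 76 individuals, matching the figure quoted in the comment.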

      Our study design is based on the widely held premise that HLHS is an oligogenic disorder. Our multi-model systems platform centered on comprehensive filtering of coding and regulatory variants identified by whole genome sequencing of HLHS probands to identify candidate genes associated with susceptibility to this rare developmental phenotype. 75H proved to be a high-value family for generating a relatively short list of candidate genes for left-sided CHD. Given the rarity of both left-sided CHD and the RPS15A variant identified in the HLHS proband and his 5th degree relative, with a frequency consistent with a risk allele for an oligogenic disorder, we made the reasonable assumption that this was a bona fide genotype-phenotype association rather than a chance occurrence. Moreover, incomplete penetrance and variable expression is consistent with a genetically complex basis of disease whereby the shared variant is risk-conferring and acts in conjunction with additional genetic, epigenetic, and/or environmental factors that lead to a left-sided CHD phenotype. In sum, we do not claim these variants are definitively disease causing, but rather potentially contributing risk factors.

      (5) Information is lacking on how clustering of RP genes was demonstrated using STRING (with P-values that support the conclusions). What is meant by "when the highest stringency filter was applied"? Does this refer to the STRING interaction score or something else? The authors could also explain which genes were used to search STRING (e.g., all 292 candidate genes) and provide information on the STRING interaction score used in the analysis, the number of nodes and edges in the network.

      To determine whether certain gene networks were over-represented, two online bioinformatics tools were used. First, genes were inputted into STRING (Author response table 2 below) to investigate experimental and predicted protein-protein and genetic interactions. Clustering of ribosomal protein genes was demonstrated when applying the highest stringency filter. Next, genes were analyzed for potential enrichment of genes by ontology classification using PANTHER. Applying Fisher’s exact test and false discovery rate corrections, ribosomal proteins were the most enriched class when compared to the reference proteome, including data annotated by molecular function (4.84-fold, p=0.02), protein class (6.45-fold, p=0.00001), and cellular component (9.50-fold, p=0.001). A majority of the identified RP candidate genes harbored variants that fit a recessive inheritance disease model.
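
      For readers unfamiliar with this kind of over-representation analysis, the following is a minimal sketch of how a fold enrichment, a one-sided Fisher's exact p-value, and a Benjamini-Hochberg false discovery rate adjustment can be computed. It is not PANTHER's implementation: the one-sided test, the choice of Benjamini-Hochberg, the function names, and all counts in the usage lines are assumptions made for illustration only.

```python
from math import comb


def fold_enrichment(k: int, n: int, K: int, N: int) -> float:
    """Observed fraction in the gene list (k/n) over the reference fraction (K/N)."""
    return (k / n) / (K / N)


def fisher_one_sided(k: int, n: int, K: int, N: int) -> float:
    """Upper-tail Fisher's exact p-value, P(X >= k), from the hypergeometric law.

    N: genes in the reference set; K: reference genes in the class;
    n: genes in the candidate list; k: candidate genes in the class.
    """
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


def benjamini_hochberg(pvals: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts, NOT taken from the manuscript.
N, K, n, k = 1000, 50, 100, 15
print("fold enrichment:", fold_enrichment(k, n, K, N))
print("one-sided p-value:", fisher_one_sided(k, n, K, N))
print("BH-adjusted:", benjamini_hochberg([0.01, 0.04, 0.03]))
```

      The exact tail sum uses integer binomial coefficients, which avoids floating-point underflow for small lists; a dedicated statistics library would be a more practical choice for genome-scale reference sets.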

      Author response image 2.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      “The study analyzes the gastric fluid DNA content identified as a potential biomarker for human gastric cancer. However, the study lacks overall logicality, and several key issues require improvement and clarification. In the opinion of this reviewer, some major revisions are needed:” 

      (1) “This manuscript lacks a comparison of gastric cancer patients' stages with PN and N+PD patients, especially T0-T2 patients.”

      We are grateful for this astute remark. A comparison of gfDNA concentration among the diagnostic groups indicates a trend of increasing values as the diagnosis progresses toward malignancy. The observed values for the diagnostic groups are as follows:

      Author response table 1.

      The chart below presents the statistical analyses of the same diagnostic/tumor-stage groups (One-Way ANOVA followed by Tukey’s multiple comparison tests). It shows that gastric fluid gfDNA concentrations gradually increase with malignant progression. We observed that the initial tumor stages (T0 to T2) exhibit intermediate gfDNA levels, which are significantly lower than in advanced disease (p = 0.0036), but not statistically different from non-neoplastic disease (p = 0.74).

      Author response image 1.

      (2) “The comparison between gastric cancer stages seems only to reveal the difference between T3 patients and early-stage gastric cancer patients, which raises doubts about the authenticity of the previous differences between gastric cancer patients and normal patients, whether it is only due to the higher number of T3 patients.”

      We appreciate the attention to detail regarding the numbers analyzed in the manuscript. Importantly, the results are meaningful because the number of subjects in each group is comparable (T0-T2, N = 65; T3, N = 91; T4, N = 63). The mean gastric fluid gfDNA values (ng/µL) increase with disease stage (T0-T2: 15.12; T3-T4: 30.75), and both are higher than the mean gfDNA values observed in non-neoplastic disease (10.81 ng/µL for N+PD and 10.10 ng/µL for PN). These subject numbers in each diagnostic group accurately reflect real-world data from a tertiary cancer center.

      (3) “The prognosis evaluation is too simplistic, only considering staging factors, without taking into account other factors such as tumor pathology and the time from onset to tumor detection.”

      Histopathological analyses were performed throughout the study not only for the initial diagnosis of tissue biopsies, but also for the classification of Lauren’s subtypes, tumor staging, and the assessment of the presence and extent of immune cell infiltrates. Regarding the time of disease onset, this variable is inherently unknown--by definition--at the time of a diagnostic EGD. While the prognosis definition is indeed straightforward, we believe that a simple, cost-effective, and practical approach is advantageous for patients across diverse clinical settings and is more likely to be effectively integrated into routine EGD practice.

      (4) “The comparison between gfDNA and conventional pathological examination methods should be mentioned, reflecting advantages such as accuracy and patient comfort. “

      We wish to reinforce that EGD, along with conventional histopathology, remains the gold standard for gastric cancer evaluation. EGD under sedation is routinely performed for diagnosis, and the collection of gastric fluids for gfDNA evaluation does not affect patient comfort. Thus, while gfDNA analysis was evidently not intended as a diagnostic EGD and biopsy replacement, it may provide added prognostic value to this exam.

      (5) “There are many questions in the figures and tables. Please match the Title, Figure legends, Footnote, Alphabetic order, etc. “

      We are grateful for these comments and apologize for the clerical oversight. All figures, tables, titles and figure legends have now been double-checked.

      (6) “The overall logicality of the manuscript is not rigorous enough, with few discussion factors, and cannot represent the conclusions drawn. “

      We assume that the unusual wording remark regarding “overall logicality” pertains to the rationale and/or reasoning of this investigational study. Our working hypothesis was that during neoplastic disease progression, tumor cells continuously proliferate and, depending on various factors, attract immune cell infiltrates. Consequently, both tumor cells and immune cells (as well as tumor-derived DNA) are released into the fluids surrounding the tumor at its various locations, including blood, urine, saliva, gastric fluids, and others. Thus, increases in DNA levels within some of these fluids have been documented and are clinically meaningful. The concurrent observation of elevated gastric fluid gfDNA levels and immune cell infiltration supports the hypothesis that increased gfDNA—which may originate not only from tumor cells but also from immune cells—could be associated with better prognosis, as suggested by this study of a large real-world patient cohort.

      In summary, we thank Reviewer #1 for his time and effort in a constructive critique of our work.

      Reviewer #2 (Public review):

      Summary: 

      “The authors investigated whether the total DNA concentration in gastric fluid (gfDNA), collected via routine esophagogastroduodenoscopy (EGD), could serve as a diagnostic and prognostic biomarker for gastric cancer. In a large patient cohort (initial n=1,056; analyzed n=941), they found that gfDNA levels were significantly higher in gastric cancer patients compared to non-cancer, gastritis, and precancerous lesion groups. Unexpectedly, higher gfDNA concentrations were also significantly associated with better survival prognosis and positively correlated with immune cell infiltration. The authors proposed that gfDNA may reflect both tumor burden and immune activity, potentially serving as a cost-effective and convenient liquid biopsy tool to assist in gastric cancer diagnosis, staging, and follow-up.”

      Strengths: 

      “This study is supported by a robust sample size (n=941) with clear patient classification, enabling reliable statistical analysis. It employs a simple, low-threshold method for measuring total gfDNA, making it suitable for large-scale clinical use. Clinical confounders, including age, sex, BMI, gastric fluid pH, and PPI use, were systematically controlled. The findings demonstrate both diagnostic and prognostic value of gfDNA, as its concentration can help distinguish gastric cancer patients and correlates with tumor progression and survival. Additionally, preliminary mechanistic data reveal a significant association between elevated gfDNA levels and increased immune cell infiltration in tumors (p=0.001).”

      Reviewer #2 has conceptually grasped the overall rationale of the study quite well, and we are grateful for their assessment and comprehensive summary of our findings.

      Weaknesses: 

      (1) “The study has several notable weaknesses. The association between high gfDNA levels and better survival contradicts conventional expectations and raises concerns about the biological interpretation of the findings.“

      We agree that this would be the case if the gfDNA were derived solely from tumor cells. However, the findings presented here suggest that a fraction of this DNA may indeed be derived from infiltrating immune cells. The precise origin of this increased gfDNA remains to be determined in follow-up studies, which are planned to apply DNA- and RNA-sequencing methodologies and deconvolution analyses.

      (2) “The diagnostic performance of gfDNA alone was only moderate, and the study did not explore potential improvements through combination with established biomarkers. Methodological limitations include a lack of control for pre-analytical variables, the absence of longitudinal data, and imbalanced group sizes, which may affect the robustness and generalizability of the results.“

      Reviewer #2 is correct that this investigational study was not designed to assess the diagnostic potential of gfDNA. Instead, its primary contribution is to provide useful prognostic information. In this regard, we have not yet explored combining gfDNA with other clinically well-established diagnostic biomarkers. We do acknowledge this current limitation as a logical follow-up that must be investigated in the near future.

      Moreover, we collected a substantial number of pre-analytical variables within the limitations of a study involving over 1,000 subjects. Longitudinal samples and data were not analyzed here, as our aim was to evaluate prognostic value at diagnosis. Although the groups are imbalanced, this accurately reflects the real-world population of a large endoscopy center within a dedicated cancer facility. Subjects were invited to participate and enter the study before sedation for the diagnostic EGD procedure; thus, samples were collected prospectively from all consenting individuals.

      Finally, to maintain a large, unbiased cohort, we did not attempt to balance the groups, allowing analysis of samples and data from all patients with compatible diagnoses (please see Results: Patient groups and diagnoses).

      (3) “Additionally, key methodological details were insufficiently reported, and the ROC analysis lacked comprehensive performance metrics, limiting the study's clinical applicability.“

      We are grateful for this useful suggestion. In the current version, each ROC curve (Supplementary Figures 1A and 1B) now includes the top 10 gfDNA thresholds, along with their corresponding sensitivity and specificity values (please see Suppl. Table 1). The thresholds are ordered from best to worst based on the classic Youden’s J statistic, as follows:

      Youden Index = specificity + sensitivity – 1 [Youden WJ. Index for rating diagnostic tests. Cancer 3:32-35, 1950. PMID: 15405679]. We have made an effort to provide all the key methodological details requested, but we would be glad to add further information upon specific request.
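
      The ranking described above can be sketched as follows. This is a minimal illustration only: the function names and the threshold, sensitivity, and specificity values are hypothetical placeholders, not values taken from Suppl. Table 1.

```python
def youden_j(sensitivity, specificity):
    """Youden's J statistic: sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


def rank_thresholds(candidates):
    """Order (threshold, sensitivity, specificity) tuples from best to worst J."""
    return sorted(candidates, key=lambda c: youden_j(c[1], c[2]), reverse=True)


# Hypothetical candidates: (gfDNA threshold in ng/uL, sensitivity, specificity).
# These numbers are invented for illustration and are NOT study results.
candidates = [
    (5.0, 0.90, 0.30),
    (10.0, 0.70, 0.60),
    (20.0, 0.40, 0.85),
]

ranked = rank_thresholds(candidates)
best_threshold = ranked[0][0]
```

      Ranking by J weights sensitivity and specificity equally; a clinical application might prefer a different trade-off.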

      Reviewer #1 (Recommendations for the authors):

      The authors should pay attention to ensuring uniformity in the format of all cited references, such as the number of authors for each reference, the journal names, publication years, volume numbers, and page number formats, to the best extent possible. 

      Thank you for pointing out this inconsistency. All cited references have now been revisited and adjusted properly. We apologize for this clerical oversight.

      Reviewer #2 (Recommendations for the authors):

      (1) “High gfDNA levels were surprisingly linked to better survival, which conflicts with the conventional understanding of cfDNA as a tumor burden marker. Was any qualitative analysis performed to distinguish DNA derived from immune cells versus tumor cells?“

      Tumor-derived DNA is certainly present in gfDNA, as our group has unequivocally demonstrated in a previous publication [Pizzi M. P., et al. (2019) Identification of DNA mutations in gastric washes from gastric adenocarcinoma patients: Possible implications for liquid biopsies and patient follow-up Int J Cancer 145:1090–1097. DOI: 10.1002/ijc.32114]. However, in the present manuscript, our data suggest that gfDNA may also contain DNA derived from infiltrating immune cells. This may also be the case for other malignancies, and qualitative deconvolution studies could provide further insight. To achieve this, DNA sequencing and RNA-Seq analyses may offer relevant evidence. Our study should be viewed as an original and preliminary analysis that may encourage such quantitative and qualitative studies in biofluids from cancer patients. Currently, this is a simple approach (which might be its essential beauty), but we hope to investigate this aspect further in future studies.

      (2) “The ROC curve AUC was 0.66, indicating only moderate discrimination ability. Did the authors consider combining gfDNA with markers such as CEA or CA19-9 to improve diagnostic accuracy?“

      This is indeed a logical idea, which shall certainly be explored in planned follow-up studies.

      (3) “DNA concentration could be influenced by non-biological factors, including gastric fluid pH, sampling location, time delay, or freeze-thaw cycles. Were these operational variables assessed for their effect on data stability?“

      We appreciate the rigor of the evaluation. Yes, information regarding gastric fluid pH was collected. All samples were collected from the stomach during the EGD procedure. Samples were divided into aliquots and were thawed only once. This information is now provided in the updated manuscript text.

      (4) “This cross-sectional study lacks data on gfDNA changes over time, limiting conclusions on its utility for monitoring treatment response or predicting recurrence.“

      Again, temporal evaluation is another excellent point, and it will be the subject of future analyses. In this exploratory study, samples were collected at diagnosis, at a single time point. We have not obtained serial samples, as participants received appropriate therapy soon after diagnosis.

      (5) “The normal endoscopy group included only 10 patients, the precancerous lesion group 99 patients, while the gastritis group had 596 patients. Such uneven sample sizes may affect statistical reliability and generalizability. Has weighted analysis or optimized sampling been considered for future studies?“

      Yes, in future studies this analysis will be considered, probably by employing stratified random sampling with relevant patient attributes recorded.

      (6) “The SciScore was only 2 points, indicating that key methodological details such as inclusion/exclusion criteria, randomization, sex variables, and power calculation were not clearly described. It is recommended that these basic research elements be supplemented in the Methods section. “

      This was exploratory research, the first of its kind, to evaluate the prognostic potential of gfDNA in the context of gastric cancer. Patients were not included if they did not sign the informed consent form, and were excluded if they withdrew after consenting. Other exclusion criteria included previous gastrectomy or esophagectomy, or the presence of non-gastric malignancies. Randomization and power analyses were not applicable, as no prior data were available regarding gfDNA concentration values or its diagnostic/prognostic potential. All subjects, regardless of sex, were invited to participate without discrimination or selection.

      (7) “Although a ROC curve was provided in the supplementary materials (Supplementary Figure 1), only the curve and AUC value were shown without sensitivity, specificity, predictive values, or cutoff thresholds. The authors are advised to provide a full ROC performance assessment to strengthen the study's clinical relevance.”

      These data are now given alongside the ROC curves in the Supplementary Information section, specifically in Supplementary Figure 1 and in the newly added Supplementary Table 1.

      We thank Reviewer #2 for an insightful and positive overall assessment of our work.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      This manuscript reports a dual-task experiment intended to test whether language prediction relies on executive resources, using surprisal-based measures of predictability and an n-back task to manipulate cognitive load. While the study addresses a question under debate, the current design and modeling framework fall short of supporting the central claims. Key components of cognitive load, such as task switching, word prediction vs integration, are not adequately modeled. Moreover, the weak consistency in replication undermines the robustness of the reported findings. Below unpacks each point. 

      Cognitive load is a broad term. In the present study, it can be at least decomposed into the following components: 

      (1)  Working memory (WM) load: news, color, and rank. 

      (2)  Task switching load: domain of attention (color vs semantics), sensorimotor rules (c/m vs space).

      (3)  Word comprehension load (hypothesized against): prediction, integration. 

      The components of task switching load should be directly included in the statistical models. Switching of sensorimotor rules may be captured by the "n-back reaction" (binary) predictor. However, the switching of attended domains and the interaction between domain switching and rule complexity (1-back or 2-back) were not included. The attention control experiment (1) avoided useful statistical variation from the Read Only task, and (2) did not address interactions. More fundamentally, task-switching components should be directly modeled in both performance and full RT models to minimize selection bias. This principle also applies to other confounding factors, such as education level. While missing these important predictors, the current models have an abundance of predictors that are not so well motivated (see later comments). In sum, with the current models, one cannot determine whether the reduced performance or prolonged RT was due to affecting word prediction load (if it exists) or merely affecting the task switching load. 

      The entropy and surprisal need to be more clearly interpreted and modeled in the context of the word comprehension process. The entropy concerns the "prediction" part of the word comprehension (before seeing the next word), whereas surprisal concerns the "integration" part as a posterior. This interpretation is similar to the authors writing in the Introduction that "Graded language predictions necessitate the active generation of hypotheses on upcoming words as well as the integration of prediction errors to inform future predictions [1,5]." However, the Results of this study largely ignored entropy (treating it as a fixed effect) and only focus on surprisal without clear justification. 

      In Table S3, with original and replicated model fitting results, the only consistent interaction is surprisal x age x cognitive load [2-back vs. Reading Only]. None of the two-way interactions can be replicated. This is puzzling and undermines the robustness of the main claims of this paper. 

      Reviewer #2 (Public review):

      Summary

      This paper considers the effects of cognitive load (using an n-back task related to font color), predictability, and age on reading times in two experiments. There were main effects of all predictors, but more interesting effects of load and age on predictability. The effect of load is very interesting, but the manipulation of age is problematic, because we don't know what is predictable for different participants (in relation to their age). There are some theoretical concerns about prediction and predictability, and a need to address literature (reading time, visual world, ERP studies). 

      Strengths/weaknesses 

      It is important to be clear that predictability is not the same as prediction. A predictable word is processed faster than an unpredictable word (something that has been known since the 1970/80s), e.g., Rayner, Schwanenfluegel, etc. But this could be due to ease of integration. I think this issue can probably be dealt with by careful writing (see point on line 18 below). To be clear, I do not believe that the effects reported here are due to integration alone (i.e., that nothing happens before the target word), but the evidence for this claim must come from actual demonstrations of prediction. 

      The effect of load on the effects of predictability is very interesting (and also, I note that the fairly novel way of assessing load is itself valuable). Assuming that the experiments do measure prediction, it suggests that they are not cost-free, as is sometimes assumed. I think the researchers need to look closely at the visual world literature, most particularly the work of Huettig. (There is an isolated reference to Ito et al., but this is one of a large and highly relevant set of papers.) 

      There is a major concern about the effects of age. See the Results (161-5): this depends on what is meant by word predictability. It's correct if it means the predictability in the corpus. But it may or may not be correct if it refers to how predictable a word is to an individual participant. The texts are unlikely to be equally predictable to different participants, and in particular to younger vs. older participants, because of their different experiences. To put it informally, the newspaper articles may be more geared to the expectations of younger people. But there is also another problem: the LLM may have learned on the basis of language that has largely been produced by young people, and so its predictions are based on what young people are likely to say. Both of these possibilities strike me as extremely likely. So it may be that older adults are affected more by words that they find surprising, but it is also possible that the texts are not what they expect, or the LLM predictions from the text are not the ones that they would make. In sum, I am not convinced that the authors can say anything about the effects of age unless they can determine what is predictable for different ages of participants. I suspect that this failure to control is an endemic problem in the literature on aging and language processing and needs to be systematically addressed. 

      Overall, I think the paper makes enough of a contribution with respect to load to be useful to the literature. But for discussion of age, we would need something like evidence of how younger and older adults would complete these texts (on a word-by-word basis) and that they were equally predictable for different ages. I assume there are ways to get LLMs to emulate different participant groups, but I doubt that we could be confident about their accuracy without a lot of testing. But without something like this, I think making claims about age would be quite misleading. 

      We thank both reviewers for their constructive feedback and for highlighting areas where our theoretical framing and analyses could be clarified and strengthened. We have carefully considered each of the points raised and made substantial additions and revisions.

      As a summary, we have directly addressed the concerns raised by the reviewers by incorporating task-switching predictors into the statistical models, paralleling our focus on surprisal with a full analysis and interpretation of entropy, clarifying the robustness (and limitations) of the replicated findings, and addressing potential limitations in our Discussion.

      We believe these revisions substantially strengthen the manuscript and improve the reading flow, while also clarifying the scope of our conclusions. We will now illustrate these changes in more detail:

      (1) Cognitive load and task-switching components.

      We agree that cognitive load is a multifaceted construct, particularly since our secondary task broadly targets executive functioning. In response to Reviewer 1, we therefore examined task-switching demands more closely by adding the interaction term n-back reaction × cognitive load to a model restricted to 1-back and 2-back Dual Task blocks (as there were no n-back reactions in the Reading Only condition). This analysis showed significantly longer reading times in the 2-back than in the 1-back condition, both for trials with and without an n-back reaction. Interestingly, the difference between reaction and no-reaction trials was smaller in the 2-back condition (β = -0.132, t(188066.09) = -34.269, p < 0.001), which may simply reflect the general increase in reading time for all trials so that the effect of the button press time decreases in comparison to the 1-back. In that sense, these findings are not unexpected and largely mirror the main effect of cognitive load. Crucially, however, the three-way interaction of cognitive load, age, and surprisal remained robust (β = 0.00004, t(188198.86) = 3.540, p < 0.001), indicating that our effects cannot be explained by differences in task-switching costs across load conditions. To maintain a streamlined presentation, we opted not to include this supplementary analysis in the manuscript.

      (2) Entropy analyses.

      Reviewer 1 pointed out that our initial manuscript placed more emphasis on surprisal. In the revised manuscript, we now report a full set of entropy analyses in the supplementary material. In brief, these analyses show that participants generally benefit from lower entropy across cognitive load conditions, with one notable exception: young adults in the Reading Only condition, where higher entropy was associated with faster reading times. We have added these results to the manuscript to provide a more complete picture of the prediction versus integration distinction highlighted in the review (see sections “Control Analysis: Disentangling the Effect of Cognitive Load on Pre- and Post-Stimulus Predictive Processing” in the Methods and “Disentangling the Effect of Cognitive Load on Pre- and Post-Stimulus Predictive Processing” in the Results).

      (3) Replication consistency.

      Reviewer 1 noted that the results of the replication analysis were somewhat puzzling. We take this point seriously and agree that the original model was likely underpowered to detect the effect of interest. To address this, we excluded the higher-level three-way interaction of age, cognitive load, and surprisal, focusing instead on the primary effect examined in this paper: the modulatory influence of cognitive load on surprisal. Using this approach, we observed highly consistent results between the original online subsample and the online replication sample.

      (4) Potential age bias in GPT-2.  

      We thank Reviewer 2 for their thoughtful and constructive feedback and agree that a potential age bias in GPT-2’s next-token predictions warrants caution. We thus added a section in the Discussion explicitly considering this limitation, and explain why it should not affect the implications of our study.

      Reviewer #1 (Recommendations for the authors):

      The d-prime model operates at the block level. How many observation goes into the fitting (about 175*8=1050)? How can the degrees of freedom of a certain variable go up to 188435? 

      We thank the reviewer for spotting this issue. Indeed, there was an error in our initial calculations, which we have now corrected in the manuscript. Importantly, the correction does not meaningfully affect the results for the analysis of d-primes or the conclusions of the study (see line 102).  

      “A linear mixed-effects model revealed n-back performance declined with cognitive load (β = -1.636, t(173.13) = -26.120, p < 0.001), with more pronounced effects with advancing age (β = -0.014, t(169.77) = -3.931, p < 0.001; Fig. 3b, Table S1)”.

      Consider spelling out all the "simple coding schemes" explicitly. 

      We thank the reviewer for this helpful suggestion. In the revised manuscript, we have now included the modelled contrasts in brackets after each predictor variable.

      “Example from line 527: In both models, we included recording location (online vs. lab), cognitive load (1-back and 2-back Dual Task vs. Reading Only as the reference level) and continuously measured age (centred) as well as the interaction of age and cognitive load as fixed effects”.

      The relationship between comprehension accuracy and strategies for color judgement is unclear or not intuitive. 

      We thank the reviewer for this helpful comment. The n-back task, which required participants to judge colours, was administered at the single-trial level, with colours pseudorandomised to prevent any specific colour - or sequence of colours - from occurring more frequently than others. In contrast, comprehension questions were presented at the end of each block, meaning that trial-level stimulus colour was unrelated to accuracy on the block-level comprehension questions. However, we agree that this distinction may not have been entirely clear, and we have now added a brief clarification in the Methods section to address this point (see line 534):  

      “Please note that we did not control for trial-level stimulus colour here. The n-back task, which required participants to judge colours, was administered at the single-trial level, with colours pseudorandomised to prevent any specific colour - or sequence of colours - from occurring more frequently than others. In contrast, comprehension questions were presented at the end of each block, meaning that trial-level stimulus colour was unrelated to accuracy on the block-level comprehension questions”.

      Could you explain why comprehension accuracy is not modeled in the same way as d-prime, i.e., with a similar set of predictors? 

      This is a very good point. After each block, participants answered three comprehension questions that were intentionally designed to be easy: they could all be answered correctly after having read the corresponding text, but not by common knowledge alone. The purpose of these questions was primarily to ensure participants paid attention to the texts and to allow exclusion of participants who failed to understand the material even under minimal cognitive load. As comprehension accuracy was modelled at the block level with 3 questions per block, participants could achieve only discrete scores of 0%, 33.3%, 66.7%, or 100%. Most participants showed uniformly high accuracy across blocks, as expected if the comprehension task fulfilled its purpose. However, this limited variance in performance caused convergence issues when fitting a comprehension-accuracy model at the same level of complexity as the d′ model. To model comprehension accuracy nonetheless, we therefore opted for a reduced model complexity in this analysis.

      RT of previous word: The motivations described in the Methods, such as post-error-slowing and sequential modulation effects, lack supporting evidence. The actual scope of what this variable may account for is unclear.  

      We are happy to elaborate further regarding the inclusion of this predictor. Reading times, like many sequential behavioral measures, exhibit strong autocorrelation (Schuckart et al., 2025, doi: 10.1101/2025.08.19.670092). That is, the reading time of a given word is partially predictable from the reading time of the previous word(s). Such spillover effects can confound attempts to isolate trial-specific cognitive processes. As our primary goal was to model single-word prediction, we explicitly accounted for this autocorrelation by including the log reading time of the preceding trial as a covariate. This approach removes variance attributable to prior behavior, ensuring that the estimated effects reflect the influence of surprisal and cognitive load on the current word, rather than residual effects of preceding trials. We now added this explanation to the manuscript (see line 553):

      “Additionally, it is important to consider that reading times, like many sequential behavioural measures, exhibit strong autocorrelation (Schuckart et al., 2025), meaning that the reading time of a given word is partially predictable from the reading time of the previous word. Such spillover effects can confound attempts to isolate trial-specific cognitive processes. As our primary goal was to model single-word prediction, we explicitly accounted for this autocorrelation by including the reading time of the preceding trial as a covariate”.  
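
      The lagged covariate described above can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the field names (`participant`, `block`, `position`, `rt_ms`) are hypothetical, and the choice to leave the first word of each block without a lag value (rather than carrying it over from the previous block) is an assumption.

```python
import math
from collections import defaultdict


def add_previous_log_rt(trials):
    """Return copies of the trial dicts with `log_rt` and `prev_log_rt` added.

    `trials` is a list of dicts with the hypothetical keys
    participant, block, position (word index within block), and rt_ms.
    The lag is computed within each participant and block, so the first
    word of a block gets prev_log_rt = None.
    """
    groups = defaultdict(list)
    for trial in trials:
        groups[(trial["participant"], trial["block"])].append(trial)

    result = []
    for rows in groups.values():
        previous = None
        for row in sorted(rows, key=lambda r: r["position"]):
            enriched = dict(row)
            enriched["log_rt"] = math.log(row["rt_ms"])
            enriched["prev_log_rt"] = previous
            previous = enriched["log_rt"]
            result.append(enriched)
    return result
```

      Rows with `prev_log_rt = None` would need to be dropped or handled explicitly before model fitting.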

      Block-level d-prime: It was shown with the d-prime performance model that block-level d-prime is a function of many of the reading-related variables. Therefore, it is not justified to use them here as "a proxy of each participant's working memory capacity."

      We thank the reviewer for their comment. We would like to clarify that the d-prime performance model indeed included only dual-task d-primes (i.e., d-primes obtained while participants were simultaneously performing the reading task). In contrast, the predictor in question is based on single-task d-primes, which are derived from the n-back task performed in isolation. While dual- and single-task d-primes may be correlated, they capture different sources of variance, justifying the use of single-task d-primes here as a measure of each participant’s working memory capacity.
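
      For readers unfamiliar with the measure, d-prime is conventionally computed as the difference between the z-transformed hit rate and false-alarm rate. The sketch below shows that standard definition only; the log-linear correction for extreme rates and the function name are assumptions, as the exact computation used in the study is not described here.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    A log-linear correction (add 0.5 to each count, 1 to each total) keeps
    the rates strictly between 0 and 1 so the z-transform stays finite.
    This correction is an assumption, not taken from the manuscript.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)
```

      Under this definition, a block in which targets and non-targets elicit responses at the same rate yields a d-prime of zero.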

      Word frequency is entangled with entropy and surprisal. Suggest removal.

      We appreciate the reviewer’s comment. While word frequency is correlated with word surprisal, its inclusion does not affect the interpretation of the other predictors and does not introduce any bias. Moreover, it is a theoretically important control variable in reading research. Since we are interested in the effects of surprisal and entropy beyond potential biases through word length and frequency, we believe these are important control variables in our model. Checks for collinearity also confirmed that word frequency was strongly correlated with neither surprisal nor entropy. In this sense, including it is largely pro forma: it neither harms the model nor materially changes the results, but it ensures that the analysis appropriately accounts for a well-established influence on word processing.

      Entropy reflects the cognitive load of word prediction. It should be investigated in parallel and with similar depth as surprisal (which reflects the load of integration).

      This is an excellent point that warrants further investigation, especially since the previous literature on the effects of entropy on reading time is scarce and somewhat contradictory. We have thus added additional analyses and now report the effects of cognitive load, entropy, and age on reading time (see sections “Disentangling the Effect of Cognitive Load on Pre- and Post-Stimulus Predictive Processing” in the Results, “Control Analysis: Disentangling the Effect of Cognitive Load on Pre- and Post-Stimulus Predictive Processing” in the Methods as well as Fig. S7 and Table S6 in the Supplements for full results). In brief, we observe a significant three-way interaction among age, cognitive load, and entropy. Specifically, while all participants benefit from low entropy under high cognitive load, reflected by shorter reading times, in the baseline condition this benefit is observed only in older adults. Interestingly, in the baseline condition with minimal cognitive load, younger adults even show a benefit from high entropy. Thus, although the overall pattern for entropy partly mirrors that for surprisal – older adults showing increased reading times when word entropy is high and generally greater sensitivity to entropy variations – the effects differ in one important respect. Unlike for surprisal, the detrimental impact of increased word entropy is more pronounced under high cognitive load across all participants.
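
      The distinction between the two predictors can be made concrete with a short sketch: entropy is computed from the full next-word distribution before the word appears, whereas surprisal is computed from the probability of the word that actually appeared. The toy distribution, the function names, and the use of base-2 logarithms (bits) are illustrative assumptions and do not reproduce the GPT-2 pipeline.

```python
import math


def surprisal_bits(probability):
    """Surprisal of the observed word: -log2 p(word | context)."""
    return -math.log2(probability)


def entropy_bits(distribution):
    """Shannon entropy of the next-word distribution, in bits."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


# Toy next-word distribution for some context (invented for illustration).
next_word = {"dog": 0.5, "cat": 0.25, "bird": 0.25}

uncertainty_before_word = entropy_bits(next_word)        # 1.5 bits
cost_of_observed_word = surprisal_bits(next_word["cat"])  # 2.0 bits
```

      Entropy is the expected surprisal over the distribution, so a context can have low entropy and still produce a highly surprising word when an unlikely continuation occurs.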

      Reviewer #2 (Recommendations for the authors):

      I agree in relation to prediction/load, but I am concerned (actually very concerned) that prediction needs to be assessed with respect to age. I suspect this is one reason why there is so much inconsistency in the effects of age in prediction and, indeed, comprehension more generally. I think the authors should either deal with it appropriately or drop it from the manuscript.

      Thank you for raising this important concern. It is true that prediction is a highly individual, complex process, as it depends upon the experiences a person has had with language over their lifespan. As such, one-size-fits-all approaches are not sufficient to model predictive processing. In our study, we thus took particular care to ensure that our analyses captured both age-related and other interindividual variability in predictive processing.

      First, in our statistical models, we included age not only as a nuisance regressor, but also assessed age-related effects in the interplay of surprisal and cognitive load. By doing so, we explicitly model potential age-related differences in how individuals of different ages predict language under different levels of cognitive load.

      Second, we hypothesised that predictive processing might also be influenced by a range of interindividual factors beyond age, including language exposure, cognitive ability, and more transient states such as fatigue. To capture such variability, all models included by-subject random intercepts and slopes, ensuring that unmodelled individual differences were statistically accommodated.

      Together, these steps allow us to account for both systematic age-related differences and residual individual variability in predictive processing. We are therefore confident that our findings are not confounded by unmodelled age-related variability.

      Line 18, do not confuse prediction (or pre-activation) with predictability. Predictability effects can be due to integration difficulty. See Pickering and Gambi 2018 for discussion. The discussion then focuses on graded parallel predictions, but there is also a literature concerned with the prediction of one word, typically using the "visual world" paradigm (which is barely cited - Reference 60 is an exception). In the next paragraph, I would recommend discussing the N400 literature (particularly Federmeier). There are a number of reading time studies that investigate whether there is a cost to a disconfirmed prediction - often finding no cost (e.g., Frisson, 2017, JML), though there is some controversy and apparent differences between ERP and eye-tracking studies (e.g., Staub). This literature should be addressed. In general, I appreciate the value of a short introduction, but it does seem too focused on neuroscience rather than the very long tradition of behavioural work on prediction and predictability.

      We thank the reviewer for this suggestion. In the revised manuscript, we have clarified the relevant section of the introduction to avoid confusion between predictability and predictive processing, thereby improving conceptual clarity (see line 16).

      “Instead, linguistic features are thought to be pre-activated broadly rather than following an all-or-nothing principle, as there is evidence for predictive processing even for moderately- or low-restraint contexts (Boston et al., 2008; Roland et al., 2012; Schmitt et al., 2021; Smith & Levy, 2013)”.  

      We also appreciate the reviewer’s comment regarding the introduction. While our study is behavioural, we frame it in a neuroscience context because our findings have direct implications for understanding neural mechanisms of predictive processing and cognitive load. We believe that this framing is important for situating our results within the broader literature and highlighting their relevance for future neuroscience research.

      I don't think a two-word context is enough to get good indicators of predictability. Obviously, almost anything can follow "in the", but the larger context about parrots presumably gives a lot more information. This seems to me to be a serious concern - or am I misinterpreting what was done?

      This is a very important point and we thank the reviewer for raising it. Our goal was to generate word surprisal scores that closely approximate human language predictions. In the manuscript, we report analyses using a 2-word context window, following recommendations by Kuribayashi et al. (2022).
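      For readers less familiar with the two measures, the following minimal sketch shows how surprisal and entropy follow from a next-word probability distribution. The toy distribution, the function names, and the use of base-2 logarithms are assumptions made for illustration only; in the study, the probabilities were estimated with GPT-2 from the two preceding words.

```python
import math

def surprisal(prob_of_actual_word: float) -> float:
    """Surprisal of the word that actually occurred: -log2 p(word | context)."""
    return -math.log2(prob_of_actual_word)

def entropy(next_word_probs: dict) -> float:
    """Shannon entropy (in bits) of the predicted next-word distribution."""
    return -sum(p * math.log2(p) for p in next_word_probs.values() if p > 0)

# Hypothetical next-word distribution after a two-word context such as "in the".
toy_distribution = {"garden": 0.5, "house": 0.25, "cage": 0.125, "sky": 0.125}

print(surprisal(toy_distribution["cage"]))  # surprisal of one possible continuation
print(entropy(toy_distribution))            # uncertainty before the word is seen
```

      Surprisal is a property of the word that was actually read, whereas entropy describes the uncertainty of the prediction before the word appears, which is why the two can dissociate.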

      To evaluate the impact of context length, we also tested longer windows of up to 60 words (not reported). While previous work (Goldstein et al., 2022) shows that GPT-2 predictions can become more human-like with longer context windows, we found that in our stimuli – short newspaper articles of only 300 words – surprisal scores from longer contexts were highly correlated with the 2-word context, and the overall pattern of results remained unchanged. To illustrate, surprisal scores generated with a 10-word context window and surprisal scores generated with the 2-word context window we used in our analyses correlated with Spearman’s ρ = 0.976.

      Additionally, on a more technical note, using longer context windows reduces the number of analysable trials, since surprisal cannot be computed for the first k words of a text with a k-word context window (e.g., a 50-word context would exclude ~17% of the data).  
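      The arithmetic behind this trade-off can be sketched as follows, assuming texts of 300 words (the article length mentioned above) and that exactly the first k words of each text are dropped. The function name is hypothetical.

```python
def excluded_fraction(text_length: int, context_size: int) -> float:
    """Share of words without a surprisal score when the first k words lack a full context."""
    if text_length <= 0:
        raise ValueError("text_length must be positive")
    return min(context_size, text_length) / text_length

# Compare the 2-word window with longer alternatives for a 300-word text.
for k in (2, 10, 50):
    print(f"context of {k} words: {excluded_fraction(300, k):.1%} of trials excluded")
```

      Under these assumptions, a 50-word window removes 50 of 300 words, which corresponds to the roughly 17% quoted above.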

      Importantly, while a short 2-word context window may introduce additional noise in the surprisal estimates, this would only bias effects toward zero, making our analyses conservative rather than inflating them. Critically, the observed effects remain robust despite this conservative estimate, supporting the validity of our findings.

      However, we agree that this is a particularly important and sensitive point, and have now added a discussion of it to the manuscript (see line 476).

      “Entropy and surprisal scores were estimated using a two-word context window. While short contexts have been shown to enhance GPT-2’s psychometric alignment with human predictions, making next-word predictions more human-like (Kuribayashi et al., 2022), other work suggests that longer contexts can also increase model–human similarity (Goldstein et al., 2022). To reconcile these findings in our stimuli and guide the choice of context length, we tested longer windows and found surprisal scores were highly correlated with the 2-word context (e.g., 10-word vs. 2-word context: Spearman’s ρ = 0.976), with the overall pattern of results unchanged. Additionally, employing longer context windows would have also reduced the number of analysable trials, since surprisal cannot be computed for the first k words of a text with a k-word context window. Crucially, any additional noise introduced by the short context biases effect estimates toward zero, making our analyses conservative rather than inflating them”.

      Line 92, task performance, are there interactions? Interactions would fit with the experimental hypotheses. 

      Yes, we did include an interaction term of age and cognitive load and found significant effects on n-back task performance (d-primes; b = -0.014, t(169.8) = -3.913, p < 0.001), but not on comprehension question accuracy (see Table S1 and Fig. S2 in the supplementary material).

      Line 149, what were these values?

      We found surprisal values ranged between 3.56 and 72.19. We added this information in the manuscript (see line 143).

    1. Author response:

      The following is the authors’ response to the current reviews.

      We thank the reviewers for their comments on the initial submission, which helped us improve and extend the paper. We would like to respond specifically to reviewer #1.

      We disagree with the broad criticism of this study as being “almost entirely observational” and lacking “detailed molecular investigation”. We report structures and binding data, show mechanistic detail, identify critical residues and structural features underlying biological activity, and present biologically meaningful data demonstrating a role of the interaction of the M3 protein with collagens. We disagree that insufficient details or controls are included. We agree that our report has limitations, such as an understanding of potential emm1 strain binding to collagen, which might play a role in host tissue colonization, but not in biofilm.

      In response to issues raised in the initial review, we conducted several new experiments for the revised manuscript. We believe these strengthen what we report. Firstly, as the reviewer suggested, we conducted a binding experiment where the tertiary fold of M3-NTD was disrupted to confirm the T-shaped fold is indeed required for binding to collagen, as might be expected based on the crystal structure of the complex. To achieve this, we did not, as the reviewer states, use denatured protein in the ITC binding experiment. Instead, we used a monomeric form of M3-NTD, which does not adopt a well-defined tertiary structure, but retains all residues in the context of alpha helices. Secondly, we added more evidence for the importance of structural features (amino acid side chains defining the collagen binding site) by analysing the role of Trp103. Together, we provide clear evidence for the specific role of the T-shaped fold of M3-NTD for collagen binding.

      Responding to a constructive criticism by reviewer #1, we characterised M3-NTD mutants to demonstrate conservation of overall structure. NMR is an exquisite tool for this as it is highly sensitive to structural changes. It is not clear why the reviewer suggested we should have measured the stability of the proteins, which is irrelevant here. What matters is that the fold is conserved between mutated variants at the chosen experimental temperature (now added to the Methods section), which NMR demonstrates.

      We added errors for the ITC-derived dissociation constants.

      In the submitted versions of the paper we did not include the negative control requested by reviewer #1 for experiments shown in Figure 10 - figure supplement 1B. In our view this does not add information supporting our findings. However, we have now added two negative controls, staining of emm1 and emm28 strains. As expected, no reactivity was found with the type-specific M3 HVR antiserum while the M3 BCW antiserum showed weak reactivity, in line with some sequence similarity of the C-terminal regions of M proteins.

      Table 2 contains essential information, in line with what generally is shown in crystallographic tables in this journal. All other information can be found in the depositions of our data at the PDB. The structures have been scrutinised and checked by the PDB and passed all quality tests.

      We stated how many times experiments were done where appropriate. We now added this information for CLC assays (as given in the previously published protocol, refs. 45, 47). ITC was carried out more than once for optimization but the results of single experiments are shown (as is common practice).


      The following is the authors’ response to the original reviews.

      Many thanks for assessing our submission. We are grateful for the reviews that have informed a revised version of the paper, which includes additional data and modified text to take into account the reviewers’ comments. 

      We addressed the major limitation identified by Reviewer #1 by including data to demonstrate that collagen binding is indeed dependent on the T-shaped fold (major issue 1). Reviewer #1 suggested this needs to be done through extensive mutational work. This in our view was neither feasible nor necessary. Instead, we used ITC to measure collagen peptide binding using a monomeric form of M3, which preserves all residues including the ones involved in binding, but cannot form the T-shaped structure. This achieves the same as unravelling the T fold through mutations, but without the risk of affecting binding through altering residues that are involved in both binding and definition of the T fold. The experiment shows a very weak interaction, confirming the fold of the M3-NTD is required for binding activity.

      Reviewer #1 finds the study limited for being “almost entirely observational”. Structural biology is by its nature observational, which is not a limitation but the very purpose of this approach. Our study goes beyond observing structures. In the first version of our paper, we identified a critical residue within a previously mapped binding site, and demonstrated through mutagenesis a causal link between presence of this residue on a tertiary fold and collagen binding activity. However, we agree this analysis could have been strengthened by additional mutagenesis, which we carried out and describe in the revised manuscript. This identifies a second residue that is critical for collagen binding. We firmed up these mutational experiments with a characterisation of mutated forms of M3 by NMR spectroscopy to confirm that these mutations did not affect the overall fold, addressing major issue no. 2 of reviewer #1. We further demonstrate that the interaction between M3 and collagen is the cause of greatly enhanced biofilm formation as observed in patient biopsies and a tissue model of infection. We show that other streptococci that do not possess a surface protein presenting collagen binding sites like M3 do not form collagen-dependent biofilm. We therefore do not think that criticising our study for being almost entirely observational is valid.

      Major issue 3:

      We agree with the reviewer that it would be useful to carry out experiments with k.o. and complemented strains. Such experiments go beyond the scope of our study, but might be carried out by us or others in the future. We disagree that emm1 is used “as a negative”. Instead, we established that, in contrast to emm3 strains, emm1 strain biofilm formation is not enhanced by collagen. 

      We addressed major issue 4 by quantifying colocalizations in the patient biopsies and 3D tissue model experiments.

      We thank Reviewer #2 for the thorough analysis of our reported findings. The main criticism here (issue 1) concerns the question of whether binding of emm3 streptococci would differ between different types of collagen. Our collagen peptide binding assays together with the structural data identify the collagen triple helix as the binding site for M3. While collagen types differ in their distribution, functions and morphology in different tissues, they all have in common triple-helical (COL) regions with high sequence similarity that are non-specifically recognised by M3. Therefore, our data in conjunction with the body of published work showing binding of M3 to collagens I, II, III and IV suggest it is highly likely that emm3 streptococci will indeed bind to all types of collagen in the same manner. We added a statement to the manuscript to make this point more clearly. We also added a prediction of a complex between M3 and a collagen I triple-helical peptide, which supports the idea of a conserved binding mechanism for all collagen types. Whether this means all collagen types in the various tissues where they occur are targeted by emm3 streptococci is a very interesting question, however one that goes beyond the scope of our study.

      Minor issues identified by the reviewers were addressed through changes in the text and addition of figures.

      Summary of changes:

      (1) Two new authors have been added due to inclusion of additional data and analysis.

      (2) New experimental data included in section "M3-NTD harbors the collagen binding site".

      (3) Figure 3 panels A and B assigned and swapped.

      (4) Figure 4 changed to include new data and move mutant M3-NTD ITC graphs to supplement.

      (5) Table 2 corrected and amended.

      (6) AlphaFold3 quality parameters ipTM and pTM added to all figures showing predicted structures.

      (7) New supplementary figure added showing crystal packing of M3-NTD/collagen peptide complex.

      (8) Figure supplement of predicted M-protein/collagen peptide complexes includes new panel for a type I collagen peptide bound to M3.

      (9) New figure supplement showing mutant M3-NTD ITC data.

      (10) New figure supplement showing 1D ¹H NMR spectra of M3-NTD mutants.

      (11) Included data for additional M3-NTD mutants assessing role of Trp103 in collagen binding. Text extended to describe and place into context findings from ITC binding studies using these mutants.

      (12) Added quantitative analysis of biopsy and tissue model data (Manders’ overlap coefficient).

      (13) Corrected and extended table 3 to take into account new primers.

      (14) Added experimental details for new NMR and ITC experiments as well as new quantitative image analysis.

      (15) Minor adjustments to the text to improve clarity and correct errors.
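      As background to item (12), the overlap coefficient used for colocalization can be sketched as below. This is a minimal illustration of the standard formula, R = Σ(aᵢ·bᵢ) / √(Σaᵢ² · Σbᵢ²), applied to two flattened intensity channels. The pixel values are invented, and any background subtraction or thresholding used in the actual image analysis is not represented here.

```python
import math
from typing import Sequence

def manders_overlap(channel_a: Sequence[float], channel_b: Sequence[float]) -> float:
    """Overlap coefficient of two equally sized intensity channels (0 = none, 1 = full)."""
    if len(channel_a) != len(channel_b):
        raise ValueError("channels must contain the same number of pixels")
    numerator = sum(a * b for a, b in zip(channel_a, channel_b))
    denominator = math.sqrt(sum(a * a for a in channel_a) * sum(b * b for b in channel_b))
    if denominator == 0:
        raise ValueError("at least one channel has no signal")
    return numerator / denominator

# Hypothetical pixel intensities for two fluorescence channels.
bacteria = [0, 10, 20, 0, 5]
collagen = [0, 12, 18, 3, 0]
print(round(manders_overlap(bacteria, collagen), 3))
```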

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Bacterial species that frequently undergo horizontal gene transfer events tend to have genomes that approach linkage equilibrium, making it challenging to analyze population structure and establish the relationships between isolates. To overcome this problem, researchers have established several effective schemes for analyzing N. gonorrhoeae isolates, including MLST and NG-STAR. This report shows that Life Identification Number (LIN) Codes provide for a robust and improved discrimination between different N. gonorrhoeae isolates.

      Strengths:

      The description of the system is clear, the analysis is convincing, and the comparisons to other methods show the improvements offered by LIN Codes.

      Weaknesses:

      No major weaknesses were identified by this reviewer.

      We thank the reviewer for their assessment of our paper.

      Reviewer #2 (Public review):

      Summary:

      This paper describes a new approach for analyzing genome sequences.

      Strengths:

      The work was performed with great rigor and provides much greater insights than earlier classification systems.

      Weaknesses:

      A minor weakness is that the clinical application of LIN coding could be articulated in a more in-depth way. The LIN coding system is very impressive and is certainly superior to other protocols. My recommendation, although not necessary for this paper, is that the authors expand their analysis to noncoding sequences, especially those upstream of open reading frames. In this respect, important cis-acting regulatory mutations that might help to further distinguish strains could be identified.

      We thank the reviewer for their comments. LIN code could be applied clinically, for example in the analysis of antibiotic resistant isolates, or to investigate outbreaks associated with a particular lineage. We have updated the text to note this, starting at line 432.

      With regard to non-coding sequences: unfortunately, intergenic regions are generally unsuitable for use in typing systems as (i) they are subject to phase variation, which can occlude relationships based on descent; and (ii) they are inherently difficult to assemble and therefore can introduce variation due to the sequencing procedure rather than biology. For the type of variant typing that LIN code represents, which aims to replicate phylogenetic clustering, protein-encoding sequences are the best choice for convenience, stability, and accuracy. This is not to say that basing a nomenclature on intergenic regions is not a valid objective; such a nomenclature might be especially suitable for predicting some phenotypic characters, but it would still be subject to problem (ii), depending on the sequencing technology used. Such a nomenclature system should stand beside, rather than be combined with or used in place of, phylogenetic typing. However, we could certainly investigate the relationship between an isolate’s LIN code and regulatory mutations in the future.

      Reviewer #3 (Public review):

      Summary:

      In this well-written manuscript, Unitt and colleagues propose a new, hierarchical nomenclature system for the pathogen Neisseria gonorrhoeae. The proposed nomenclature addresses a longstanding problem in N. gonorrhoeae genomics, namely that the highly recombinant population complicates typing schemes based on only a few loci and that previous typing systems, even those based on the core genome, group strains at only one level of genomic divergence without a system for clustering sequence types together. In this work, the authors have revised the core genome MLST scheme for N. gonorrhoeae and devised life identification numbers (LIN) codes to describe the N. gonorrhoeae population structure.

      Strengths:

      The LIN codes proposed in this manuscript are congruent with previous typing methods for Neisseria gonorrhoeae, like cgMLST groups, NG-STAR, and NG-MAST. Importantly, they improve upon many of these methods as the LIN codes are also congruent with the phylogeny and represent monophyletic lineages/sublineages.

      The LIN code assignment has been implemented in PubMLST, allowing other researchers to assign LIN codes to new assemblies and put genomes of interest in context with global datasets.

      Weaknesses:

      The authors correctly highlight that cgMLST-based clusters can be fused due to "intermediate isolates" generated through processes like horizontal gene transfer. However, the LIN codes proposed here are also based on single linkage clustering of cgMLST at multiple levels. It is unclear if future recombination or sequencing of previously unsampled diversity within N. gonorrhoeae merges together higher-level clusters, and if so, how this will impact the stability of the nomenclature.

      The authors have defined higher resolution thresholds for the LIN code scheme. However, they do not investigate how these levels correspond to previously identified transmission clusters from genomic epidemiology studies. It would be useful for future users of the scheme to know the relevant LIN code thresholds for these investigations.

      We thank the reviewer for their insightful comments. LIN codes do use multi-level single linkage clustering to define the cluster number of isolates. However, unlike previous applications of simple single linkage clustering such as N. gonorrhoeae core genome groups (Harrison et al., 2020), once assigned in LIN code, these cluster numbers are fixed within an unchanging barcode assigned to each isolate. Therefore, the nomenclature is stable, as the addition of new isolates cannot change previously established LIN codes.

      Cluster stability was considered during the selection of allelic mismatch thresholds. By choosing thresholds based on natural breaks in population structure (Figure 3), applying clustering statistics such as the silhouette score, and by assessing where cluster stability has been maintained within the previous core genome groups nomenclature, we can have confidence that the thresholds which we have selected will form stable clusters. For example, with core genome groups there has been significant group fusion with clusters formed at a threshold of 400 allelic differences, while clustering at a threshold of 300 allelic differences has remained cohesive over time (supported by a high silhouette score) and so was selected as an important threshold in the gonococcal LIN code. LIN codes have now been applied to >27,000 isolates in PubMLST, and the nomenclature has remained effective despite the continual addition of new isolates to this collection. The manuscript emphasises these points at lines 96 and 346.
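      The stability argument can be illustrated with a simplified sketch of prefix-based code assignment. This is not the PubMLST implementation: the threshold list is an assumed, illustrative subset, the function and isolate names are hypothetical, and the nearest coded isolate and its allelic distance are taken as given inputs. The point it shows is that a new isolate receives its code relative to an existing one, while existing codes are never rewritten.

```python
from typing import Dict, Tuple

# Illustrative allelic-mismatch thresholds, coarse to fine (assumed, not the full scheme).
THRESHOLDS = (300, 25, 10, 7, 5, 3, 1, 0)
Code = Tuple[int, ...]

def shared_levels(distance: int) -> int:
    """Count the leading thresholds satisfied by the distance to the nearest coded isolate."""
    levels = 0
    for threshold in THRESHOLDS:
        if distance > threshold:
            break
        levels += 1
    return levels

def assign_code(codes: Dict[str, Code], nearest: str, distance: int) -> Code:
    """Return a code for a new isolate; codes already assigned are only read, never changed."""
    if not codes:
        return (0,) * len(THRESHOLDS)
    levels = shared_levels(distance)
    reference = codes[nearest]
    if levels == len(THRESHOLDS):
        return reference  # identical at every level
    prefix = reference[:levels]
    # Next unused number at the first level where the new isolate diverges.
    used = [code[levels] for code in codes.values() if code[:levels] == prefix]
    return prefix + (max(used) + 1,) + (0,) * (len(THRESHOLDS) - levels - 1)

codes = {"isolate_A": (0,) * len(THRESHOLDS)}
codes["isolate_B"] = assign_code(codes, "isolate_A", distance=4)
codes["isolate_C"] = assign_code(codes, "isolate_A", distance=400)
print(codes)
```

      In this sketch, adding isolate_B and isolate_C leaves the code of isolate_A untouched, which is the property that distinguishes a fixed barcode from re-running single linkage clustering on a growing dataset.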

      Work is in progress to explore what LIN code thresholds are generally associated with transmission chains. These will likely be the last 7 thresholds (25, 10, 7, 5, 3, 1, and 0 allelic differences), as previous work has suggested that isolates linked by transmission within one year are associated with <14 single nucleotide polymorphism differences (De Silva et al., 2016). The results of this analysis will be described in a future article, currently in preparation.

      Harrison, O.B., et al. Neisseria gonorrhoeae Population Genomics: Use of the Gonococcal Core Genome to Improve Surveillance of Antimicrobial Resistance. The Journal of Infectious Diseases 2020.

      De Silva, D., et al. Whole-genome sequencing to determine transmission of Neisseria gonorrhoeae: an observational study. The Lancet Infectious Diseases 2016;16(11):1295-1303.

      Reviewer #3 (Recommendations for the authors):

      (1) Data/code availability: While the genomic data and LIN codes are available in PubMLST and new isolates uploaded to PubMLST can be assigned a LIN code, it is also important to have software version numbers reported in the methods section and code/commands associated with the analysis in this manuscript (e.g. generation of core genome, statistical analysis, comparison with other typing methods) documented in a repository like GitHub.

      Software version numbers have been added to the manuscript. Scripts used to run the software have been compiled and documented on protocols.io, DOI: dx.doi.org/10.17504/protocols.io.4r3l21beqg1y/v1

      (2) Line 37: Missing "a" before "multi-drug resistant pathogen".

      This has been corrected in the text.

      (3) Line 60: Typo in geoBURST.

      The text refers to a tool called goeBURST (global optimal eBURST) as described in Francisco, A.P. et al., 2009. DOI: 10.1186/1471-2105-10-152. Therefore, “geoBURST” would be incorrect.

      (4) Line 136-138: It might be helpful to discuss how premature stop codons are treated in this scheme. Often in isolates with alleles containing early premature stop codons, annotation software like prokka will annotate two separate ORFs, which are then clustered with pangenome software like PIRATE. How does the cgMLST scheme proposed here treat premature stop codons? Are sequences truncated at the first stop codon, or is the nucleotide sequence for the entire gene used even if it is out of frame?

      In PubMLST, alleles with premature stop codons are flagged, but otherwise annotated from the typical start to the usual stop codon, if still present. This also applies to frameshift mutations – a new unique allele will be annotated, but flagged as frameshift. In both cases, each new allele with a premature stop codon or frameshift will require human curator involvement to be assigned, to ensure rigorous allele assignment. As the Ng cgMLST v2 scheme prioritised readily auto-annotated genes, loci which are prone to internal stop codons or frameshifts with inconsistent start/end codons are excluded from the scheme. The text has been updated at line 128 to mention this.

      (5) Line 213-214: What were the versions of software and parameters used for phylogenetic tree construction?

      Version numbers have been added to the text between lines 214-219. Parameters have been included with the scripts documented at protocols.io DOI: dx.doi.org/10.17504/protocols.io.4r3l21beqg1y/v1

      (6) Line 249: K. pneumoniae may also be a more diverse/older species than N. gonorrhoeae.

      The text has been updated at line 252-253 to emphasize the difference in diversity. The age of N. gonorrhoeae as a species is a matter of scientific debate, and out of the scope of this paper to discuss.

      (7) Line 278-279: Were some isolates unable to be typed, or have they just been added since the LIN code assignment occurred?

      Some genomes cannot be assigned a LIN code due to poor genome quality. A minimum of 1405/1430 core genes must have an allele designated for a LIN code to be assigned. Genomes with large numbers of contigs may not meet this requirement. LIN code assignment is an ongoing process that occurs on a weekly basis in PubMLST, performed in batches starting at 23:00 (UK local time) on Sundays. The text has been updated to describe this at lines 196 and 282-283.
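      The completeness requirement amounts to a simple check, sketched below. The two numbers come from the statement above; the function name is hypothetical and the check is a simplification of whatever quality control is applied in practice.

```python
CORE_LOCI = 1430       # loci in the core genome scheme, as stated above
MIN_DESIGNATED = 1405  # minimum number of loci with an allele designated

def eligible_for_lin_code(designated_loci: int) -> bool:
    """True if enough core loci have an allele designated for a LIN code to be assigned."""
    return designated_loci >= MIN_DESIGNATED

print(CORE_LOCI - MIN_DESIGNATED)            # number of missing loci tolerated
print(f"{MIN_DESIGNATED / CORE_LOCI:.1%}")   # required completeness
print(eligible_for_lin_code(1380))           # a fragmented assembly may fall short
```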

      (8) Line 314-315: Was BAPS rerun on the dataset used in this manuscript, or is this based on previously assigned BAPS groups?

      This was based on previously assigned BAPS groups, as described between lines 315-320.

      (9) Line 421-423: Are there options for assigning LIN codes that do not require uploading genomes to PubMLST? I can imagine that there may be situations where researchers or public health institutions cannot share genomic data prior to publication.

      Isolate data does not need to be shared to be uploaded and assigned a LIN code in PubMLST. Data owners can create a private dataset within PubMLST, viewable only to them, on which automated assignment will be performed. LIN code assignment requires a central repository of genomes, as new codes are assigned in relation to existing ones. The text has been updated to emphasize this at lines 197 and 427.

      (10) Figure 6: How is this tree rooted? Additionally, do isolates that have unannotated LIN codes represent uncommon LIN codes or were those isolates not typed?

      The tree has been left unrooted, as it is being used to visualise the relationships between the isolates rather than to explore ancestry. Detail on what LIN codes have been annotated can be found in the figure legend, which describes that the 21 most common LIN code lineages in this 1000 isolate dataset have been labelled. All 1000 isolates used in the tree had a LIN code assigned, but to ensure good legibility not all lineages were annotated on the tree. The legend has been updated to improve clarity.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review):

      The weaknesses of the study include the following.

      (1)  It remains unclear how CDK is regulated during viral infection and how it specifically recruits E3 ligase to TBK1.

      We would like to express our gratitude to the reviewer for highlighting this significant issue. The present study demonstrates that CDK2 expression is significantly upregulated upon SVCV infection in multiple fish tissues and cell lines (see Fig. 1C-F), thus suggesting that viral infection triggers CDK2 induction. However, the precise upstream signaling pathways that regulate CDK2 during viral infection remain to be fully elucidated. It is hypothesized that viral RNA sensors may activate transcription factors that bind to the cdk2 promoter; however, further investigation is required to confirm this. We have added a sentence in the Discussion (Lines 409-412) acknowledging this as a limitation and a focus for future work, suggesting potential involvement of viral sensor pathways.

      With regard to the mechanism by which CDK2 recruits the E3 ligase Dtx4 to TBK1, evidence is provided that CDK2 directly interacts with both TBK1 (via its kinase domain) and Dtx4 (see Fig. 4F-I, 6A-C). Furthermore, evidence is presented demonstrating that CDK2 enhances the interaction between Dtx4 and TBK1 (Fig. 6D), thus suggesting that CDK2 functions as a scaffold protein to facilitate the formation of a ternary complex. However, further study is required to ascertain the precise structural basis of this interaction, including whether CDK2's kinase activity is required. We have added a note in the Discussion (Lines 417-421) acknowledging this limitation and proposing future structural studies to elucidate the precise binding interfaces.

      (2) The implications and mechanisms for a relationship between the cell cycle and IFN production will be a fascinating topic for future studies.

      We concur with the reviewer's assertion that the interplay between cell cycle progression and innate immunity constitutes a promising and under-explored research domain. Whilst the present study concentrates on the function of CDK2 in antiviral signaling, independent of its cell cycle functions, it is acknowledged that CDK2's activity is cell cycle-dependent. It is hypothesized that CDK2 may function as a molecular link between cell proliferation and immune responses, particularly in light of the observation that viral infections frequently modify host cell cycle progression. In the Discussion (lines 387-391), we now briefly propose a model wherein CDK2 activity during the S phase may suppress TBK1-mediated IFN production to allow viral replication, while CDK2 inhibition (e.g., in G1) may enhance IFN responses. This hypothesis will be the subject of our future work, including cell cycle synchronization experiments and time-course analyses of CDK2 activity and IFN output during infection.

      Reviewer #1 (Recommendations for the authors):

      (1) A control showing that the CDK2 inhibitor blocked kinase activity would be appropriate.

      We thank the reviewer for this suggestion. We have performed experiments using the CDK2-specific inhibitor SNS-032. As shown in Author response image 1, treatment of EPC cells with SNS-032 (2 µM) still affects TBK1 expression. The inhibitor was selected on the basis of literature references (refs. 1 and 2), and it is uncertain whether it directly inhibits the kinase activity of CDK2. Nevertheless, our results demonstrate that CDK2 retains the capacity to degrade TBK1 even in the absence of its kinase domain (Fig. 6I), which is consistent with the outcome obtained with this inhibitor.

      Author response image 1.

      References:

      (1) Mechanism of action of SNS-032, a novel cyclin-dependent kinase inhibitor, in chronic lymphocytic leukemia. Blood. 2009 May 7;113(19):4637-45.

      (2) SNS-032 is a potent and selective CDK 2, 7 and 9 inhibitor that drives target modulation in patient samples. Cancer Chemother Pharmacol. 2009 Sep;64(4):723-32.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewers 1:

      Summary:

      The authors investigated the potential role of IgG N-glycosylation in Haemorrhagic Fever with Renal Syndrome (HFRS), which may offer significant insights for understanding molecular mechanisms and for the development of therapeutic strategies for this infectious disease.

      While the majority of the issues have been addressed, a few minor points still remain unresolved. Quality control should be conducted prior to the analysis of clinical samples. However, the coefficient of variation (CV) value was not provided for the paired acute and convalescent-phase samples from 65 confirmed HFRS patients, which were analyzed to assess inter-individual biological variability. It is important to note that biological replication should be evaluated using general samples, such as standard serum.

      We thank the reviewer for this insightful and critical comment regarding the quality control of our analytical data and the assessment of biological variability. We agree that this is essential for validating the reliability of our findings. We have now provided the requested CV data and clarified this point in the revised manuscript as detailed below.

      "This dual-replicate strategy enabled a comprehensive evaluation of both biological heterogeneity and assay precision, and the coefficient of variation for samples were below 16%." Please see the Materials and Methods (Page 16, lines 360-362, and Author response table 1).
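      For readers unfamiliar with the metric, the coefficient of variation (CV) quoted above is the sample standard deviation divided by the mean, expressed as a percentage. The following is a minimal sketch of that calculation; the function name and the replicate values are hypothetical illustrations and are not taken from the study's data or analysis pipeline.

```python
import statistics


def coefficient_of_variation(values):
    """Return the CV in percent: sample standard deviation / mean * 100."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    # statistics.stdev uses the sample (n - 1) denominator.
    return statistics.stdev(values) / mean * 100.0


# Hypothetical replicate measurements of one glycan trait (arbitrary units).
replicates = [10.0, 11.0, 9.0, 10.0]
cv_percent = coefficient_of_variation(replicates)
```

      Under this definition, a set of replicates would pass a 16% threshold when `cv_percent < 16`; whether the authors used the sample or the population standard deviation is not stated in the response.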

      Author response table 1.

      Comparative analysis of serum biomarker concentrations in acute and convalescent phase cohorts.

      Reviewers 2:

      This work sought to explore antibody responses in the context of hemorrhagic fever with renal syndrome (HFRS) - a severe disease caused by Hantaan virus infection. Little is known about the characteristics or functional relevance of IgG Fc glycosylation in HFRS. To address this gap, the authors analyzed samples from 65 patients with HFRS spanning the acute and convalescent phases of disease via IgG Fc glycan analysis, scRNAseq, and flow cytometry. The authors observed changes in Fc glycosylation (increased fucosylation and decreased bisection) coinciding with a 4-fold or greater increase in Hantaan virus-specific antibody titer. The study also includes exploratory analyses linking IgG glycan profiles to glycosylation-related gene expression in distinct B cell subsets, using single-cell transcriptomics. Overall, this is an interesting study that combines serological profiling with transcriptomic data to shed light on humoral immune responses in an underexplored infectious disease. The integration of Fc glycosylation data with single-cell transcriptomic data is a strength. The authors have addressed the major concerns from the initial review. However, one point to emphasize is that the data are correlative. While the associations between Fc glycosylation changes and recovery are intriguing, the evidence does not establish causation. This is not a weakness, as correlative studies can still be highly valuable and informative. However, the manuscript would be strengthened by making this distinction clear, particularly in the title.

      The verb "accelerated" in the title implies that the glycosylation state of IgG was a direct driver of recovery, rather than something that correlated with recovery. Thus, a more neutral word/phrase would be ideal.

      We sincerely thank the reviewer for this insightful suggestion. We agree that the use of "accelerated" might overstate the potential role of IgG glycosylation, which has not been clearly established by our current findings. As reported in the Results (particularly in Figure 2), partial glycosylation exhibits statistically significant variations between seropositive and seronegative statuses, before and after seroconversion, and across different HTNV-NP-specific antibody titers. Therefore, we have replaced "accelerated" with "contribute to" in the Title: "Glycosylated IgG antibodies contribute to the recovery of haemorrhagic fever with renal syndrome patients".

    1. Author response:

      Reviewer #1 (Public review):

      The microbiota of Dactylorhiza traunsteineri, an endangered marsh orchid, forms complex root associations that support plant health. Using 16S rRNA sequencing, we identified dominant bacterial phyla in its rhizosphere, including Proteobacteria, Actinobacteria, and Bacteroidota. Deep shotgun metagenomics revealed high-quality MAGs with rich metabolic and biosynthetic potential. This study provides key insights into root-associated bacteria and highlights the rhizosphere as a promising source of bioactive compounds, supporting both microbial ecology research and orchid conservation.  

      The manuscript presents an investigation of the bacterial communities in the rhizosphere of D. traunsteineri using advanced metagenomic approaches. The topic is relevant, and the techniques are up-to-date; however, the study has several critical weaknesses.  

      We thank the reviewer for their careful reading of our manuscript and for the constructive comments. We will revise the manuscript substantially. Our responses to the specific points are below:

      (1) Title: The current title is misleading. Given that fungi are the primary symbionts in orchids and were not analyzed in this study (nor were they included among other microbial groups), the use of the term "microbiome" is not appropriate. I recommend replacing it with "bacteriome" to better reflect the scope of the work.

      In the revised manuscript, we will expand the Results (shotgun sequencing) and Discussion to also include fungal taxa. With these additions, the use of the term microbiome will accurately reflect the inclusion of both bacterial and fungal components.

      (2) Line 124: The phrase "D. traunsteineri individuals were isolated" seems misleading. A more accurate description would be "individuals were collected", as also mentioned in line 128.

      This ambiguity will be corrected in the revised manuscript.

      (3) Experimental design: The major limitation of this study lies in its experimental design. The number of plant individuals and soil samples analyzed is unclear, making it difficult to assess the statistical robustness of the findings. It is also not well explained why the orchids were collected two years before the rhizosphere soil samples. Was the rhizosphere soil collected from the same site and from remnants of the previously sampled individuals in 2018? This temporal gap raises serious concerns about the validity of the biological associations being inferred.

      In the revised manuscript, we will explicitly state the number of individuals and soil samples included in the study, and we will more clearly describe the sequence of sampling events. We will also add a dedicated statement in the Discussion addressing the temporal gap between plant sampling and rhizosphere soil collection, acknowledging that this is a limitation of the study.

      (4) Low sample size: In lines 249-251 (Results section), the authors mention that only one plant individual was used for identifying rhizosphere bacteria. This is insufficient to produce scientifically robust or generalizable conclusions.

      In the revised manuscript, we will clearly state that only one rhizosphere sample was available and will frame the study as exploratory in nature. We will explicitly acknowledge this limitation in both the Methods and Discussion, and we will temper our conclusions accordingly.

      (5) Contextual limitations: Numerous studies have shown that plant-microbe interactions are influenced by external biotic and abiotic factors, as well as by plant age and population structure. These elements are not discussed or controlled for in the manuscript. Furthermore, the ecological and environmental conditions of the site where the plants and soil were collected are poorly described. The number of biological and technical replicates is also not clearly stated.

      In the revised manuscript, we will expand the description of the collection site and environmental conditions to the extent supported by our records. We will also clearly state the number of biological and technical replicates used for each analysis. In the Discussion, we will explicitly acknowledge that plant age, environmental variables, and other biotic/abiotic factors may influence plant–microbe interactions and were not directly assessed in this study.

      (6) Terminology: Throughout the manuscript, the authors refer to the "microbiome," though only bacterial communities were analyzed. This terminology is inaccurate and should be corrected consistently.

      As noted in our response to point (1), we will revise terminology throughout the manuscript to ensure consistency and to accurately reflect the expanded bacterial and fungal coverage in the revised version.

      Reviewer #2 (Public review):

      The authors aim to provide an overview of the D. traunsteineri rhizosphere microbiome on a taxonomic and functional level, through 16S rRNA amplicon analysis and shotgun metagenome analysis. The amplicon sequencing shows that the major phyla present in the microbiome belong to phyla with members previously found to be enriched in rhizospheres and bulk soils. Their shotgun metagenome analysis focused on producing metagenome assembled genomes (MAGs), of which one satisfies the MIMAG quality criteria for high-quality MAGs and three those for medium-quality MAGs. These MAGs were subjected to functional annotations focusing on metabolic pathway enrichment and secondary metabolic pathway biosynthetic gene cluster analysis. They find 1741 BGCs of various categories in the MAGs that were analyzed, with the high-quality MAG being claimed to contain 181 SM BGCs. The authors provide a useful, albeit superficial, overview of the taxonomic composition of the microbiome, and their dataset can be used for further analysis.

      The conclusions of this paper are not well-supported by the data, as the paper only superficially discusses the results, and the functional interpretation based on taxonomic evidence or generic functional annotations does not allow drawing any conclusions on the functional roles of the orchid microbiota.  

      We thank the reviewer for their thoughtful and constructive assessment of our manuscript. The comments have been very helpful in identifying areas where the clarity, structure, and interpretation of our work can be improved. Our responses to the specific points are below:

      (1) The authors only used one individual plant to take samples. This makes it hard to generalize about the natural orchid microbiome.

      We agree with the reviewer that the limited number of plant individuals restricts the generality of the conclusions. In the revised manuscript, we will clearly state that only one rhizosphere sample was available for analysis and will frame the study as exploratory. We will also explicitly acknowledge this limitation in the Discussion and ensure that our interpretations and conclusions remain appropriately cautious.

      (2) The authors use both 16S amplicon sequencing and shotgun metagenomics to analyse the microbiome. However, the authors barely discuss the similarities and differences between the results of these two methods, even though comparing these results may be able to provide further insights into the conclusions of the authors. For example, the relative abundance of the ASVs from the amplicon analysis is not linked to the relative abundances of the MAGs.

      In the revised manuscript, we will expand the Results and Discussion to include a clearer comparison between the taxonomic profiles derived from 16S amplicon sequencing and those obtained from shotgun metagenomic binning.

      (3) Furthermore, the authors discuss that phyla present in the orchid microbiome are also found in other microbiomes and are linked to important ecological functions. However, their results reach further than the phylum level, and a discussion of genera or even species is lacking. The phyla that were found have very large within-phylum functional variability, and reliable functional conclusions cannot be drawn based on taxonomic assignment at this level, or even the genus level (Yan et al. 2017).

      In the revised manuscript, we will incorporate taxonomic discussion at finer resolution where reliable assignments are available. We will also revise the Discussion to avoid overinterpreting phylum-level taxonomy in terms of ecological function.

      (4) Additionally, although the authors mention their techniques used, their method section is sometimes not clear about how samples or replicates were defined. There are also inconsistencies between the methods and the results section, for example, regarding the prediction of secondary metabolite biosynthetic gene clusters (BGCs).

      In the revised Methods section, we will clearly define the number and type of samples included in each analysis, specify the number of replicates and how they were handled, and provide a clearer description of the biosynthetic gene cluster (BGC) prediction workflow, including the tools used and how results were interpreted. 

      (5) The BGC prediction was done with several tools, and the unusually high number of found BGCs (181 in their high-quality MAG) is likely due to false positives or fragmented BGCs. The numbers are much higher than any numbers ever reported in literature supported by functional evidence (Amos et al, 2017), even in a prolific genus like Streptomyces (Belknap et al., 2020). This caveat is not discussed by the authors.

      We thank the reviewer for this important point. Our original intention was to present the BGC predictions as a resource for future exploration, which is why multiple tools were used. However, we understand how this approach may lead to confusion, particularly regarding the confidence level of the predicted clusters and the potential inflation of counts due to assembly fragmentation or tool sensitivity. In the revised manuscript, we will thoroughly revise this section to clearly distinguish high-confidence predictions from more exploratory findings. We will focus on results supported by stronger evidence, explicitly qualify lower-confidence predictions as putative, and temper any functional interpretations accordingly.

      (6) The authors have generated one high-quality MAG and three medium-quality MAGs. In the discussion, they present all four of these as high-quality, which could be misleading. The authors discuss what was found in the literature about the role of the bacterial genera/phyla linked to these MAGs in plant rhizospheres, but they do not sufficiently link their own analysis results (metabolic pathway enrichment and biosynthetic gene cluster prediction) to this discussion. The results of these analyses are only presented in tables without further explanation in either the results section or the discussion, even though there may be interesting findings. For example, the authors only discuss the class of the BGCs that were found, but don't search for experimentally verified homologs in databases, which could shed more light on the possible functional roles of BGCs in this microbiome.

      In the revised manuscript, we will ensure that MAG quality is described accurately and consistently throughout, distinguishing clearly between high-quality and medium-quality bins according to accepted standards.

      (7) In the conclusions, the authors state: "These analyses uncovered potential metabolic capabilities and biosynthetic potentials that are integral to the rhizosphere's ecological dynamics." I don't see any support for this. Mentioning that certain classes of BGCs are present is not enough to make this claim, in my opinion. Any BGC is likely important for the ecological niche the bacteria live in. The fact that rhizosphere bacteria harbour BGCs is not surprising, and it doesn't tell us more than is already known.

      In the revised manuscript, we will rewrite the conclusion to reflect a more cautious interpretation, focusing on the potential metabolic and biosynthetic capabilities suggested by the data without asserting ecological roles that cannot be directly supported. These capabilities will be presented as hypotheses for future investigation rather than established ecological features.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      The authors used fluorescence microscopy, image analysis, and mathematical modeling to study the effects of membrane affinity and diffusion rates of MinD monomer and dimer states on MinD gradient formation in B. subtilis. To test these effects, the authors experimentally examined MinD mutants that lock the protein in specific states, including Apo monomer (K16A), ATP-bound monomer (G12V), and ATP-bound dimer (D40A, hydrolysis defective), and compared them to wild-type MinD. Overall, the experimental results support the conclusion that reversible membrane binding of MinD is critical for the formation of the MinD gradient, but that the binding affinities between monomers and dimers are similar.

      The modeling part is a new attempt to use the Monte Carlo method to test the conditions for the formation of the MinD gradient in B. subtilis. The modeling results provide good support for the observations and find that the MinD gradient is sensitive to different diffusion rates between monomers and dimers. This simulation is based on several assumptions and predictions, which raises new questions that need to be addressed experimentally in the future. However, the current story is sufficient without testing these assumptions or predictions.

      Reviewer #2 (Public review): 

      Summary:  

      Bohorquez et al. investigate the molecular determinants of intracellular gradient formation in the B. subtilis Min system. To this end, they generate B. subtilis strains that express MinD mutants that are locked in the monomeric or dimeric states, and also MinD mutants with amphipathic helices of varying membrane affinity. They then assess the mutants' ability to bind to the membrane and form gradients using fluorescence microscopy in different genetic backgrounds. They find that, unlike in the E. coli Min system, the monomeric form of MinD is already capable of membrane binding. They also show that MinJ is not required for MinD membrane binding and only interacts with the dimeric form of MinD. Using kinetic Monte Carlo simulations, the authors then test different models for gradient formation, and find that a MinD gradient along the cell axis is only formed when the polarly localized protein MinJ stimulates dimerization of MinD, and when the diffusion rate of monomeric and dimeric MinD differs. They also show that differences in the membrane affinity of MinD monomers and dimers are not required for gradient formation.

      Strengths:  

      The paper offers a comprehensive collection of the subcellular localization and gradient formation of various MinD mutants in different genetic backgrounds. In particular, the comparison of the localization of these mutants in a delta MinC and MinJ background offers valuable additional insights. For example, they find that only dimeric MinD can interact with MinJ. They also provide evidence that MinD locked in a dimer state may co-polymerize with MinC, resulting in a speckled appearance.  

      The authors introduce and verify a useful measure of membrane affinity in vivo.  

      The modulation of the membrane affinity by using distinct amphipathic helices highlights the robustness of the B. subtilis MinD system, which can form gradients even when the membrane affinity of MinD is increased or decreased.  

      Weaknesses:  

      The main claim of the paper, that differences in the membrane affinity between MinD monomers and dimers are not required for gradient formation, does not seem to be supported by the data. The only measure of membrane affinity presented is extracted from the transverse fluorescence intensity profile of cells expressing the mGFP-tagged MinD mutants. The authors measure the valley-to-peak ratio of the profile, which is lower than 1 for proteins binding to the membrane and higher than 1 for cytosolic proteins. To verify this measure of membrane affinity, they use a membrane dye and a soluble GFP, which results in values of ~0.75 and ~1.25, respectively. They then show that all MinD mutants have a value - roughly in the range of 0.8-0.9 - and they use this to claim that there are no differences in membrane affinity between monomeric and dimeric versions.  

      While this way to measure membrane affinity is useful to distinguish between binders and non-binders, it is unclear how sensitive this assay is, and whether it can resolve more subtle differences in membrane affinity, beyond the classification into binders and non-binders. A dimer with two amphipathic helices should have a higher membrane affinity than a monomer with only one such copy. Thus, the data does not seem to support the claim that "the different monomeric mutants have the same membrane affinity as the wildtype MinD". The data only supports the claim that B. subtilis MinD monomers already have a measurable membrane affinity, which is indeed a difference from the E. coli Min system.  
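      To make the measure under discussion concrete, the following is a minimal sketch of one way a valley-to-peak ratio could be computed from a transverse intensity profile. It assumes that the "valley" is the intensity at the cell centre and the "peak" is the mean of the highest intensities near the two cell edges; the authors' exact procedure is not described here, so the function name, the `edge_width` parameter, and the example profiles are all hypothetical.

```python
def valley_to_peak_ratio(profile, edge_width=3):
    """Ratio of the central intensity to the mean of the two edge maxima.

    profile    : intensities sampled across the cell width (edge to edge)
    edge_width : number of samples at each end searched for the edge maximum
                 (an assumed parameter, not taken from the manuscript)
    """
    n = len(profile)
    if n < 2 * edge_width + 1:
        raise ValueError("profile is too short for the chosen edge_width")
    left_peak = max(profile[:edge_width])
    right_peak = max(profile[-edge_width:])
    centre = profile[n // 2]
    return centre / ((left_peak + right_peak) / 2.0)


# Hypothetical profiles, chosen only to illustrate the two regimes.
membrane_like = [1, 4, 8, 7, 6, 7, 8, 4, 1]   # two edge peaks, central dip
cytosolic_like = [1, 2, 4, 5, 5, 5, 4, 2, 1]  # single central plateau

membrane_ratio = valley_to_peak_ratio(membrane_like)
cytosolic_ratio = valley_to_peak_ratio(cytosolic_like)
```

      With these made-up profiles the ratio falls below 1 for the membrane-like case and above 1 for the cytosolic-like case. The sketch also illustrates the reviewer's concern: the ratio compresses the whole profile into a single number, so modest differences in membrane affinity may produce changes that are small relative to measurement noise.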

      While their data does show that a stark difference between monomer and dimer membrane affinity may not be required for gradient formation in the B. subtilis case, it is also not prevented if the monomer is unable to bind to the membrane. They show this by replacing the native MinD amphipathic helix with the weak amphipathic helix NS4AB-AH. According to their membrane affinity assay, NS4AB-AH does not bind to the membrane as a monomer (Figure 4D), but when this helix is fused to MinD, MinD is still capable of forming a gradient (albeit a weaker one). Since the authors make a direct comparison to the E. coli MinDE systems, they could have used the E. coli MinD MTS instead or in addition to the NS4AB-AH amphipathic helix. The reviewer suspects that a fusion of the E. coli MinD MTS to B. subtilis MinD may also support gradient formation.  

      The paper contains insufficient data to support the many claims about cell filamentation and minicell formation. In many cases, statements like "did not result in cell filamentation" or "restored cell division" are only supported by a single fluorescence image instead of a quantitative analysis of cell length distribution and minicell frequency, as the one reported for a subset of the data in Figure 5.  

      The paper would also benefit from a quantitative measure of gradient formation of the distinct MinD mutants, instead of relying on individual fluorescent intensity profiles.  

      The authors compare their experimental results with the oscillating E. coli MinDE system and use it to define some of the rules of their Monte Carlo simulation. However, the description of the E. coli Min system is sometimes misleading or based on outdated findings.

      The Monte Carlo simulation of the gradient formation in B. subtilis could benefit from a more comprehensive approach:

      (1) While most of the initial rules underlying the simulation are well justified, the authors do not implement or test two key conditions:

      (a) Cooperative membrane binding, which is a key component of mathematical models for the oscillating E. coli Min system. This cooperative membrane binding has recently been attributed to MinD or MinCD oligomerization on the membrane and has been experimentally observed in various instances; in fact, the authors themselves show data supporting the formation of MinCD copolymers.  

      (b) Local stimulation of the ATPase activity of MinD, which triggers the dimer-to-monomer transition; E. coli MinD ATP hydrolysis is stimulated by the membrane and by MinE, so B. subtilis MinD may also be stimulated by the membrane and/or other components like MinJ.

      Instead, the authors claim that (a) would only increase differences in diffusion between the monomer and different oligomeric species, and that a 2-fold increase in dimerization on the membrane could not induce gradient formation in their simulation, in the absence of MinJ stimulating gradient formation. However, a 2-fold increase in dimerization is likely far too low to explain any cooperative membrane binding observed for the E. coli Min system. Regarding (b), they also claim that implementing stimulation of ATP hydrolysis on the membrane (dimer-to-monomer transition) would not change the outcome, but no simulation result for this condition is actually shown.

      (2) To generate any gradient formation, the authors claim that they would need to implement stimulation of dimer formation by MinJ, but they themselves acknowledge the lack of any experimental evidence for this assertion. They then test all other conditions (e.g., differences in membrane affinity, diffusion, etc.) in addition to the requirement that MinJ stimulates dimer formation. It is unclear whether the authors tested all other conditions independently of the "MinJ induces dimerization" condition, and whether either of those alone or in combination could also lead to gradient formation. This would be an important test to establish the validity of their claims.

      Reviewer #3 (Public review): 

      This important study by Bohorquez et al. examines the determinants necessary for concentrating the spatial modulator of cell division, MinD, at the future site of division and the cell poles. Proper localization of MinD is necessary to bring the division inhibitor, MinC, in proximity to the cell membrane and cell poles where it prevents aberrant assembly of the division machinery. In contrast to E. coli, in which MinD oscillates from pole to pole courtesy of a third protein MinE, how MinD localization is achieved in B. subtilis - which does not encode a MinE analog - has remained largely a mystery. The authors present compelling data indicating that MinD dimerization is dispensable for membrane localization but required for concentration at the cell poles. Dimerization is also important for interactions between MinD and MinC, leading to the formation of large protein complexes. Computational modeling, specifically a Monte Carlo simulation, supports a model in which differences in diffusion rates between MinD monomers and dimers lead to the concentration of MinD at cell poles. Once there, interaction with MinC increases the size of the complex, further reinforcing diffusion differences. Notably, interactions with MinJ, which has previously been implicated in MinCD localization, are dispensable for concentrating MinD at cell poles, although MinJ may help stabilize the MinCD complex at those locations.

      Reviewer #1 (Recommendations for the authors):  

      (1) The title could be modified to better reflect the emphasis on MinD monomer and dimer diffusion rather than the fact that membrane affinity is not important in MinD gradient formation. In addition, because membrane association requires affinity for the membrane, this title seems inconsistent with statements in the main text, such as Lines 246-247: a reversible membrane association is important for the formation of a MinD gradient along the cell axis.

      We agree with the reviewer that the title can be more accurate, and we have now changed it to “Membrane affinity difference between MinD monomer and dimer is not crucial to MinD gradient formation in Bacillus subtilis”

      (2) This paper reports that the difference in diffusion rates between MinD monomers and dimers is an important factor in the formation of Bs MinD gradients. However, one can argue for the importance of MinD monomers in the cellular context. Since the abundance of ATP in cells often far exceeds the abundance of MinD protein molecules under experimental conditions, MinD can easily form dimers in the cytoplasm. How does the author address this problem?  

      It is a good point that the ATP concentration in the cell likely favours dimers in the cytoplasm. However, what is important in our model is that there is cycling between monomer and dimer, rather than where exactly this happens. In fact, the gradient works essentially equally well if dimers can become monomers only whilst they are at the membrane, as we mentioned in the manuscript (lines 324-326 in the original manuscript). This simulation was not shown in the original manuscript, and we have now included it in the new Fig. 8D & E.

      (3) The claim "This oscillating gradient requires cycling of MinD between a monomeric cytosolic and a dimeric membrane attached state." (Lines 46, 47) is not well supported by most current studies and needs to be revised since, to my knowledge, most proposed models do not consider the monomer state. The basic reaction steps of Ec Min oscillations include ATP-bound MinD dimers attaching to the membrane that subsequently recruit more MinD dimers and MinE dimers to the membrane; MinE interactions stimulate ATP hydrolysis in MinD, leading to dissociation of ADP-bound MinD dimers from the membrane; nucleotide exchange occurs in the cytoplasm.

      Here the reviewer refers to a sentence in a short “Importance” abstract that we had added. In fact, such an abstract is not necessary, so we have removed it. Of note, the E. coli MinD oscillation, including the role of MinE, is described in detail in the Introduction.

      A recent reference is a paper by Heermann et al. (2020; doi: 10.1016/j.jmb.2020.03.012), which considers the MinD monomer state, which is not mentioned in this work. How do their observations compare to this work?  

      The Heermann paper mentions that MinD bound to the membrane displays an interface for multimerization, and that this contributes to the local self-enhancement of MinD at the membrane. In our Discussion, we do mention that E. coli MinD can form polymers in vitro and that any multimerization of MinD dimers will further increase the diffusion difference between monomer and dimer, and might contribute to the formation of a protein gradient (lines 459-467). We have now included a reference to the Heermann paper (line 461).

      (4) Throughout the manuscript, errors in citing references were found in several places.                 

      We have corrected this where suggested.

      (5) The introduction may be somewhat misleading due to mixed information from experimental cellular results, in vitro reconstructions, and theoretical models in cells or in vitro environments. Some models consider space constraints, while others do not. Modifications are recommended to clarify differences.  

      See below for responses 

      (6) The citation for MinD monomers:

      The paper by Hu and Lutkenhaus (2003, doi: 10.1046/j.1365-2958.2003.03321.x.) contains experimental evidence showing monomer-dimer transition using purified proteins. Another paper by the same laboratory (Park et al. 2012, doi: 10.1111/j.1365-2958.2012.08110.x.) explained how ATP-induced dimerization, but this paper is not cited.  

      The Park et al. 2012 paper focuses on the asymmetric activation of MinD ATPase by MinE, which goes beyond the scope of our work. However, we have cited several other papers from the Lutkenhaus lab, including the Wu et al. 2011 paper describing the structure of the MinD-ATP complex.

      Other evidence comes from structural studies of Archaea Pyrococcus furiosus (1G3R) and Pyrococcus horikoshii (1ION), and thermophilic Aquifex aeolicus (4V01, 4V02, 4V03). As they may function differently from Ec MinD, they are less relevant to this manuscript.

      We agree. 

      (7) Lines 65, 66: Using the term 'a reaction-diffusion couple' to describe the biochemical facts by citing references of Hu and Lutkenhaus (1999) and Raskin and de Boer (1999) is not appropriate. The idea that the Min system behaves as a reaction-diffusion system was started by Howard et al. (2001), Meinhardt and de Boer (2001), and Huang et al. (2003). In addition, references for MinE oscillation are missing.

      We have now corrected this (line 52).

      (8) Lines 77-79: Citations are incorrect.

      ATP-induced dimerization: Hu and Lutkenhaus (2003, DOI: 10.1046/j.1365-2958.2003.03321.x), Park et al. (2012). C-terminal amphipathic helix formation: Szeto et al. (2003), Hu and Lutkenhaus (2003, DOI: 10.1046/j.1365-2958.2003.03321.x).

      Citations have been corrected.

      (9) Line 78: The C-terminal amphipathic helix is not pre-formed and then exposed upon conformational change induced by ATP-binding. This alpha-helical structure is an induced fold upon interaction with membranes as experimentally demonstrated by Szeto et al. (2003).  

      We have adjusted the text to correct this (lines 64-66).

      (10) Line 102: 'cycles between membrane association and dissociation of MinD' also requires MinE in addition to ATP.

      We believe that in the context of this sentence and following paragraph it is not necessary to again mention MinE, since it is focused on parallels between the E. coli and B. subtilis MinD membrane binding cycles.

      (11) In the introduction, could the author briefly explain to a general audience the difference between Monte Carlo and reaction-diffusion methods? How do different algorithms affect the results?

      The main difference between the kinetic Monte Carlo and typical reaction-diffusion methods that is relevant to our work is that the first is particle-based and naturally includes statistical fluctuations (noise), whereas the second is field-based and, in its normal implementation, deterministic, so it does not include noise. Whilst noise can in principle be included in field-based reaction-diffusion methods, this is rarely done. Additionally, although we do not do this here, kinetic Monte Carlo can in principle also account for particle shape (sphere versus rod) or for localized interactions (such as sticky patches on the surface); kinetic Monte Carlo is therefore more microscopic in nature. We have now briefly described the difference in lines 102-105.
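
To make the distinction concrete, the following is a minimal sketch and not the simulation code used in the manuscript. It assumes a one-dimensional lattice with reflecting ends; the function names, the lattice size, and the dimensionless parameter `d` (standing in for D·dt/dx²) are all hypothetical choices made for illustration. The field-based update evolves a concentration profile deterministically, whereas the particle-based update moves individual random walkers and therefore carries sampling noise.

```python
import random


def field_step(conc, d):
    """One explicit finite-difference diffusion step (field-based, deterministic).

    conc: concentrations per lattice site; d = D*dt/dx^2, must be <= 0.5.
    Reflecting ends are imposed by mirroring the boundary value.
    """
    last = len(conc) - 1
    new = []
    for i, c in enumerate(conc):
        left = conc[i - 1] if i > 0 else c
        right = conc[i + 1] if i < last else c
        new.append(c + d * (left - 2 * c + right))
    return new


def particle_step(positions, n_sites, d, rng):
    """One random-walk step (particle-based, stochastic).

    Each particle hops left with probability d, right with probability d,
    and otherwise stays; hops beyond the ends are rejected (reflecting).
    """
    out = []
    for x in positions:
        r = rng.random()
        if r < d:
            x = max(x - 1, 0)
        elif r < 2 * d:
            x = min(x + 1, n_sites - 1)
        out.append(x)
    return out


def run_field(n_sites, n_particles, steps, d):
    """Start with all material on the central site and diffuse deterministically."""
    conc = [0.0] * n_sites
    conc[n_sites // 2] = float(n_particles)
    for _ in range(steps):
        conc = field_step(conc, d)
    return conc


def run_particles(n_sites, n_particles, steps, d, seed):
    """Same initial condition, but tracked as discrete particles; returns site counts."""
    rng = random.Random(seed)
    positions = [n_sites // 2] * n_particles
    for _ in range(steps):
        positions = particle_step(positions, n_sites, d, rng)
    counts = [0] * n_sites
    for x in positions:
        counts[x] += 1
    return counts


if __name__ == "__main__":
    print("field    :", [round(c, 1) for c in run_field(11, 100, 20, 0.25)])
    print("particles:", run_particles(11, 100, 20, 0.25, seed=1))
    print("particles:", run_particles(11, 100, 20, 0.25, seed=2))
```

Running the field version twice gives the same smooth profile, while the particle version gives integer counts that scatter around that profile and change with the seed; this scatter is the noise referred to above.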

      (12)  Lines 126-128: The second part of the sentence uses the protein structure of Pyrococcus furiosus MinD (Ref 37) to support a protein sequence comparison between Ec and Bs MinD. However, the structure of the dimeric E. coli MinD-ATP complex (3Q9L) is available, which is Reference 38 that is more suited for direct comparison.

      To discuss monomeric MinD from P. furiosus, it will be useful to include it in the primary sequence alignment in Figure S1.

      We do not think that this detailed information is necessary to add to Figure S1, since the mutants have been described before (appropriate citations present in the text).

      (13) Lines 127, 166: Where Figure S1 is discussed, a structural model of MinD will be useful alongside with the primary sequence alignment.

      We do not think that this detailed information is necessary to understand the experiments since the mutants have been described before.

      (14) Lines 131-132: Reference is missing for the sentence of " the conserved..."; Reference 38.  In Reference 38, there is no experimental evidence on G12 but inferred from structure analysis. Reference 26 discusses ATP and MinE regulation on the interactions between MinD and phospholipid bilayers; not about MinD dimerization.

      We have corrected this and added the proper references. 

      For easy reading, the mutant MinD phenotypes can be indicated here instead of in the figure legends, including K16A (apo monomer), MinD G12V (ATP-bound monomer), and MinD D40A (ATP-bound dimer, ATP hydrolysis deficient).  

      We have added the suggested descriptions of the mutants in the main text.

      (15) Lines 150-151: Unlike Ec MinD, which forms a clear gradient in one half of the cell, Bs MinD (wild type) mainly accumulates at the hemispheric poles. What percentage of a cell (or cell length) can be covered by the Bs MinD gradient? How does the shaded area in the longitudinal FIP compare to the area of the bacterial hemispherical pole? If possible, it might be interesting to compare with the range of nucleoid occlusion mechanisms that occur.

      Part of the MinD gradient covers the nucleoid area, since the fluorescence signal is still visible along the cell lengths, yet there is no sudden drop in fluorescence, suggesting that nucleoid exclusion does not play a role.

      (16)  Line 160: In addition to summarizing the membrane-binding affinity, descriptions of the differences in the gradient distribution or formation will be useful.  

      We have done this in lines 155-156 of the original manuscript: “The monomeric ATP binding G12V variant shows the same absence of a protein gradient as the K16A variant”.

      (17) Line 262: 'distribution' is not shown.  

      We do not understand this remark. This information is shown in Fig. 5B (now Fig. 6B).

      (18)  Line 287: Wrong citation for reference 31.

      Reference has been corrected.

      (19)  Line 288 and lines 596 regarding the Monte Carlo simulation:

      (a)  An illustration showing the reaction steps for MinD gradient formation will help understand the rationale and assumptions behind this simulation.

      We have added an illustration depicting the different modelling steps in the new Fig. 8.

      (b)  Equations are missing.

      (c)   A table summarizing the parameters used in the simulation and their values.

      (d)  For general readers, it will be helpful to convert the simulation units to real units.

      (e)  Indicate real experimental data with a citation or the reason for any speculative value.

      The Methods section provides a discussion of all parameters used in the potentials on which our kinetic Monte-Carlo algorithm is based. We have now also provided a Table in the SI (Table S1) with typical parameter values in both simulation units and real units. The experimental data and reasoning behind the values chosen are discussed in the Methods section (see “Kinetic Monte Carlo simulation”).

      (20)  Lines 320-321: Reference missing.

      The interaction between MinJ and the dimer form of MinD is based on our findings shown in the original Fig. S4, and this information has not been published before. We have rephrased the sentence to make it more clear. Of note, Fig. S4 has been moved to the main manuscript, at the request of reviewer #2, and is now new Fig. 2. 

      (21)  Lines 355-359: Is the statement specifically made for the Bs Min system? Is there any reference for the statement? Isn't the differences in diffusion rates between molecules 'at different locations' in the system more important than reducing their diffusion rates alone? It is unclear about the meaning of the statement "the Min system uses attachment to the membrane to slow down diffusion". Is this an assumption in the simulation?

      The statement is generic, however the reviewer has a good point and we have made this statement more clear by changing “considerably reduced diffusion rate” to “locally reduced diffusion rate” (line 359).

      (22) Line 403: Citation format.

      We have corrected the text and citation.

      (23) Lines 442-444: The parameters are not defined anywhere in the manuscript.

      Discussed in the M&M and in the new Table S1.

      (24) Lines 464-465: Regarding the final sentence, what does 'this prediction' refer to? Hasn't the author started with experimental observations, predicted possible factors of membrane affinity and diffusion rates, and used the simulation approach to disapprove or support the prediction?

      We have changed “prediction” to “suggestion”, to make it clear that it is related to the suggestion in the previous sentence that  “our modelling suggests that stimulation of MinD-dimerization at cell poles and cell division sites is needed.” (line 471).

      (25) Materials and Methods: Statistical methods for data analyses are missing.

      Added to “Microscopy” section.

      (26) References: References 34, 40, 51 are incomplete.

      References 34 and 40 have been corrected. Reference 51 is a book.

      (27)  Figures: The legends (Figures 1-7) can be shortened by removing redundant details in Material and Methods. Make sure statistical information is provided. The specific mutant MinD states, including Apo monomer, ATP-bound dimer, ATP hydrolysis deficient, and non-membrane binding etc can be specified in the main text. They are repeated in the legends of Figures 1 and 2.

      We have removed redundant details from the legends and provided statistical information.

      (28)  Supporting information:

      Table S1: Content of the acknowledgment statement may be moved to materials and methods and the acknowledgment section. Make sure statistical information is provided in the supporting figure legends.

      We are not sure what the reviewer means by the acknowledgement content in Table S1 (now Table S2). Statistical information has been added.

      Figure S1. Adding a MinD structure model will be useful.

      We do not think that a structural model will clarify our results, since our work is not focused on structural mutagenesis. The mutants that we use have been described in other papers that we have cited.

      Reviewer #2 (Recommendations for the authors):  

      The authors should cite and relate their data to the preprint by Feddersen & Bramkamp, BioRxiv 2024. ATPase activity of B. subtilis MinD is activated solely by membrane binding.

      We have now discussed this paper in relation to our data in lines 407-409. 

      I am not convinced the authors are able to make the statement in lines 160-161 based on their assay: "This confirmed that the different monomeric mutants have the same membrane affinity as wild-type MinD". It is unclear if measuring valley-to-peak ratios in their longitudinal profiles can resolve small differences in membrane affinity. Wildtype MinD should at least be dimeric, or (as the authors also note elsewhere) may even be present in higher-order structures and as such have a higher membrane affinity than a monomeric MinD mutant. The authors should rephrase the corresponding sections in the manuscript to state that the MinD monomer already has detectable membrane affinity, instead of stating that the monomer and dimer membrane affinity are the same.

      We agree that “the same affinity” is too strongly worded, and we have now rephrased this to say that the different monomeric mutants have a membrane affinity comparable to that of wild-type MinD (line 152).

      According to the authors' analysis, MinD-NS4B would not bind to the membrane as it has a valley-to-peak ratio higher than 1, similar to the soluble GFP. However, the protein is clearly forming a gradient, and as such probably binding to the membrane. The authors should discuss this as a limitation of their membrane binding measure.

      The ratio value of 1 is not a cutoff for membrane binding. As shown in Fig. 1F, GFP has a valley-to-peak ratio close to 1.25, whereas the FM5-95 membrane dye has a ratio close to 0.75. In Fig. 3C (now Fig. 4C) we have shown that GFP fused with the NS4B membrane anchor has a lower ratio than free GFP, and we have shown the same in Fig. 4D (now Fig. 5D) for GFP-MinD-NS4B. The differences are small but clear, and the ratios are not similar to that of GFP.
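
The reading of these ratios can be illustrated with a small sketch. It rests on an assumption that is not spelled out in this response: that the ratio is the intensity at the cell centre (the "valley") divided by the mean intensity at the two membrane positions (the "peaks") of a transverse profile. The function name, the index arguments, and the toy profiles are hypothetical.

```python
def valley_to_peak_ratio(profile, edge_left, edge_right):
    """Centre intensity divided by the mean intensity at the two membrane edges.

    profile: transverse fluorescence intensities across the cell width.
    edge_left, edge_right: indices of the membrane positions (assumed known).
    """
    center = (edge_left + edge_right) // 2
    peak = (profile[edge_left] + profile[edge_right]) / 2
    return profile[center] / peak


# Toy profiles: a membrane-bound signal is brightest at the edges,
# a cytosolic signal is brightest in the middle of the cell.
membrane_like = [0, 4, 3, 4, 0]
cytosolic_like = [0, 4, 5, 4, 0]

print(valley_to_peak_ratio(membrane_like, 1, 3))   # below 1
print(valley_to_peak_ratio(cytosolic_like, 1, 3))  # above 1
```

Under this assumed definition, a value above 1 for a cytosolic signal and below 1 for a membrane dye are the two reference points, and intermediate values are compared against free GFP rather than against a fixed cutoff of 1.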

      The observation that MinD dimers are localized by MinJ is interesting and key to the rule of the Monte Carlo simulation that dimers attach to MinJ. However, the data is hidden in the supplementary information and is not analysed as comprehensively, e.g., it lacks the analysis of the membrane binding. The paper would benefit from moving the fluorescence images and accompanying analysis into the main text.  

      We have moved this figure to the main text and added an analysis of the fluorescence intensities (new Fig. 2).

      The authors should show the data for cell length and minicell formation, not only for the MinD-amphipathic helix versions (Fig. 5), but also for the GFP-MinD, and all the MinD mutants. They do refer to some of this data in lines 145-148 but do not show it anywhere. They also refer to "did not result in cell filamentation" in line 213 and to "resulted in highly filamentous cells" and "Introduction of a minC deletion restored cell division" in lines 167-160 without showing the cell length and minicell data, but instead refer to the fluorescence image of the respective strain. I would suggest the authors include this data either in a subpanel in the respective figure or in the supplementary information.

      The effect of uncontrolled MinC activity is very apparent and leads to long filamentous cells. The occurrence of minicells is also apparent. The cell length distribution of wild-type cells is shown in Fig. 6B, and minicell formation is negligible in wild-type cells.

      The transverse fluorescence intensity profiles used as a measure for membrane binding are an average profile from ~30 cells. In the case of the longitudinal profiles that display the gradient, only individual profiles are displayed. I understand that because of distinct cell length, the longitudinal profiles cannot simply be averaged. However, it is possible to project the profiles onto a unit length for averaging (see for example the projection of profiles in McNamara. et al., BioRxiv (2023)). It would be more convincing to average these profiles, which would allow the authors to also quantify the gradients in more detail. If that is impossible, the authors may at least quantify individual valley-to-peak ratios of the longitudinal fluorescence profiles as a measure of the gradient.

      We agree that in future work it would be better to average the profiles as suggested. However, due to limited time and resources, we cannot do this for the current manuscript.
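
For reference, the unit-length projection suggested by the reviewer can be sketched as follows. This is only an illustration under stated assumptions, not an analysis performed for this manuscript: each longitudinal profile is assumed to be a list of intensities sampled evenly from pole to pole, and the function names and the choice of linear interpolation are hypothetical.

```python
def resample_to_unit_length(profile, n_points=50):
    """Linearly interpolate a profile onto n_points evenly spaced positions in [0, 1].

    profile: intensities sampled evenly along the cell axis (at least 2 values).
    n_points: number of positions on the normalized axis (at least 2).
    """
    m = len(profile)
    if m < 2 or n_points < 2:
        raise ValueError("need at least two samples and two output points")
    out = []
    for k in range(n_points):
        pos = k * (m - 1) / (n_points - 1)  # fractional index into the profile
        i = min(int(pos), m - 2)            # left neighbour, clamped at the end
        frac = pos - i
        out.append(profile[i] * (1 - frac) + profile[i + 1] * frac)
    return out


def average_profiles(profiles, n_points=50):
    """Average profiles of different lengths after projecting each onto [0, 1]."""
    resampled = [resample_to_unit_length(p, n_points) for p in profiles]
    return [sum(col) / len(resampled) for col in zip(*resampled)]


# Two toy cells of different length, each with a linear gradient.
short_cell = [0, 1, 2]
long_cell = [0, 2, 4, 6, 8]
print(average_profiles([short_cell, long_cell], n_points=5))
```

Because every cell is mapped onto the same normalized axis, cells of different length contribute equally to each position of the averaged profile.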

      Regarding the rules and parameters used for the Monte Carlo simulation (see also the corresponding sections in the public review):

      (1) The authors mention that they have not included multimerization of MinD in their simulation but argue in the discussion that it would only strengthen the differences in the diffusion between monomers and multimers. This is correct, but it may also change the membrane residence time and membrane affinity drastically.

      Simulation of multimerization is difficult, but we have now included a simulation whereby MinD dimers can also form tetramers (lines 341-348), shown in the new Fig. 8K. This did not alter the MinD gradient much. 

      (2) The authors implement a dimer-to-monomer transition rate that they equate with the stochastic ATP hydrolysis rate occurring with a half-life of approximately 1/s (line 305). They claim that this rate is based on information from E. coli and cite Huang and Wingreen. However, the Huang paper only mentions the nucleotide exchange rate from ADP to ATP at 1/s. Later that paper cites their use of an ATP hydrolysis rate of 0.7/s to match the E. coli MinDE oscillation rate of 40s. From the authors' statement, it is unclear to me whether they refer to the actual ATP hydrolysis rate in Huang and Wingreen or something else. For E. coli MinD, both the membrane and MinE stimulate ATPase activity. Even if B. subtilis lacks MinE, ATP hydrolysis may still be stimulated by the membrane, which has also been reported in another preprint (Feddersen & Bramkamp, BioRxiv 2024). It may also be stimulated by other components of the Min system like MinJ. The authors should include in the manuscript the Monte Carlo simulation implementing dimer to monomer transition on the membrane only, which is currently referred to only as "(data not shown)". 

      The exact value of the ATP hydrolysis rate is not so important here, so 1/s only gives the order of magnitude (in line with 0.7/s above), which we have now clarified in lines 631-632. We have now also added the “(data not shown)” results to Fig. 8, i.e. simulations where dimer-to-monomer transitions (i.e. ATPase activity) only occur at the membrane (Fig. 8D & E, and lines 319-322).

      (3) How long did the authors simulate for? How many steps? What timesteps does the average pictured in Figure 7 correspond to?

      We simulated 10^7 time steps (corresponding to 100 s in real time). We have checked that the simulation steps over which we average are in steady state. Typical snapshots are recorded after 10^6-10^7 time steps, when the system is in steady state. We have added this information in lines 299-300.
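
The numbers quoted above fix the time step, and together with the order-of-magnitude rate of 1/s discussed earlier they give a feel for how rare a dimer-to-monomer event is per step. The sketch below is a back-of-the-envelope illustration only; treating the transition as a first-order process with a constant rate is an assumption, and the function names are hypothetical.

```python
import math


def time_step_seconds(total_time_s, n_steps):
    """Real time represented by one simulation step."""
    return total_time_s / n_steps


def per_step_probability(rate_per_s, dt_s):
    """Probability that a first-order event with the given rate occurs within one step."""
    return -math.expm1(-rate_per_s * dt_s)  # equals 1 - exp(-k*dt), computed stably


def half_life_seconds(rate_per_s):
    """Half-life of a first-order process with the given rate."""
    return math.log(2) / rate_per_s


dt = time_step_seconds(100.0, 10**7)   # 10^7 steps span 100 s
p = per_step_probability(1.0, dt)      # rate of order 1/s (assumed first order)
print(f"time step: {dt:.1e} s")
print(f"per-step transition probability: {p:.2e}")
print(f"half-life at 1/s: {half_life_seconds(1.0):.2f} s")
```

With these values one step corresponds to 10 microseconds, so a rate of order 1/s translates into a per-step probability of roughly 10^-5, which is why changing the rate from 1/s to 0.7/s does not alter the order of magnitude.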

      There are several misconceptions about the (oscillating E. coli) Min system in the main text:

      (1) Lines 77-78: "In case of the E. coli MinD, ATP binding leads to dimerization of MinD, which induces a conformational change in the C-terminal region, thereby exposing an amphiathic helix that functions as a membrane binding domain" and "This shows a clear difference with the E. coli situation, where dimerization of MinD causes a conformational change of the C-terminal region enabling the amphipathic helix to insert into the lipid bilayer" in lines 400-403 are incorrect. There is no evidence that the amphipathic helix at the C-terminus of MinD changes conformation upon ATP binding; several studies have shown instead that a single copy of the amphipathic helix is too weak to confer efficient membrane binding but that the dimerization confers increased membrane binding as now two amphipathic helices are present leading to an avidity effect in membrane binding. Please refer to the following papers (Szeto et al., JBC (2003); Wu et al., Mol Microbiol (2011); Park et al., Cell (2011); Heermann et al., JMB (2020); Loose et al., Nat Struct Mol Biol (2011); Kretschmer et al., ACS Syn Biol (2021); Ramm et al., Nat Commun (2018) or for a better overview the following reviews on the topic of the E. coli Min system Wettmann and Kruse, Philos Trans R Soc B Biol (2018), Ramm et al., Cell and Mol Life Sci (2019); Halatek et al., Philos Trans R SocB Biol Sci (2018).

      This is indeed incorrectly formulated, and we have now amended this in lines 64-66 and lines 403-406. Key papers are cited in the text.

      (2) The authors mention that E. coli MinD may multimerize, citing a study where purified MinD was found to polymerize, and then suggest that this is unlikely to be the case in B. subtilis as FRAP recovery of MinD is quick. However, cooperativity in membrane binding is essential to the mathematical models reproducing E. coli Min oscillations, and there is more recent experimental evidence that E. coli MinD forms smaller oligomers that differ in their membrane residence time and diffusion (e.g., Heermann et al., Nat Methods (2023); Heermann et al., JMB (2020);) I would suggest the authors revise the corresponding text sections and test the multimerization in their simulation (see above).

      As mentioned above, simulating oligomerization is difficult, but in order to approximate related cooperative effects, we have simulated a situation whereby MinD dimers can form tetramers. This simulation did not show a large change in MinD gradient formation. We have added the result of this simulation to Fig. 8 (Fig. 8K), and discuss this further in lines 341-348 and 459-467.

      (3) Lines 75-76 and lines 79-80: The sentences "MinC ... and needs to bind to the Walker A-type ATPase MinD for its activity" and "The MinD dimer recruits MinC ... and stimulates its activity" are misleading. MinC is localized by MinD, but MinD does not alter MinC activity, as MinC mislocalization or overexpression also prevents FtsZ ring formation leading to minicell or filamentous cells, as also later described by the authors (line 98). There is also no biochemical evidence that the presence of MinD somehow alters MinC activity towards FtsZ other than a local enrichment on the membrane. I would rephrase the sentence to emphasize that MinD is only localizing MinC but does not alter its activity.   

      We have rephrased this sentence to prevent misinterpretation (lines 66-67).

      Minor points:  

      (1)  I am not quite sure what the experiment with the CCCP shows. The authors explain that MinD binding via the amphipathic helix requires the presence of membrane potential and that the addition of CCCP disturbs binding. They then show that the MinD with two amphipathic helices is not affected by CCCP but the wildtype MinD is. What is the conclusion of this experiment? Would that mean that the MinD with two amphipathic helices binds more strongly, very differently, perhaps non-physiologically?  

      This experiment was “To confirm that the tandem amphipathic helix increased the membrane affinity of MinD”, as mentioned in the beginning of the paragraph (line 224).  

      (2) Lines 456-457: Please cite the FRAP experiment that shows a quick recovery rate of MinD.

      Reference has been added. 

      (3) Figure 4D: It is unclear to me to which condition the p-value brackets point.

      This is related to a statistical t-test. We have added this information to the legend of the figure.

      (4) Line 111, "in the membrane affinity of the MinD". I think that the "the" before MinD should be removed.  

      Corrected

      (5) Typo in line 199 "indicting" instead of indicating.

      Corrected

      (6) Typo in line 220 "reversable" instead of reversible.

      Corrected

      (7) Lines 279, 284, 905: "Monte-Carlo" should read Monte Carlo.

      Corrected

      Reviewer #3 (Recommendations for the authors):  

      Introduction: As written, the introduction does not provide sufficient background for the uninitiated reader to understand the function of the MinCD complex in the context of assembly and activation of cell division in B. subtilis. The introduction is also quite long and would benefit from condensing the description of the Min oscillation mechanism in E. coli to one or two sentences. While highlighting the role of MinE in this system is important for understanding how it works, it is only needed as a counterpoint to the situation in B. subtilis.

      Since the Min system of E. coli is by far the best understood Min system, we feel that it is important to provide detailed information on this system. However, we have added an introductory sentence to explain the key function of the Min system (lines 46-48).

      Line 248: Increasing MinD membrane affinity increases the frequency of minicells - however it is unclear if cells are dividing too much or if it is just a Min mutant (i.e. occasionally dividing at the cell pole vs the middle)? Cell length measurements should be included to clarify this point (Figures 4 and 5).

      This information is presented in Fig. 5B (Cell length distribution), which is now Fig. 6B, indicating that the average cell length increases in the tandem alpha helix mutant, a phenotype that is comparable to a MinD knockout. 

      Figure 5: I am a bit confused as to whether increasing MinD affinity doesn't lead to a general block in division by MinCD rather than phenocopying a minD null mutant.

      Although the tandem alpha helix mutant has a cell length distribution comparable to a minD knockout, the tandem mutant produces far fewer minicells than the minD knockout, indicating that there is still some regulation of cell division.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This is an excellent study by a superb investigator who discovered and is championing the field of migrasomes. This study contains a hidden "gem" - the induction of migrasomes by hypotonicity and how that happens. In summary, an outstanding fundamental phenomenon (migrasomes) en route to becoming translationally highly significant.

      Strengths:

      Innovative approach at several levels. Migrasomes - discovered by Dr Yu's group - are an outstanding biological phenomenon of fundamental interest and now of potentially practical value.

      Weaknesses:

      I feel that the overemphasis on practical aspects (vaccine), however important, eclipses some of the fundamental aspects that may be just as important and actually more interesting. If this can be expanded, the study would be outstanding.

      We sincerely thank the reviewer for the encouraging and insightful comments. We fully agree that the fundamental aspects of migrasome biology are of great importance and deserve deeper exploration.

      In line with the reviewer’s suggestion, we have expanded our discussion on the basic biology of engineered migrasomes (eMigs). A recent study by the Okochi group at the Tokyo Institute of Technology demonstrated that hypoosmotic stress induces the formation of migrasome-like vesicles, involving cytoplasmic influx and requiring cholesterol for their formation (DOI: 10.1002/1873-3468.14816, February 2024). Building on this, our study provides a detailed characterization of hypoosmotic stress-induced eMig formation, and further compares the biophysical properties of natural migrasomes and eMigs. Notably, the inherent stability of eMigs makes them particularly promising as a vaccine platform.

      Finally, we would like to note that our laboratory continues to investigate multiple aspects of migrasome biology. In collaboration with our colleagues, we recently completed a study elucidating the mechanical forces involved in migrasome formation (DOI: 10.1016/j.bpj.2024.12.029), which further complements the findings presented here.

      Reviewer #2 (Public review):

      Summary:

      The authors' report describes a novel vaccine platform derived from a newly discovered organelle called a migrasome. First, the authors address a technical hurdle in using migrasomes as a vaccine platform. Natural migrasome formation occurs at low levels and is labor intensive; however, by understanding the molecular underpinning of migrasome formation, the authors have designed a method to make engineered migrasomes from cultured cells at higher yields utilizing a robust process. These engineered migrasomes behave like natural migrasomes. Next, the authors immunized mice with migrasomes that either expressed a model peptide or the SARS-CoV-2 spike protein. Antibodies against the spike protein were raised that could be boosted by a 2nd vaccination and these antibodies were functional as assessed by an in vitro pseudoviral assay. This new vaccine platform has the potential to overcome obstacles such as cold chain issues for vaccines like messenger RNA that require very stringent storage conditions.

      Strengths:

      The authors present very robust studies detailing the biology behind migrasome formation and this fundamental understanding was used to form engineered migrasomes, which makes it possible to utilize migrasomes as a vaccine platform. The characterization of engineered migrasomes is thorough and establishes comparability with naturally occurring migrasomes. The biophysical characterization of the migrasomes is well done including thermal stability and characterization of the particle size (important characterizations for a good vaccine).

      Weaknesses:

      With a new vaccine platform technology, it would be nice to compare them head-to-head against a proven technology. The authors would improve the manuscript if they made some comparisons to other vaccine platforms such as a SARS-CoV-2 mRNA vaccine or even an adjuvanted recombinant spike protein. This would demonstrate a migrasome-based vaccine could elicit responses comparable to a proven vaccine technology.

      We thank the reviewer for the thoughtful evaluation and constructive suggestions, which have helped us strengthen the manuscript. 

      Comparison with proven vaccine technologies:

      In response to the reviewer’s comment, we now include a direct comparison of the antibody responses elicited by eMig-Spike and a conventional recombinant S1 protein vaccine formulated with Alum. As shown in the revised manuscript (Author response image 1), the levels of S1-specific IgG induced by the eMig-based platform were comparable to those induced by the S1+Alum formulation. This comparison supports the potential of eMigs as a competitive alternative to established vaccine platforms. 

      Author response image 1.

      eMigrasome-based vaccination showed similar efficacy compared with adjuvanted recombinant spike protein. The amount of S1-specific IgG in mouse serum was quantified by ELISA on day 14 after immunization. Mice were either intraperitoneally (i.p.) immunized with recombinant Alum/S1 or intravenously (i.v.) immunized with eM-NC, eM-S or recombinant S1. The administered doses were 20 µg/mouse for eMigrasomes, 10 µg/mouse (i.v.) or 50 µg/mouse (i.p.) for recombinant S1 and 50 µl/mouse for Aluminium adjuvant.

      Assessment of antigen integrity on migrasomes:

      To address the reviewer’s suggestion regarding antigen integrity, we performed immunoblotting using antibodies against both S1 and mCherry. Two distinct bands were observed: one at the expected molecular weight of the S-mCherry fusion protein, and a higher molecular weight band that may represent oligomerized or higher-order forms of the Spike protein (Figure 5b in the revised manuscript).

      Furthermore, we performed confocal microscopy using a monoclonal antibody against Spike (anti-S). Co-localization analysis revealed strong overlap between the mCherry fluorescence and anti-Spike staining, confirming the proper presentation and surface localization of intact S-mCherry fusion protein on eMigs (Figure 5c in the revised manuscript). These results confirm the structural integrity and antigenic fidelity of the Spike protein expressed on eMigs.

      Recommendations for the authors

      Reviewer #1 (Recommendations For The Authors):

      I feel that the overemphasis on practical aspects (vaccine), however important, eclipses some of the fundamental aspects that may be just as important and actually more interesting. If this can be expanded, the study would be outstanding.

      I know that the reviewers always ask for more, and this is not the case here. Can the abstract and title be changed to emphasize the science behind migrasome formation, and possibly add a few more fundamental aspects on how hypotonic shock induces migrasomes?

      Alternatively, if the authors desire to maintain the emphasis on vaccines, can immunological mechanisms be somewhat expanded in order to - at least to some extent - explain why migrasomes are a better vaccine vehicle?

      One way or another, this reviewer is highly supportive of this study and it is really up to the authors and the editor to decide whether my comments are of use or not.

      My recommendation is to go ahead with publishing after some adjustments as per above.

      We’d like to thank the reviewer for the suggestion. We have changed the title of the manuscript and modified the abstract, emphasizing the fundamental science behind the development of eMigrasomes. To gain some immunological information on eMig-elicited antibody responses, we characterized the types of IgG induced by eM-OVA in mice and compared them with those induced by Alum/OVA. The IgG response to Alum/OVA was dominated by IgG1. In contrast, eM-OVA induced an even distribution of IgG subtypes, including IgG1, IgG2b, IgG2c, and IgG3 (Figure 4i in the revised manuscript). The ratio between IgG1 and IgG2a/c indicates a Th1 or Th2 type humoral immune response. Thus, eM-OVA immunization induces a balance of Th1/Th2 immune responses.

      Reviewer #2 (Recommendations For The Authors):

      The study is a very nice exploration of a new vaccine platform. This reviewer believes that a more head-to-head comparison to the current SARS-CoV-2 vaccine platform would improve the manuscript. This comparison is done with OVA antigen, but this model antigen is not as exciting as a functional head-to-head with a SARS-CoV-2 vaccine.

      I think that two other discussion points should be included in the manuscript. First, was the host-cell protein evaluated? If not, I would include that point on how issues of host cell contamination of the migrasome could play a role in the responses and safety of a vaccine. Second, I would discuss antigen incorporation and localization into the platform. For example, the full-length spike being expressed has a native signal peptide and transmembrane domain. The authors point out that a transmembrane domain can be added to display an antigen that does not have one natively expressed, however, without a signal peptide this would not be secreted and localized properly. I would suggest adding a discussion of how a non-native signal peptide would be necessary in addition to a transmembrane domain.

      We thank the reviewer for these thoughtful suggestions and fully agree that the points raised are important for the translational development of eMig-based vaccines.

      (1) Host cell proteins and potential immunogenicity:

      We appreciate the reviewer’s suggestion to consider host cell protein contamination. Considering the potential clinical application of eMigrasomes in the future, we will use human cells with low immunogenicity, such as HEK-293 or embryonic stem cells (ESCs), to generate eMigrasomes. We will also follow quality-control (QC) procedures that meet the standards of validated EV-based vaccination techniques.

      (2) Antigen incorporation and localization—signal peptide and transmembrane domain:

      We also agree with the reviewer’s point that proper surface display of antigens on eMigs requires both a transmembrane domain and a signal peptide for correct trafficking and membrane anchoring. For instance, in the case of full-length Spike protein, the native signal peptide and transmembrane domain ensure proper localization to the plasma membrane and subsequent incorporation into eMigs. In the case of OVA, a secretory protein that contains a native signal peptide yet lacks a transmembrane domain, an engineered transmembrane domain is required. For antigens that do not naturally contain these features, both a non-native signal peptide and an artificial transmembrane domain are necessary. We have clarified this point in the revised discussion and explicitly noted the requirement for a signal peptide when engineering antigens for surface display on migrasomes.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary

      This work performed Raman spectral microscopy at the single-cell level for 15 different culture conditions in E. coli. The Raman signature is systematically analyzed and compared with the proteome dataset of the same culture conditions. With a linear model, the authors revealed correspondence between Raman pattern and proteome expression stoichiometry indicating that spectrometry could be used for inferring proteome composition in the future. With both Raman spectra and proteome datasets, the authors categorized co-expressed genes and illustrated how proteome stoichiometry is regulated among different culture conditions. Co-expressed gene clusters were investigated and identified as homeostasis core, carbon-source dependent, and stationary phase-dependent genes. Overall, the authors demonstrate a strong and solid data analysis scheme for the joint analysis of Raman and proteome datasets.

      Strengths and major contributions

      (1) Experimentally, the authors contributed Raman datasets of E. coli with various growth conditions.

      (2) In data analysis, the authors developed a scheme to compare proteome and Raman datasets. Protein co-expression clusters were identified, and their biological meaning was investigated.

      Weaknesses

      The experimental measurements of Raman microscopy were conducted at the single-cell level; however, the analysis was performed by averaging across the cells. The authors did not discuss whether Raman microscopy can be used to detect cell-to-cell variability under the same condition.

      We thank the reviewer for raising this important point. Though this topic is beyond the scope of our study, some of our authors have addressed the application of single-cell Raman spectroscopy to characterizing phenotypic heterogeneity in individual Staphylococcus aureus cells in another paper (Kamei et al., bioRxiv, doi: 10.1101/2024.05.12.593718). Additionally, one of our authors demonstrated that single-cell RNA sequencing profiles can be inferred from Raman images of mouse cells (Kobayashi-Kirschvink et al., Nat. Biotechnol. 42, 1726–1734, 2024). Therefore, detecting cell-to-cell variability under the same conditions has been shown to be feasible. Whether averaging single-cell Raman spectra is necessary depends on the type of analysis and the available dataset. We will discuss this in more detail in our response to Comment (1) by Reviewer #1 (Recommendation for the authors).

      Discussion and impact on the field

      Raman signature contains both proteomic and metabolomic information and is an orthogonal method to infer the composition of biomolecules. It has the advantage that single-cell level data could be acquired and both in vivo and in vitro data can be compared. This work is a strong initiative for introducing the powerful technique to systems biology and providing a rigorous pipeline for future data analysis.

      Reviewer #2 (Public review):

      Summary and strengths:

      Kamei et al. observe the Raman spectra of a population of single E. coli cells in diverse growth conditions. Using LDA, Raman spectra for the different growth conditions are separated. Using previously available protein abundance data for these conditions, a linear mapping from Raman spectra in LDA space to protein abundance is derived. Notably, this linear map is condition-independent and is consequently shown to be predictive for held-out growth conditions. This is a significant result and, in my understanding, extends the Raman-to-RNA connection that has been reported earlier.

      They further show that this linear map reveals something akin to bacterial growth laws (à la Scott/Hwa): a certain collection of proteins shows stoichiometric conservation, i.e. the group (called SCG - stoichiometrically conserved group) maintains its stoichiometry across conditions while the overall scale depends on the conditions. Analyzing the changes in protein mass and Raman spectra under these conditions, the abundance ratios of information processing proteins (one of the large groups where many proteins belong to "information and storage" - ISP that is also identified as a cluster of orthologous proteins) remain constant. The mass of these proteins, deemed the homeostatic core, increases linearly with growth rate. Other SCGs and other proteins are condition-specific.

      Notably, beyond the ISP COG the other SCGs were identified directly using the proteome data. Taking the analysis further, they then show how the centrality of a protein - roughly measured as how many proteins it is stoichiometric with - relates to function and evolutionary conservation. Again significant results, but I am not sure if these ideas have been reported earlier, for example from the community that built protein-protein interaction maps.

      As pointed out, past studies have revealed that the function, essentiality, and evolutionary conservation of genes are linked to the topology of gene networks, including protein-protein interaction networks. However, to the best of our knowledge, their linkage to stoichiometry conservation centrality of each gene has not yet been established.

      Previously analyzed networks, such as protein-protein interaction networks, depend on known interactions. Therefore, as our understanding of the molecular interactions evolves with new findings, the conclusions may change. Furthermore, analysis of a particular interaction network cannot account for effects from different types of interactions or multilayered regulations affecting each protein species.

      In contrast, the stoichiometry conservation network in this study focuses solely on expression patterns as the net result of interactions and regulations among all types of molecules in cells. Consequently, the stoichiometry conservation networks are not affected by the detailed knowledge of molecular interactions and naturally reflect the global effects of multilayered interactions. Additionally, stoichiometry conservation networks can easily be obtained for non-model organisms, for which detailed molecular interaction information is usually unavailable. Therefore, analysis with the stoichiometry conservation network has several advantages over existing methods from both biological and technical perspectives.

      We added a paragraph explaining this important point to the Discussion section, along with additional literature.

      Finally, the paper built a lot of "machinery" to connect Ω<sub>LE</sub>, built directly from proteome, and Ω<sub>B</sub>, built from Raman, spaces. I am unsure how that helps and have not been able to digest the 50 or so pages devoted to this.

      The mathematical analyses in the supplementary materials form the basis of the argument in the main text. Without the rigorous mathematical discussions, Fig. 6E — one of the main conclusions of this study — and Fig. 7 could never be obtained. Therefore, we believe the analyses are essential to this study. However, we clarified why each analysis is necessary and significant in the corresponding sections of the Results to improve the manuscript's readability.

      Please see our responses to comments (2) and (7) by Reviewer #1 (Recommendations for the authors) and comments (5) and (6) by Reviewer #2 (Recommendations for the authors).

      Strengths:

      The rigorous analysis of the data is the real strength of the paper. Alongside this, the discovery of SCGs that are condition-independent and that are condition-dependent provides a great framework.

      Weaknesses:

      Overall, I think it is an exciting advance but some work is needed to present the work in a more accessible way.

      We edited the main text to make it more accessible to a broader audience. Please see our responses to comments (2) and (7) by Reviewer #1 (Recommendations for the authors) and comments (5) and (6) by Reviewer #2 (Recommendations for the authors).

      Reviewer #1 (Recommendations for the authors):

      (1) The Raman spectral data is measured from single-cell imaging. In the current work, most of the conclusions are from averaged data. From my understanding, once the correspondence between LDA and proteome data is established (i.e. the matrix B) one could infer the single-cell proteome composition from B. This would provide valuable information on how proteome composition fluctuates at the single-cell level.

      We can calculate single-cell proteomes from single-cell Raman spectra in the manner suggested by the reviewer. However, we cannot evaluate the accuracy of their estimation without single-cell proteome data under the same environmental conditions. Likewise, we cannot verify variations of estimated proteomes of single cells. Since quantitatively accurate single-cell proteome data is unavailable, we concluded that addressing this issue was beyond the scope of this study.

      Nevertheless, we agree with the reviewer that investigating how proteome composition fluctuates at the single-cell level based on single-cell Raman spectra is an intriguing direction for future research. In this regard, some of our authors have studied the phenotypic heterogeneity of Staphylococcus aureus cells using single-cell Raman spectra in another paper (Kamei et al., bioRxiv, doi: 10.1101/2024.05.12.593718), and one of our authors has demonstrated that single-cell RNA sequencing profiles can be inferred from Raman images of mouse cells (Kobayashi-Kirschvink et al., Nat. Biotechnol. 42, 1726–1734, 2024). Therefore, it is highly plausible that single-cell Raman spectroscopy can also characterize proteomic fluctuations in single cells. We have added a paragraph to the Discussion section to highlight this important point.

      (2) The establishment of matrix B is quite confusing for readers who only read the main text. I suggest adding a flow chart in Figure 1 to explain the data analysis pipeline, as well as state explicitly what is the dimension of B, LDA matrix, and proteome matrix.

      We thank the reviewer for the suggestion. Following the reviewer's advice, we have explicitly stated the dimensions of the vectors and matrices in the main text. We have also added descriptions of the dimensions of the constructed spaces. Rather than adding another flow chart to Figure 1, we added a new table (Table 1) to explain the various symbols representing vectors and matrices, thereby improving the accessibility of the explanation.

      (3) One of the main contributions of this work is to demonstrate how proteome stoichiometry is regulated across different conditions. A total of m=15 conditions were tested in this study, and this limits the rank of the LDA matrix to 14. Therefore, maximally 14 "modes" of differential composition in a proteome can be detected.

      As a general reader, I am wondering in the future if one increases or decreases the number of conditions (say m=5 or m=50) what information can be extracted? It is conceivable that increasing different conditions with distinct cellular physiology would be beneficial to "explore" different modes of regulation for cells. As proof of principle, I am wondering if the authors could test a lower number (by sub-sampling from m=15 conditions, e.g. picking five of the most distinct conditions) and see how this would affect the prediction of proteome stoichiometry inference.

      We thank the reviewer for bringing an important point to our attention. To address the issue raised, we conducted a new subsampling analysis (Fig. S14).

      As we described in the main text (Fig. 6E) and the supplementary materials, the m x m orthogonal matrix, Θ, represents to what extent the two spaces Ω<sub>LE</sub> and Ω<sub>B</sub> are similar (m is the number of conditions; in our main analysis, m = 15). Thus, the low-dimensional correspondence between the two spaces connected by an orthogonal transformation, such as an m-dimensional rotation, can be evaluated by examining the elements of the matrix Θ. Specifically, large off-diagonal elements of the matrix Θ mix higher dimensions and lower dimensions, making the two spaces spanned by the first few major axes appear dissimilar. Based on this property, we evaluated the vulnerability of the low-dimensional correspondence between Ω<sub>LE</sub> and Ω<sub>B</sub> to the reduced number of conditions by measuring how close Θ was to the identity matrix when the analysis was performed on the subsampled datasets.

      In the new figure (Fig. S14), we first created all possible smaller condition sets by subsampling the conditions. Next, to evaluate the closeness between the matrix Θ and the identity matrix for each smaller condition set, we generated 10,000 random orthogonal matrices of the same size as Θ. We then evaluated the probability of obtaining a higher level of low-dimensional correspondence than that of the experimental data by chance (see section 1.8 of the Supplementary Materials). This analysis was already performed in the original manuscript for the non-subsampled case (m = 15) in Fig. S9C; the new analysis systematically evaluates the correspondence for the subsampled datasets.

      The results clearly show that low-dimensional correspondence is more likely to be obtained with more conditions (Fig. S14). In particular, when the number of conditions used in the analysis exceeds five, the median of the probability that random orthogonal matrices were closer to the identity matrix than the matrix Θ calculated from subsampled experimental data became lower than 10<sup>-4</sup>. This analysis provides insight into the number of conditions required to find low-dimensional correspondence between Ω<sub>LE</sub> and Ω<sub>B</sub>.
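
      As an illustration only, and not the analysis code used in the study, the randomization test described above could be sketched in Python as follows. The closeness score `block_score` (the squared weight that stays inside the top-left k x k block of Θ) is a hypothetical stand-in for the measure defined in section 1.8 of the Supplementary Materials, and the function names, the parameter `k`, and the seed are likewise assumptions. Random orthogonal matrices are drawn by Gram-Schmidt orthonormalization of Gaussian vectors.

```python
import math
import random


def random_orthogonal(m, rng):
    """Draw an m x m orthogonal matrix by Gram-Schmidt on Gaussian vectors."""
    rows = []
    while len(rows) < m:
        v = [rng.gauss(0.0, 1.0) for _ in range(m)]
        # Remove the components along the rows accepted so far.
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:  # skip a (vanishingly rare) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def block_score(theta, k):
    """Hypothetical closeness score: squared weight in the top-left k x k block.

    The score equals k for the identity matrix and shrinks as off-diagonal
    elements mix the first k axes with the remaining ones.
    """
    return sum(theta[i][j] ** 2 for i in range(k) for j in range(k))


def chance_probability(theta, k, n_random=10000, seed=0):
    """Fraction of random orthogonal matrices that score above the observed theta."""
    rng = random.Random(seed)
    m = len(theta)
    observed = block_score(theta, k)
    hits = sum(
        block_score(random_orthogonal(m, rng), k) > observed
        for _ in range(n_random)
    )
    return hits / n_random
```

      Under these assumptions, a small returned probability means that random orthogonal matrices rarely preserve the first k axes as well as the observed Θ does.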

      Which conditions are used in the analysis can change the low-dimensional structures of Ω<sub>LE</sub> and Ω<sub>B</sub>. Therefore, it is important to clarify whether including more conditions in the analysis reduces the dependence of the low-dimensional structures on conditions. We leave this issue as a subject for future study. This issue relates to the effective dimensionality of omics profiles needed to establish the diverse physiological states of cells across conditions. Determining the minimum number of conditions to attain the condition-independent low-dimensional structures of Ω<sub>LE</sub> and Ω<sub>B</sub> would provide insight into this fundamental problem. Furthermore, such an analysis would identify the range of applications of Raman spectra as a tool for capturing macroscopic properties of cells at the system level.

      We now discuss this point in the Discussion section, referring to this analysis result (Fig. S14). Please also see our reply to the comment (1) by Reviewer #2 (Recommendations for the authors).

      (4) In E. coli cells, total proteome is in mM concentration while the total metabolites are between 10 and 100 mM concentration. Since proteins are large molecules with more functional groups, they may contribute more Raman signal (per molecule) than metabolites. Still, the meaningful quantity here is the "differential Raman signal" with different conditions, not the absolute signal. I am wondering what percentage of the differential Raman signature comes from the proteome and what percentage comes from the metabolome.

      It is an important and interesting question to what extent changes in the proteome and metabolome contribute to changes in Raman spectra. Though we concluded that answering this question is beyond the scope of this study, we believe it is an important topic for future research.

      Raman spectral patterns convey the comprehensive molecular composition spanning the various omics layers of target cells. Changes in the composition of these layers can be highly correlated, and identifying their contributions to changes in Raman spectra would provide insight into the mutual correlation of different omics layers. Addressing the issue raised by the reviewer would expand the applications of Raman spectroscopy and highlight the advantage of cellular Raman spectra as a means of capturing comprehensive multi-omics information.

      We note that some studies have evaluated the contributions of proteins, lipids, nucleic acids, and glycogen to the Raman spectra of mammalian cells and how these contributions change in different states (e.g., Mourant et al., J Biomed Opt, 10(3), 031106, 2005). Additionally, numerous studies have imaged or quantified metabolites in various cell types (see, for example, Cutshaw et al., Chemical Reviews, 123(13), 8297–8346, 2023, for a comprehensive review). Extending these approaches to multiple omics layers in future studies would help resolve the issue raised by the reviewer.

      (5) It is known that E. coli cells in different conditions have different cell sizes, where cell width increases with carbon source quality and growth rate. Is this effect normalized when processing the Raman signal?

      Each spectrum was normalized by subtracting the average and dividing it by the standard deviation. This normalization minimizes the differences in signal intensities due to different cell sizes and densities. This information is shown in the Materials and Methods section of the Supplementary Materials.
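
      A minimal Python sketch of this per-spectrum normalization is given below for illustration. The function name is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text above does not specify which was used.

```python
import statistics


def normalize_spectrum(intensities):
    """Subtract the mean of one spectrum and divide by its standard deviation."""
    mean = statistics.fmean(intensities)
    sd = statistics.pstdev(intensities)  # assumption: population standard deviation
    if sd == 0:
        raise ValueError("a flat spectrum cannot be normalized")
    return [(x - mean) / sd for x in intensities]


# Two made-up spectra that differ only by an overall intensity factor,
# e.g. because one cell contributes more signal than the other.
small_cell = [1.0, 2.0, 3.0, 4.0]
large_cell = [10.0, 20.0, 30.0, 40.0]
normalized_small = normalize_spectrum(small_cell)
normalized_large = normalize_spectrum(large_cell)
```

      Because the normalization removes both a constant offset and an overall scale factor, the two made-up spectra above map to the same normalized spectrum up to floating-point rounding.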

      (6) I have a question about interpretation of the centrality index. A higher centrality indicates the protein expression pattern is more aligned with the "mainstream" of the other proteins in the proteome. However, it is possible that the proteome has multiple "mainstream modes" (with possibly different contributions in magnitudes), and the centrality seems to only capture the "primary mode". A small group of proteins could all have low centrality but have very consistent patterns with high conservation of stoichiometry. I wonder if the authors could discuss and clarify this.

      We thank the reviewer for drawing our attention to the insufficient explanation in the original manuscript. First, we note that stoichiometry conserving protein groups are not limited to those composed of proteins with high stoichiometry conservation centrality. The SCGs 2–5 are composed of proteins that strongly conserve stoichiometry within each group but have low stoichiometry conservation centrality (Fig. 5A, 5K, 5L, and 7A). In other words, our results demonstrate the existence of the "primary mainstream mode" (SCG 1, i.e., the homeostatic core) and condition-specific "non-primary mainstream modes" (SCGs 2–5). These primary and non-primary modes are distinguishable by their position along the axis of stoichiometry conservation centrality (Fig. 5A, 5K, and 5L).

      However, a single one-dimensional axis (centrality) cannot capture all characteristics of stoichiometry-conserving architecture. In our case, the "non-primary mainstream modes" (SCGs 2–5) were distinguished from each other by multiple csLE axes.

      To clarify this point, we modified the first paragraph of the section where we first introduce csLE (Revealing global stoichiometry conservation architecture of the proteomes with csLE). We also added a paragraph to the Discussion section regarding the condition-specific SCGs 2–5.

      (7) Figures 3, 4, and 5A-I are analyses on proteome data and are not related to Raman spectral data. I am wondering if this part of the analysis can be re-organized and not disrupt the mainline of the manuscript.

      We agree that the structure of this manuscript is complicated. Before submitting this manuscript to eLife, we seriously considered reorganizing it. However, we concluded that this structure was most appropriate because our focus on stoichiometry conservation cannot be explained without analyzing the coefficients of the Raman-proteome correspondence using COG classification (see Fig. 3; note that Fig. 3A relates to Raman data). This analysis led us to examine the global stoichiometry conservation architecture of proteomes (Figs. 4 and 5) and discover the unexpected similarity between the low-dimensional structures of Ω<sub>LE</sub> and Ω<sub>B</sub>.

      Therefore, we decided to keep the structure of the manuscript as it is. To partially resolve this issue, however, we added references to Fig. S1, the diagram of this paper’s mainline, to several places in the main text so that readers can more easily grasp the flow of the manuscript.

      (8) Supplementary Equation (2.6) could be wrong. From my understanding of the coordinate transformation definition here, it should be [w1 ... ws] X := RHS terms in big parenthesis.

      We checked the equation and confirmed that it is correct.

      Reviewer #2 (Recommendations for the authors):

      (1) The first main result, the linear map between Raman and proteome linked via B, is intriguing in the sense that the map is condition-independent. A speculative question I have is if this relationship may become more complex or have more condition-dependent corrections as the number of conditions goes up. The 15 or so conditions are great but it is not clear if they are often quite restrictive. For example, they assume an abundance of most other nutrients. Now if you include a growth rate decrease due to nitrogen or other limitations, do you expect this to work?

      In our previous paper (Kobayashi-Kirschvink et al., Cell Systems 7(1): 104–117.e4, 2018), we statistically demonstrated a linear correspondence between cellular Raman spectra and transcriptomes for fission yeast under 10 environmental conditions. These conditions included nutrient-rich and nutrient-limited conditions, such as nitrogen limitation. Since the Raman-transcriptome correspondence was only statistically verified in that study, we analyzed the data from the standpoint of stoichiometry conservation in this study. The results (Fig. S11 and S12) revealed a correspondence in lower dimensions similar to that observed in our main results. In addition, similar correspondences were obtained even for different E. coli strains under common culture conditions (Fig. S11 and S12). Therefore, it is plausible that the stoichiometry-conservation low-dimensional correspondence between Raman and gene expression profiles holds for a wide range of external and internal perturbations.

      We agree with the reviewer that it is important to understand how Raman-omics correspondences change with the number of conditions. To address this issue, we examined how the correspondence between Ω<sub>LE</sub> and Ω<sub>B</sub> changes by subsampling the conditions used in the analysis. We focused on Θ, which was introduced in Fig. 5E, because the closeness of Θ to the identity matrix represents correspondence precision. We found a general trend that the low-dimensional correspondence becomes more precise as the number of conditions increases (Fig. S14). This suggests that increasing the number of conditions generally improves the correspondence rather than disrupting it.

      We added a paragraph to the Discussion section addressing this important point. Please also refer to our response to Comment (3) of Reviewer #1 (Recommendations for the authors).

      (2) A little more explanation in the text for 3C/D would help. I am imagining 3D is the control for 3C. Minor comment - 3B looks identical to S4F but the y-axis label is different.

      We thank the reviewer for pointing out the insufficient explanation of Fig. 3C and 3D in the main text. Following this advice, we added explanations of these plots to the main text. We also added labels ("ISP COG class" and "non-ISP COG class") to the top of these two figures.

      Fig. 3B and S4F are different. For simplicity, we used the Pearson correlation coefficient in Fig. 3B. However, cosine similarity is a more appropriate measure for evaluating the degree of conservation of abundance ratios. Thus, we presented the result using cosine similarity in a supplementary figure (Fig. S4F). Please note that each point in Fig. S4F is calculated between proteome vectors of two conditions. The dimension of each proteome vector is the number of genes in each COG class.
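
      The distinction between the two measures can be illustrated with a short Python sketch. The three-gene abundance vectors below are made up for illustration and are not data from the study, and the helper names are hypothetical. A uniformly scaled proteome keeps every abundance ratio and has a cosine similarity of 1, whereas adding a constant offset changes the ratios and lowers the cosine similarity, even though the Pearson correlation coefficient stays at 1.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two abundance vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_similarity([a - mean_x for a in x], [b - mean_y for b in y])


condition_a = [1.0, 2.0, 4.0]    # hypothetical abundances of three proteins
scaled = [3.0, 6.0, 12.0]        # same abundance ratios, three-fold larger scale
shifted = [11.0, 12.0, 14.0]     # constant offset added, so the ratios change

print(cosine_similarity(condition_a, scaled), pearson(condition_a, scaled))
print(cosine_similarity(condition_a, shifted), pearson(condition_a, shifted))
```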

      (3) Can we see a log-log version of 4C to see how the low-abundant proteins are behaving? In fact, the same is in part true for Figure 3A.

      We added the semi-log version of the graph for SCG1 (the homeostatic core) in Fig. 4C to make low-abundant proteins more visible. Please note that the growth rates under the two stationary-phase conditions were zero; therefore, plotting this graph in log-log format is not possible.

      Fig. 3A cannot be shown as a log-log plot because many of the coefficients are negative. The insets in the graphs clarify the points near the origin.

      (4) In 5L, how should one interpret the other dots that are close to the center but not part of the SCG1? And this theme continues in 6ACD and 7A.

      The SCGs were obtained by setting a cosine similarity threshold. Therefore, proteins that are close to SCG 1 (the homeostatic core) but do not belong to it have a cosine similarity below the threshold with any protein in SCG 1. Fig. 7 illustrates the expression patterns of the proteins in question.

      (5) Finally, I do not fully appreciate the whole analysis of connecting Ω<sub>csLE</sub> and Ω<sub>B</sub> and plots in 6 and 7. This corresponds to a lot of linear algebra in the 50 or so pages in section 1.8 in the supplementary. If the authors feel this is crucial in some way it needs to be better motivated and explained. I philosophically appreciate developing more formalism to establish these connections but I did not understand how this (maybe even if in the future) could lead to a new interpretation or analysis or theory.

      The mathematical analyses included in the supplementary materials are important for readers who are interested in understanding the mathematics behind our conclusions. However, we also thought these arguments were too detailed for many readers when preparing the original submission and decided to show them in the supplemental materials.

      To better explain the motivation behind the mathematical analyses, we revised the section “Representing the proteomes using the Raman LDA axes”.

      Please also see our reply to the comment (6) by Reviewer #2 (Recommendations for the authors) below.

      (6) Along the lines of the previous point, there seem to be two separate points being made: a) there is a correspondence between Raman and proteins, and b) we can use the protein data to look at centrality, generality, SCGs, etc. And the two don't seem to be linked until the formalism of the Ω spaces?

      The reviewer is correct that we can calculate and analyze some of the quantities introduced in this study, such as stoichiometry conservation centrality and expression generality, without Raman data. However, it is difficult to justify introducing these quantities without analyzing the correspondence between the Raman and proteome profiles. Moreover, the definition of expression generality was derived from the analysis of Raman-proteome correspondence (see section 2.2 of the Supplementary Materials). Therefore, point b) cannot stand alone without point a) from its initial introduction.

      To partially improve the readability and resolve the issue of complicated structure of this manuscript, we added references to Fig. S1, which is a diagram of the paper’s mainline, to several places in the main text. Please also see our reply to the comment (7) by Reviewer #1 (Recommendations for the authors).

    1. Author response:

      We would like to thank the three Reviewers for their thoughtful comments and detailed feedback. We are pleased to hear that the Reviewers found our paper to be “providing more direct evidence for the role of signals in different frequency bands related to predictability and surprise” (R1), “well-suited to test evidence for predictive coding versus alternative hypotheses” (R2), and “timely and interesting” (R3).

      We perceive that the reviewers have an overall positive impression of the experiments and analyses, but find the text somewhat dense and would like to see additional statistical rigor, as well as, in some cases, additional analyses to be included in supplementary material. We therefore provide here a provisional letter addressing the revisions we have already performed and outlining, point by point, the revisions we are planning. We begin each enumerated point with the Reviewer’s quoted text, followed by our response.

      Reviewer 1:

      (1) Introduction:

      The authors write in their introduction: "H1 further suggests a role for θ oscillations in prediction error processing as well." Without being fleshed out further, it is unclear what role this would be, or why. Could the authors expand this statement?

      We have edited the text to indicate that theta-band activity has been related to prediction error processing as an empirical observation, and must regrettably leave drawing inferences about its functional role to future work, with experiments designed specifically to draw out theta-band activity.

      (2) Limited propagation of gamma band signals:

      Some recent work (e.g. https://www.cell.com/cell-reports/fulltext/S2211-1247(23)00503-X) suggests that gamma-band signals reflect mainly entrainment of the fast-spiking interneurons, and don't propagate from V1 to downstream areas. Could the authors connect their findings to these emerging findings, suggesting no role in gamma-band activity in communication outside of the cortical column?

      We have not specifically claimed that gamma propagates between columns/areas in our recordings, only that it synchronizes synaptic current flows between laminar layers within a column/area. We nonetheless suggest that gamma can locally synchronize a column, and potentially local columns within an area via entrainment of local recurrent spiking, to update an internal prediction/representation upon onset of a prediction error. We also point the Reviewer to our Discussion section, where we state that our results fit with a model “whereby θ oscillations synchronize distant areas, enabling them to exchange relevant signals during cognitive processing.” In our present work, we therefore remain agnostic about whether theta or gamma or both (or alternative mechanisms) are at play in terms of how prediction error signals are transmitted between areas.

      (3) Paradigm:

      While I agree that the paradigm tests whether a specific type of temporal prediction can be formed, it is not a type of prediction that one would easily observe in mice, or even humans. The regularity that must be learned, in order to be able to see a reflection of predictability, integrates over 4 stimuli, each shown for 500 ms with a 500 ms blank in between (and a 1000 ms interval separating the 4th stimulus from the 1st stimulus of the next sequence). In other words, the mouse must keep in working memory three stimuli, which partly occurred more than a second ago, in order to correctly predict the fourth stimulus (and signal a 1000 ms interval as evidence for starting a new sequence).

      A problem with this paradigm is that positive findings are easier to interpret than negative findings. If mice do not show a modulation to the global oddball, is it because "predictive coding" is the wrong hypothesis, or simply because the authors generated a design that operates outside of the boundary conditions of the theory? I think the latter is more plausible. Even in more complex animals, (eg monkeys or humans), I suspect that participants would have trouble picking up this regularity and sequence, unless it is directly task-relevant (which it is not, in the current setting). Previous experiments often used simple pairs (where transitional probability was varied, eg, Meyer and Olson, PNAS 2012) of stimuli that were presented within an intervening blank period. Clearly, these regularities would be a lot simpler to learn than the highly complex and temporally spread-out regularity used here, facilitating the interpretation of negative findings (especially in early cortical areas, which are known to have relatively small temporal receptive fields).

      I am, of course, not asking the authors to redesign their study. I would like to ask them to discuss this caveat more clearly, in the Introduction and Discussion, and situate their design in the broader literature. For example, Jeff Gavornik has used much more rapid stimulus designs and observed clear modulations of spiking activity in early visual regions. I realize that this caveat may be more relevant for the spiking paper (which does not show any spiking activity modulation in V1 by global predictability) than for the current paper, but I still think it is an important general caveat to point out.

      We appreciate the Reviewer’s concern about working memory limitations in mice. Our paradigm and training followed on from previous paradigms such as Gavornik and Bear (2014), in which predictive effects were observed in mouse V1 with presentation times of 150ms and interstimulus intervals of 1500ms. In addition, we note that Jamali et al. (2024) recently utilized a similar global/local paradigm in the auditory domain with inter-sequence intervals as long as 28-30 seconds, and still observed effects of a predicted sequence (https://elifesciences.org/articles/102702). For the revised manuscript, we plan to expand on this in the Discussion section.

      That being said, as the Reviewer also pointed out, this would be a greater concern had we not found any positive findings in our study. However, even with the rather long sequence periods we used, we did find positive evidence for predictive effects, supporting the use of our current paradigm. We agree with the reviewer that these positive effects are easier to interpret than negative effects, and plan to expand upon this in the Discussion when we resubmit.

      (4) Reporting of results:

      I did not see any quantification of the strength of evidence of any of the results, beyond a general statement that all reported results pass significance at an alpha=0.01 threshold. It would be informative to know, for all reported results, what exactly the p-value of the significant cluster is; as well as for which performed tests there was no significant difference.

      For the revised manuscript, we can include the p-values after cluster-based testing for each significant cluster, as well as show data that passes a more stringent threshold of p<0.001 (1/1000) or p<0.005 (1/200) rather than our present p<0.01 (1/100).

      (5) Cluster test:

      The authors use a three-dimensional cluster test, clustering across time, frequency, and location/channel. I am wondering how meaningful this analytical approach is. For example, there could be clusters that show an early difference at some location in low frequencies, and then a later difference in a different frequency band at another (adjacent) location. It seems a priori illogical to me to want to cluster across all these dimensions together, given that this kind of clustering does not appear neurophysiologically implausible/not meaningful. Can the authors motivate their choice of three-dimensional clustering, or better, facilitating interpretability, cluster eg at space and time within specific frequency bands (2d clustering)?

      We are happy to include a 3D plot of a time-channel-frequency cluster in the revised manuscript to clarify our statistical approach for the reviewer. We consider our current three-dimensional cluster-testing an “unsupervised” way of uncovering significant contrasts with no theory-driven assumptions about which bounded frequency bands or layers do what.
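
      For illustration only, the logic of a cluster-based permutation test over a time-channel-frequency volume can be sketched as below. This is a minimal sketch under stated assumptions, not the analysis code used in the manuscript: it assumes paired condition differences with one observation per session, a sign-flip permutation scheme, face-connected clusters formed separately for positive and negative effects, and cluster mass (summed |t|) as the cluster statistic. The function names, the threshold, and the synthetic data are all hypothetical.

```python
from collections import deque

import numpy as np


def label_clusters(mask):
    """Label face-connected clusters of True entries in an N-D boolean array."""
    labels = np.zeros(mask.shape, dtype=int)
    current = 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue
        current += 1
        labels[start] = current
        queue = deque([start])
        while queue:
            idx = queue.popleft()
            for axis in range(mask.ndim):
                for step in (-1, 1):
                    nb = list(idx)
                    nb[axis] += step
                    if 0 <= nb[axis] < mask.shape[axis]:
                        nb = tuple(nb)
                        if mask[nb] and not labels[nb]:
                            labels[nb] = current
                            queue.append(nb)
    return labels, current


def signed_clusters(stat, threshold):
    """Return (member_mask, cluster_mass) pairs, formed separately per sign."""
    out = []
    for sign in (1, -1):
        labels, n = label_clusters(sign * stat > threshold)
        for k in range(1, n + 1):
            member = labels == k
            out.append((member, float(np.abs(stat[member]).sum())))
    return out


def paired_t(diffs):
    """One-sample t-map over the first axis (assumed: one row per session)."""
    n = diffs.shape[0]
    return diffs.mean(axis=0) / (diffs.std(axis=0, ddof=1) / np.sqrt(n))


def cluster_permutation_test(diffs, threshold, n_perm=200, seed=0):
    """Sign-flip permutation test using the maximum cluster mass as null."""
    rng = np.random.default_rng(seed)
    observed = signed_clusters(paired_t(diffs), threshold)
    sign_shape = (diffs.shape[0],) + (1,) * (diffs.ndim - 1)
    null_max = np.zeros(n_perm)
    for i in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=sign_shape)
        masses = [m for _, m in signed_clusters(paired_t(diffs * signs), threshold)]
        null_max[i] = max(masses, default=0.0)
    return [(member, mass, (1 + np.sum(null_max >= mass)) / (1 + n_perm))
            for member, mass in observed]


# Synthetic demo: 12 sessions, volume of (time, channel, frequency) bins,
# with an effect injected into one contiguous block.
rng = np.random.default_rng(1)
diffs = rng.normal(size=(12, 6, 5, 4))
diffs[:, 1:4, 1:3, 1:3] += 2.0
results = cluster_permutation_test(diffs, threshold=2.2)
best_member, best_mass, best_p = max(results, key=lambda r: r[1])
```

      Because the null distribution is built from the largest cluster in each permutation, each cluster's p-value is corrected for the multiple comparisons made across the whole volume, and no frequency band or layer has to be specified in advance.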

      Reviewer 2:

      Sennesh and colleagues analyzed LFP data from 6 regions of rodents while they were habituated to a stimulus sequence containing a local oddball (xxxy) and later exposed to either the same (xxxY) or a deviant global oddball (xxxX). Subsequently, they were exposed to a controlled random sequence (XXXY) or a controlled deterministic sequence (xxxx or yyyy). From these, the authors looked for differences in spectral properties (both oscillatory and aperiodic) between three contrasts (only for the last stimulus of the sequence).

      (1) Deviance detection: unpredictable random (XXXY) versus predictable habituation (xxxy)

      (2) Global oddball: unpredictable global oddball (xxxX) versus predictable deterministic (xxxx), and

      (3) "Stimulus-specific adaptation:" locally unpredictable oddball (xxxY) versus predictable deterministic (yyyy).

      They found evidence for an increase in gamma (and theta in some cases) for unpredictable versus predictable stimuli, and a reduction in alpha/beta, which they consider evidence towards the "predictive routing" scheme.

      While the dataset and analyses are well-suited to test evidence for predictive coding versus alternative hypotheses, I felt that the formulation was ambiguous, and the results were not very clear. My major concerns are as follows:

      We appreciate the reviewer’s concerns and outline how we will address them below:

      (1) The authors set up three competing hypotheses, in which H1 and H2 make directly opposite predictions. However, it must be noted that H2 is proposed for spatial prediction, where the predictability is computed from the part of the image outside the RF. This is different from the temporal prediction that is tested here. Evidence in favor of H2 is readily observed when large gratings are presented, for which there is substantially more gamma than in small images. Actually, there are multiple features in the spectral domain that should not be conflated, namely (i) the transient broadband response, which includes all frequencies, (ii) contribution from the evoked response (ERP), which is often in frequencies below 30 Hz, (iii) narrow-band gamma oscillations which are produced by large and continuous stimuli (which happen to be highly predictive), and (iv) sustained low-frequency rhythms in theta and alpha/beta bands which are prominent before stimulus onset and reduce after ~200 ms of stimulus onset. The authors should be careful to incorporate these in their formulation of PC, and in particular should not conflate narrow-band and broadband gamma.

      We have clarified in the manuscript that while the gamma-as-prediction hypothesis (our H2) was originally proposed in a spatial prediction domain, further work (specifically Singer (2021)) has extended the hypothesis to cover temporal-domain predictions as well.

      To address the reviewer’s point about multiple features in the spectral domain: Our analysis has specifically separated aperiodic components using FOOOF analysis (Supp. Fig. 1) and explicitly fit and tested aperiodic vs. periodic components (Supp. Figs 1&2). We did not find strong effects in the aperiodic components but did in the periodic components (Supp. Fig. 2), allowing us to be more confident in our conclusions in terms of genuine narrow-band oscillations. In the revised manuscript, we will include analysis of the pre-stimulus time window to address the reviewer’s point (iv) on sustained low frequency oscillations.
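
      The idea of separating aperiodic from periodic spectral components can be illustrated with the simplified sketch below. It is not the FOOOF algorithm (which additionally fits Gaussian peaks and refits the aperiodic component); it only fits a straight line to the power spectrum in log-log space and treats the residual as the putative periodic part. The function name and the synthetic spectrum are hypothetical.

```python
import numpy as np


def split_aperiodic(freqs, power):
    """One-pass log-log line fit: returns (exponent, offset, residual).

    The residual is log10 power above or below the fitted 1/f line and serves
    as a crude stand-in for the periodic (oscillatory) component.
    """
    log_f = np.log10(freqs)
    log_p = np.log10(power)
    slope, offset = np.polyfit(log_f, log_p, 1)
    aperiodic = offset + slope * log_f
    return -slope, offset, log_p - aperiodic


# Synthetic spectrum: 1/f^2 background plus a narrow-band peak at 40 Hz.
freqs = np.arange(2.0, 101.0)
bump = 1.0 * np.exp(-0.5 * ((freqs - 40.0) / 3.0) ** 2)
power = 10 ** (1.0 - 2.0 * np.log10(freqs) + bump)

exponent, offset, residual = split_aperiodic(freqs, power)
peak_freq = freqs[np.argmax(residual)]
```

      A broadband (aperiodic) change would appear as a shift in the fitted exponent or offset, whereas a genuine narrow-band oscillation appears as a localized peak in the residual.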

      (2) My understanding is that any aspect of predictive coding must be present before the onset of stimulus (expected or unexpected). So, I was surprised to see that the authors have shown the results only after stimulus onset. For all figures, the authors should show results from -500 ms to 500 ms instead of zero to 500 ms.

      In our revised manuscript we will include a pre-stimulus analysis and supplementary figures with time ranges from -500ms to 500ms. We refrained from doing so in the initial manuscript only because our paradigm’s short interstimulus interval makes it difficult to interpret whether activity in the ISI reflects post-stimulus dynamics or pre-stimulus prediction. Nonetheless, we can easily show that in our paradigm, alpha/beta-band activity is elevated in the interstimulus interval after the offset of the previous stimulus, assuming that we baseline to the pre-trial period.

      (3) In many cases, some change is observed in the initial ~100 ms of stimulus onset, especially for the alpha/beta and theta ranges. However, the evoked response contributes substantially in the transient period in these frequencies, and this evoked response could be different for different conditions. The authors should show the evoked responses to confirm the same, and if the claim really is that predictions are carried by genuine "oscillatory" activity, show the results after removing the ERP (as they had done for the CSD analysis).

      We have included an extra sentence in our Materials and Methods section clarifying that the evoked potential/ERP was removed in our existing analyses, prior to performing the spectral decomposition of the LFP signal. We also note that the FOOOF analysis we applied separates aperiodic components of the spectral signal from the strictly oscillatory ones.
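
      As a minimal illustration of what removing the evoked potential before spectral decomposition does, the sketch below subtracts the trial-averaged response from every trial and compares power spectra before and after. It assumes the subtraction is done per condition on an array of shape (trials, channels, samples); the function names, the Hann-windowed FFT, and the synthetic signals are hypothetical and are not taken from the manuscript.

```python
import numpy as np


def remove_evoked(trials):
    """Subtract the trial-averaged (phase-locked) response from each trial.

    trials: array of shape (n_trials, n_channels, n_samples), one condition.
    """
    trials = np.asarray(trials, dtype=float)
    return trials - trials.mean(axis=0, keepdims=True)


def mean_power(trials, fs):
    """Trial-averaged power spectrum from a Hann-windowed FFT."""
    n = trials.shape[-1]
    spec = np.fft.rfft(trials * np.hanning(n), axis=-1)
    power = (np.abs(spec) ** 2).mean(axis=0)
    return np.fft.rfftfreq(n, d=1.0 / fs), power


# Synthetic data: a phase-locked 10 Hz response (the "ERP") plus a 40 Hz
# oscillation whose phase varies randomly from trial to trial.
fs = 1000.0
t = np.arange(500) / fs
rng = np.random.default_rng(0)
n_trials = 200
evoked = np.sin(2 * np.pi * 10 * t)
phases = rng.uniform(0, 2 * np.pi, size=(n_trials, 1))
induced = np.sin(2 * np.pi * 40 * t + phases)
trials = evoked + induced + 0.1 * rng.normal(size=(n_trials, t.size))
trials = trials[:, None, :]  # add a single-channel axis

freqs, power_raw = mean_power(trials, fs)
_, power_induced = mean_power(remove_evoked(trials), fs)
i10 = int(np.argmin(np.abs(freqs - 10.0)))
i40 = int(np.argmin(np.abs(freqs - 40.0)))
```

      In this toy case the phase-locked 10 Hz power is almost entirely removed, while the non-phase-locked 40 Hz power is largely preserved, which is the distinction between evoked and induced activity that the reviewer raises.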

      In our revised manuscript we will include an analysis of the evoked responses as suggested by the reviewer.

      (4) I was surprised by the statistics used in the plots. Anything that is even slightly positive or negative is turning out to be significant. Perhaps the authors could use a more stringent criterion for multiple comparisons?

      As noted above to Reviewer 1 (point 4), we are happy to include supplemental figures in our resubmission showing the effects on our results of setting the statistical significance threshold with considerably greater stringency.

      (5) Since the design is blocked, there might be changes in global arousal levels. This is particularly important because the more predictive stimuli in the controlled deterministic stimuli were presented towards the end of the session, when the animal is likely less motivated. One idea to check for this is to do the analysis on the 3rd stimulus instead of the 4th? Any general effect of arousal/attention will be reflected in this stimulus.

      In order to check for the brain-wide effects of arousal, we plan to perform similar analyses to our existing ones on the 3rd stimulus in each block, rather than just the 4th “oddball” stimulus. Clusters that appear significantly contrasting in both the 3rd and 4th stimuli may be attributable to arousal. We will also analyze pupil size as an index of arousal to check for arousal differences between conditions in our contrasts, possibly stratifying our data before performing comparisons to equalize pupil size within contrasts. We plan to include these analyses in our resubmission.

      (6) The authors should also acknowledge/discuss that typical stimulus presentation/attention modulation involves both (i) an increase in broadband power early on and (ii) a reduction in low-frequency alpha/beta power. This could be just a sensory response, without having a role in sending prediction signals per se. So the predictive routing hypothesis should involve testing for signatures of prediction while ruling out other confounds related to stimulus/cognition. It is, of course, very difficult to do so, but at the same time, simply showing a reduction in low-frequency power coupled with an increase in high-frequency power is not sufficient to prove PR.

      Since different predictive coding and predictive processing hypotheses make very different claims about how predictions might be encoded in neurophysiological recordings, we have focused on prediction error encoding in this paper.

      For the hypothesis space we have considered (H1-H3), each hypothesis makes clearly distinguishable predictions about the spectral response during the time period in the task when prediction errors should be present. As noted by the reviewer, a transient increase in broadband frequencies would be a signature of H3. Changes to oscillatory power in the gamma band in distinct directions (e.g., increasing or decreasing with prediction error) would support either H1 or H2, depending on the direction of change. We believe our data, especially our use of FOOOF analysis and separation of periodic from aperiodic components, coupled to the three experimental contrasts, speak clearly in favor of the Predictive Routing model, but we do not claim we have “proved” it. This study provides just one datapoint, and we will acknowledge this in the revised Discussion in our resubmission.

      (7) The CSD results need to be explained better - you should explain on what basis they are being called feedforward/feedback. Was LFP taken from Layer 4 LFP (as was done by van Kerkoerle et al, 2014)? The nice ">" and "<" CSD patterns (Figure 3B and 3F of their paper) in that paper are barely observed in this case, especially for the alpha/beta range.

      We consider a feedforward pattern as flowing from L4 outwards to L2/3 and L5/6, and a feedback pattern as flowing in the opposite direction, from L1 and L6 to the middle layers. We will clarify this in the revised manuscript.

      Since gamma-band oscillations are strongest in L2/3, we re-epoched LFPs to the oscillation troughs in L2/3 in the initial manuscript. We can include in the revised manuscript equivalent plots after finding oscillation troughs in L4 instead, as well as calculating the difference in trough times within-band between layers to quantify the transmission delay and add additional rigor to our feedforward vs. feedback interpretation of the CSD data.

      (8) Figure 4a-c, I don't see a reduction in the broadband signal in a compared to b in the initial segment. Maybe change the clim to make this clearer?

      We are looking into the clim/colorbar and plot-generation code to figure out the visibility issue that the Reviewer has kindly pointed out to us.

      (9) Figure 5 - please show the same for all three frequency ranges, show all bars (including the non-significant ones), and indicate the significance (p-values or by *, **, ***, etc) as done usually for bar plots.

      We will add the requested bar-plots for all frequency ranges, though we note that the bars given here are the results of adding up the spectral power in the channel-time-frequency clusters that already passed significance tests and that adding secondary significance tests here may not prove informative.

      (10) Their claim of alpha/beta oscillations being suppressed for unpredictable conditions is not as evident. A figure akin to Figure 5 would be helpful to see if this assertion holds.

      As noted above, we will include the requested bar plot, and we will examine alpha/beta in the pre-stimulus time series rather than only after the onset of the oddball stimulus.

      (11) To investigate the prediction and violation or confirmation of expectation, it would help to look at both the baseline and stimulus periods in the analyses.

      We will include for the Reviewer’s edification a supplementary figure showing the spectrograms for the baseline and full-trial periods to look at the difference between baseline and prestimulus expectation.

      Reviewer 3:

      Summary:

      In their manuscript entitled "Ubiquitous predictive processing in the spectral domain of sensory cortex", Sennesh and colleagues perform spectral analysis across multiple layers and areas in the visual system of mice. Their results are timely and interesting as they provide a complement to a study from the same lab focussed on firing rates, instead of oscillations. Together, the present study argues for a hypothesis called predictive routing, which argues that non-predictable stimuli are gated by Gamma oscillations, while alpha/beta oscillations are related to predictions.

      Strengths:

      (1) The study contains a clear introduction, which provides a clear contrast between a number of relevant theories in the field, including their hypotheses in relation to the present data set.

      (2) The study provides a systematic analysis across multiple areas and layers of the visual cortex.

      We thank the Reviewer for their kind comments.

      Weaknesses:

      (1) It is claimed in the abstract that the present study supports predictive routing over predictive coding; however, this claim is nowhere in the manuscript directly substantiated. Not even the differences are clearly laid out, much less tested explicitly. While this might be obvious to the authors, it remains completely opaque to the reader, e.g., as it is also not part of the different hypotheses addressed. I guess this result is meant in contrast to reference 17, by some of the same authors, which argues against predictive coding, while the present work finds differences in the results, which they relate to spectral vs firing rate analysis (although without direct comparison).

      We agree that in this manuscript we should restrict ourselves to the hypotheses that were directly tested. We have revised our abstract accordingly and softened our claim to note only that our LFP results are compatible with predictive routing.

      (2) Most of the claims about a direction of propagation of certain frequency-related activities (made in the context of Figures 2-4) are - to the eyes of the reviewer - not supported by actual analysis but glimpsed from the pictures, sometimes, with very little evidence/very small time differences to go on. To keep these claims, proper statistical testing should be performed.

      In our revised manuscript, we will either substantiate (with quantification of CSD delays between layers) or soften the claims about feedforward/feedback direction of flow within the cortical column.

      (3) Results from different areas are barely presented. While I can see that presenting them in the same format as Figures 2-4 would be quite lengthy, it might be a good idea to contrast the right columns (difference plots) across areas, rather than just the overall averages.

      In our revised manuscript we will gladly include a supplementary figure showing the right-column difference plots across areas, in order to make sure to include aspects of our dataset that span up and down the cortical hierarchy.

      (4) Statistical testing is treated very generally, which can help to improve the readability of the text; however, in the present case, this is a bit extreme, with even obvious tests not reported or not even performed (in particular in Figure 5).

      We appreciate the Reviewer’s concern for statistical rigor, and as noted to the other reviewers, we can add different levels of statistical description and describe the p-values associated with specific clusters. Regarding Figure 5, we must protest, as the bar heights were computed from clusters already subjected to statistical testing and found significant. We could add a supplementary figure which considers untested narrowband activity and tests it only in the “bar height” domain, if the Reviewer would like.

      (5) The description of the analysis in the methods is rather short and, to my eye, was missing one of the key descriptions, i.e., how the CSD plots were baselined (which was hinted at in the results, but, as far as I know, not clearly described in the analysis methods). Maybe the authors could section the methods more to point out where this is discussed.

      We have added some elaboration to our Materials and Methods section, especially to specify that CSD, having physical rather than arbitrary units, does not require baselining.
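
      To illustrate why CSD carries physical units, the sketch below uses the standard second-spatial-derivative estimate, CSD(z) ≈ -σ [V(z+h) - 2V(z) + V(z-h)] / h², which yields A/m³ when V is in volts, h in metres and σ in S/m. This is a generic textbook estimator offered under stated assumptions, not necessarily the estimator used in the manuscript; the conductivity value, the electrode spacing and the function name are hypothetical.

```python
import numpy as np


def csd_second_derivative(lfp, spacing_m, conductivity=0.3):
    """Estimate CSD from laminar LFP with a second spatial difference.

    lfp: array of shape (n_channels, n_samples) in volts, with channels
         ordered by depth and equally spaced (spacing_m, in metres).
    conductivity: assumed tissue conductivity in S/m (illustrative value).
    Returns an array of shape (n_channels - 2, n_samples) in A/m^3 for the
    interior channels; with this sign convention sources are positive.
    """
    lfp = np.asarray(lfp, dtype=float)
    second_diff = lfp[2:] - 2.0 * lfp[1:-1] + lfp[:-2]
    return -conductivity * second_diff / spacing_m ** 2


# Check on a potential that is quadratic in depth, V(z) = a * z^2, for which
# the second derivative is 2a everywhere and the CSD is therefore -2 * a * sigma.
h = 1e-4                      # 100 micrometre spacing (assumed)
depth = np.arange(8) * h
a = 5.0
lfp = np.tile((a * depth ** 2)[:, None], (1, 3))
csd = csd_second_derivative(lfp, spacing_m=h, conductivity=0.3)
```

      Because the estimate depends only on differences between neighbouring channels, any offset common to all channels cancels, which is one reason a CSD profile can be interpreted without subtracting a baseline.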

      (6) While I appreciate the efforts of the authors to formulate their hypotheses and test them clearly, the text is quite dense at times. Partly this is due to the compared conditions in this paradigm; however, it would help a lot to show a visualization of what is being compared in Figures 2-4, rather than just showing the results.

      In the revised manuscript we will add a visual aid for the three contrasts we consider.

      We are happy to inform the editors that we have implemented, for the Reviewed Preprint, the direct textual Recommendations for the Authors given by Reviewers 2 and 3. We will implement the suggested Figure changes in our revised manuscript. We thank them for their feedback in strengthening our manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study develops and validates a neural subspace similarity analysis for testing whether neural representations of graph structures generalize across graph size and stimulus sets. The authors show the method works in rat grid and place cell data, finding that grid but not place cells generalize across different environments, as expected. The authors then perform additional analyses and simulations to show that this method should also work on fMRI data. Finally, the authors test their method on fMRI responses from the entorhinal cortex (EC) in a task that involves graphs that vary in size (and stimulus set) and statistical structure (hexagonal and community). They find neural representations of stimulus sets in lateral occipital complex (LOC) generalize across statistical structure and that EC activity generalizes across stimulus sets/graph size, but only for the hexagonal structures.

      Strengths:

      (1) The overall topic is very interesting and timely and the manuscript is well-written.

      (2) The method is clever and powerful. It could be important for future research testing whether neural representations are aligned across problems with different state manifestations.

      (3) The findings provide new insights into generalizable neural representations of abstract task states in the entorhinal cortex.

      We thank the reviewer for their kind comments and clear summary of the paper and its strengths.

      Weaknesses:

      (1) The manuscript would benefit from improving the figures. Moreover, the clarity could be strengthened by including conceptual/schematic figures illustrating the logic and steps of the method early in the paper. This could be combined with an illustration of the remapping properties of grid and place cells and how the method captures these properties.

      We agree with the reviewer and have added a schematic figure of the method (figure 1a).

      (2) Hexagonal and community structures appear to be confounded by training order. All subjects learned the hexagonal graph always before the community graph. As such, any differences between the two graphs could thus be explained (in theory) by order effects (although this is practically unlikely). However, given community and hexagonal structures shared the same stimuli, it is possible that subjects had to find ways to represent the community structures separately from the hexagonal structures. This could potentially explain why the authors did not find generalizations across graph sizes for community structures.

      We thank the reviewer for their comments. We agree that the null result regarding the community structures does not mean that EC doesn’t generalise over these structures, and that the training order could in theory contribute to the lack of an effect. The decision to keep the asymmetry of the training order was deliberate: we chose this order based on our previous study (Mark et al. 2020), where we show that learning a community structure first changes the learning strategy of subsequent graphs. We could have perhaps overcome this by increasing the training periods, but 1) the training period is already very long; 2) there will still be asymmetry because the group that first learns the community structure will struggle in learning the hexagonal graph more than vice versa, as shown in Mark et al. 2020.

      We have added the following sentences on this decision to the Methods section:

      “We chose to first teach hexagonal graphs for all participants and not randomize the order because of previous results showing that first learning community structure changes participants’ learning strategy (Mark et al. 2020).”

      (3) The authors include the results from a searchlight analysis to show the specificity of the effects of EC. A better way to show specificity would be to test for a double dissociation between the visual and structural contrast in two independently defined regions (e.g., anatomical ROIs of LOC and EC).

      Thanks for this suggestion. We indeed tried to run the analysis in a whole-ROI approach, but this did not result in a significant effect in EC. Importantly, we disagree with the reviewer that this is a “better way to show specificity” than the searchlight approach. In our view, the two analyses differ with respect to the spatial extent of the representation they test for. The searchlight approach is testing for a highly localised representation on the scale of small spheres with only 100 voxels. The signal of such a localised representation is likely to be drowned in the noise in an analysis that includes thousands of voxels which mostly don’t show the effect - as would be the case in the whole-ROI approach.

      (4) Subjects had more experience with the hexagonal and community structures before and during fMRI scanning. This is another confound, and possible reason why there was no generalization across stimulus sets for the community structure.

      See our response to comment (2).

      Reviewer #2 (Public review):

      Summary:

      Mark and colleagues test the hypothesis that entorhinal cortical representations may contain abstract structural information that facilitates generalization across structurally similar contexts. To do so, they use a method called "subspace generalization" designed to measure abstraction of representations across different settings. The authors validate the method using hippocampal place cells and entorhinal grid cells recorded in a spatial task, then perform simulations that support that it might be useful in aggregated responses such as those measured with fMRI. Then the method is applied to fMRI data that required participants to learn relationships between images in one of two structural motifs (hexagonal grids versus community structure). They show that the BOLD signal within an entorhinal ROI shows increased measures of subspace generalization across different tasks with the same hexagonal structure (as compared to tasks with different structures) but that there was no evidence for the complementary result (ie. increased generalization across tasks that share community structure, as compared to those with different structures). Taken together, this manuscript describes and validates a method for identifying fMRI representations that generalize across conditions and applies it to reveal entorhinal representations that emerge across specific shared structural conditions.

      Strengths:

      I found this paper interesting both in terms of its methods and its motivating questions. The question asked is novel and the methods employed are new - and I believe this is the first time that they have been applied to fMRI data. I also found the iterative validation of the methodology to be interesting and important - showing persuasively that the method could detect a target representation - even in the face of a random combination of tuning and with the addition of noise, both being major hurdles to investigating representations using fMRI.

      We thank the reviewer for their kind comments and the clear summary of our paper.

      Weaknesses:

      In part because of the thorough validation procedures, the paper came across to me as a bit of a hybrid between a methods paper and an empirical one. However, I have some concerns, both on the methods development/validation side, and on the empirical application side, which I believe limit what one can take away from the studies performed.

      We thank the reviewer for the comment. We agree that the paper comes across as a bit of a methods-empirical hybrid. We chose to do this because we believe (as the reviewer also points out) that there is value in both aspects of the paper.

      Regarding the methods side, while I can appreciate that the authors show how the subspace generalization method "could" identify representations of theoretical interest, I felt like there was a noticeable lack of characterization of the specificity of the method. Based on the main equation in the results section of the paper, it seems like the primary measure used here would be sensitive to overall firing rates/voxel activations, variance within specific neurons/voxels, and overall levels of correlation among neurons/voxels. While I believe that reasonable pre-processing strategies could deal with the first two potential issues, the third seems a bit more problematic - as obligate correlations among neurons/voxels surely exist in the brain and persist across context boundaries that are not achieving any sort of generalization (for example neurons that receive common input, or voxels that share spatial noise). The comparative approach (ie. computing difference in the measure across different comparison conditions) helps to mitigate this concern to some degree - but not completely - since if one of the conditions pushes activity into strongly spatially correlated dimensions, as would be expected if univariate activations were responsive to the conditions, then you'd expect generalization (driven by shared univariate activation of many voxels) to be specific to that set of conditions.

      We thank the reviewer for their comments. We would like to point out that we demean each voxel within all states/piles (three-picture sequences) in a given graph/task (what the reviewer calls “a condition”). Hence, no shared univariate activation of many voxels in response to a graph enters the computation, and the measure is not sensitive to the overall firing rate/voxel activation. Our calculation captures the variance across states within a task (here, a graph), over and above the univariate effect of graph activity. In addition, we spatially pre-whiten the data within each searchlight, meaning that voxels with high noise variance are downweighted and noise correlations between voxels are removed prior to applying our method.
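
      The two preprocessing steps described in this reply can be illustrated with a minimal sketch. It assumes a voxel-by-state activation matrix and a noise covariance estimated from GLM residuals; the function names, the residual input, and the shrinkage parameter are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def demean_states(B):
    """Subtract each voxel's mean across states (rows = voxels, columns = states)."""
    return B - B.mean(axis=1, keepdims=True)

def whiten(B, residuals, shrinkage=0.1):
    """Spatially pre-whiten B using a noise covariance estimated from residuals.

    residuals: (n_timepoints, n_voxels) array (hypothetical input).
    shrinkage: assumed regularisation towards the diagonal so the estimate stays invertible.
    """
    sigma = np.cov(residuals, rowvar=False)                  # voxel x voxel noise covariance
    sigma = (1 - shrinkage) * sigma + shrinkage * np.diag(np.diag(sigma))
    evals, evecs = np.linalg.eigh(sigma)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T      # sigma^(-1/2)
    return inv_sqrt @ B                                      # downweights noisy voxels

rng = np.random.default_rng(0)
n_voxels, n_states, n_time = 20, 10, 500
B = rng.normal(size=(n_voxels, n_states)) + 5.0              # large shared univariate offset
residuals = rng.normal(size=(n_time, n_voxels)) * rng.uniform(0.5, 3.0, n_voxels)

B_clean = whiten(demean_states(B), residuals)
```

      After demeaning, the shared offset of 5.0 no longer contributes: every voxel has zero mean across states, and this remains true after whitening because whitening is a linear map across voxels.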

      A second issue in terms of the method is that there is no comparison to simpler available methods. For example, given the aims of the paper, and the introduction of the method, I would have expected the authors to take the Neuron-by-Neuron correlation matrices for two conditions of interest, and examine how similar they are to one another, for example by correlating their lower triangle elements. Presumably, this method would pick up on most of the same things - although it would notably avoid interpreting high overall correlations as "generalization" - and perhaps paint a clearer picture of exactly what aspects of correlation structure are shared. Would this method pick up on the same things shown here? Is there a reason to use one method over the other?

      We thank the reviewer for this important and interesting point. We agree that calculating the correlation between the upper triangular elements of the covariance or correlation matrices picks up similar, but not identical, aspects of the data (see the mathematical explanation that was added to the supplementary information). When we repeated the searchlight analysis and calculated the correlation between the upper triangular entries of the Pearson correlation matrices, we obtained an effect in the EC, though weaker than with our subspace generalization method (t=3.9; the effect did not survive correction for multiple comparisons). Similar results were obtained with the correlation between the upper triangular elements of the covariance matrices (t=3.8; the effect did not survive correction for multiple comparisons).

      The difference between the two methods is twofold: 1) Our method is based on the covariance matrix and not the correlation matrix - i.e. a difference in normalisation. We realised that in the main text of the original paper we mistakenly wrote “correlation matrix” rather than “covariance matrix” (though our equations did correctly show the covariance matrix). We have corrected this mistake in the revised manuscript. 2) The weighting of the variance explained in the direction of each eigenvector is different between the methods, with some benefits of our method for identifying low-dimensional representations and for robustness to strong spatial correlations.  We have added a section “Subspace Generalisation vs correlating the Neuron-by-Neuron correlation matrices” to the supplementary information with a mathematical explanation of these differences.
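
      For concreteness, the simpler comparison discussed here (correlating the off-diagonal upper-triangular entries of two voxel-by-voxel matrices) might look like the following sketch. The simulated data, the latent-loading model, and all names are illustrative assumptions and do not reproduce the authors' searchlight analysis.

```python
import numpy as np

def upper_triangle_similarity(B1, B2, use_correlation=True):
    """Correlate off-diagonal upper-triangular entries of two voxel-by-voxel matrices.

    B1, B2: (n_voxels, n_states) activation matrices from two conditions.
    use_correlation: Pearson correlation matrices if True, covariance matrices if False.
    """
    M1 = np.corrcoef(B1) if use_correlation else np.cov(B1)
    M2 = np.corrcoef(B2) if use_correlation else np.cov(B2)
    iu = np.triu_indices(M1.shape[0], k=1)                   # exclude the diagonal
    return np.corrcoef(M1[iu], M2[iu])[0, 1]

rng = np.random.default_rng(1)
n_voxels, n_states, k = 30, 100, 3
A_shared = rng.normal(size=(n_voxels, k))                    # loadings shared by two conditions
A_other = rng.normal(size=(n_voxels, k))                     # unrelated loadings

def simulate(A):
    """Low-dimensional signal through loadings A plus independent noise (toy model)."""
    return A @ rng.normal(size=(k, n_states)) + 0.1 * rng.normal(size=(n_voxels, n_states))

B1, B2, B3 = simulate(A_shared), simulate(A_shared), simulate(A_other)
sim_shared = upper_triangle_similarity(B1, B2)
sim_unrelated = upper_triangle_similarity(B1, B3)
```

      Switching `use_correlation` to `False` gives the covariance-based variant, which differs only in normalisation, the first of the two differences listed above.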

      Regarding the fMRI empirical results, I have several concerns, some of which relate to concerns with the method itself described above. First, the spatial correlation patterns in fMRI data tend to be broad and will differ across conditions depending on variability in univariate responses (ie. if a condition contains some trials that evoke large univariate activations and others that evoke small univariate activations in the region). Are the eigenvectors that are shared across conditions capturing spatial patterns in voxel activations? Or, related to another concern with the method, are they capturing changing correlations across the entire set of voxels going into the analysis? As you might expect if the dynamic range of activations in the region is larger in one condition than the other?

      This is a searchlight analysis, therefore it captures the activity patterns within nearby voxels. Indeed, as we show in our simulation, areas with high activity and therefore high signal to noise will have better signal in our method as well. Note that this is true of most measures.

      My second concern is, beyond the specificity of the results, they provide only modest evidence for the key claims in the paper. The authors show a statistically significant result in the Entorhinal Cortex in one out of two conditions that they hypothesized they would see it. However, the effect is not particularly large. There is currently no examination of what the actual eigenvectors that transfer are doing/look like/are representing, nor how the degree of subspace generalization in EC may relate to individual differences in behavior, making it hard to assess the functional role of the relationship. So, at the end of the day, while the methods developed are interesting and potentially useful, I found the contributions to our understanding of EC representations to be somewhat limited.

      We agree with this point, yet we believe that the results still shed light on EC functionality. Unfortunately, we could not find a correlation between behavioral measures and the fMRI effect.

      Reviewer #3 (Public review):

      Summary:

      The article explores the brain's ability to generalize information, with a specific focus on the entorhinal cortex (EC) and its role in learning and representing structural regularities that define relationships between entities in networks. The research provides empirical support for the longstanding theoretical and computational neuroscience hypothesis that the EC is crucial for structure generalization. It demonstrates that EC codes can generalize across non-spatial tasks that share common structural regularities, regardless of the similarity of sensory stimuli and network size.

      Strengths:

      (1) Empirical Support: The study provides strong empirical evidence for the theoretical and computational neuroscience argument about the EC's role in structure generalization.

      (2) Novel Approach: The research uses an innovative methodology and applies the same methods to three independent data sets, enhancing the robustness and reliability of the findings.

      (3) Controlled Analysis: The results are robust against well-controlled data and/or permutations.

      (4) Generalizability: By integrating data from different sources, the study offers a comprehensive understanding of the EC's role, strengthening the overall evidence supporting structural generalization across different task environments.

      Weaknesses:

      A potential criticism might arise from the fact that the authors applied innovative methods originally used in animal electrophysiology data (Samborska et al., 2022) to noisy fMRI signals. While this is a valid point, it is noteworthy that the authors provide robust simulations suggesting that the generalization properties in EC representations can be detected even in low-resolution, noisy data under biologically plausible assumptions. I believe this is actually an advantage of the study, as it demonstrates the extent to which we can explore how the brain generalizes structural knowledge across different task environments in humans using fMRI. This is crucial for addressing the brain's ability in non-spatial abstract tasks, which are difficult to test in animal models.

      While focusing on the role of the EC, this study does not extensively address whether other brain areas known to contain grid cells, such as the mPFC and PCC, also exhibit generalizable properties. Additionally, it remains unclear whether the EC encodes unique properties that differ from those of other systems. As the authors noted in the discussion, I believe this is an important question for future research.

      We thank the reviewer for their comments. We agree with the reviewer that this is a very interesting question. We looked for effects in the mPFC but did not obtain results strong enough to report in the main manuscript; we do, however, report a small effect in the supplementary information.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) I wonder how important the PCA on B1 (voxel-by-state matrix from environment 1) and the computation of the AUC (from the projection on B2 [voxel-by-state matrix from environment 2]) is for the analysis to work. Would you not get the same result if you correlated the voxel-by-voxel correlation matrix based on B1 (C1) with the voxel-by-voxel correlation matrix based on B2 (C2)? I understand that you would not have the subspace-by-subspace resolution that comes from the individual eigenvectors, but would the AUC not strongly correlate with the correlation between C1 and C2?

      We agree with the reviewer's comments; see our response to Reviewer 2's second issue above.

      (2) There is a subtle difference between how the method is described for the neural recording and fMRI data. Line 695 states that principal components of the neuron x neuron intercorrelation matrix are computed, whereas line 888 implies that principal components of the data matrix B are computed. Of note, B is a voxel x pile rather than a pile x voxel matrix. Wouldn't this result in U being pile x pile rather than voxel x voxel?

      The PCs are calculated on the neuron x neuron (or voxel x voxel) covariance matrix of the activation matrix. We’ve added the following clarification to the relevant part of the Methods:

      “We calculated noise normalized GLM betas within each searchlight using the RSA toolbox. For each searchlight and each graph, we had a nVoxels (100) by nPiles (10) activation matrix (B) that describes the activation of a voxel as a result of a particular pile (three pictures’ sequence). We exploited the (voxel x voxel) covariance matrix of this matrix to quantify the manifold alignment within each searchlight.”
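
      The computation described in this passage can be sketched as follows. The sketch assumes, based on the reviewers' descriptions above, that the measure is the area under the cumulative variance-explained curve obtained by projecting one graph's activation matrix onto the eigenvectors of the other graph's voxel-by-voxel covariance matrix. Function and variable names are hypothetical, and the simulated data are purely illustrative.

```python
import numpy as np

def subspace_generalization_auc(B1, B2):
    """AUC of the variance in B2 explained by eigenvectors of B1's voxel covariance.

    B1, B2: (n_voxels, n_piles) activation matrices from two graphs.
    """
    B1 = B1 - B1.mean(axis=1, keepdims=True)                 # demean each voxel across piles
    B2 = B2 - B2.mean(axis=1, keepdims=True)
    C1 = B1 @ B1.T / B1.shape[1]                             # voxel x voxel covariance
    evals, U = np.linalg.eigh(C1)
    U = U[:, np.argsort(evals)[::-1]]                        # order by variance in B1
    var_per_pc = ((U.T @ B2) ** 2).sum(axis=1)               # B2 variance along each eigenvector
    cum = np.concatenate([[0.0], np.cumsum(var_per_pc) / var_per_pc.sum()])
    frac = np.linspace(0.0, 1.0, cum.size)                   # fraction of eigenvectors used
    return float(np.sum((cum[1:] + cum[:-1]) / 2 * np.diff(frac)))  # trapezoid rule

rng = np.random.default_rng(0)
n_voxels, n_piles, k = 100, 10, 3
A_shared = rng.normal(size=(n_voxels, k))                    # shared voxel loadings (toy model)
A_other = rng.normal(size=(n_voxels, k))                     # unrelated voxel loadings

def simulate(A):
    return A @ rng.normal(size=(k, n_piles)) + 0.1 * rng.normal(size=(n_voxels, n_piles))

B1, B2, B3 = simulate(A_shared), simulate(A_shared), simulate(A_other)
auc_shared = subspace_generalization_auc(B1, B2)
auc_unrelated = subspace_generalization_auc(B1, B3)
```

      Note that the eigenvectors here have length nVoxels, so the decomposition is voxel x voxel even though B itself is voxel x pile. The curve is anchored at zero variance explained for zero eigenvectors.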

      (3) It would be very helpful to the field if the authors would make the code and data publicly available. Please consider depositing the code for data analysis and simulations, as well as the preprocessed/extracted data for the key results (rat data/fMRI ROI data) into a publicly accessible repository.

      The code is publicly available on GitHub (https://github.com/ShirleyMgit/subspace_generalization_paper_code/tree/main).

      (4) Line 219: "Kolmogorov Simonov test" should be "Kolmogorov Smirnov test".

      Thanks! We have corrected this.

      (5) Please put plots in Figure 3F on the same y-axis.

      (6) Were large and small graphs of a given statistical structure learned on the same days, and if so, sequentially or simultaneously? This could be clarified.

      The graphs are learned on the same day.  We clarified this in the Methods section.

      Reviewer #2 (Recommendations for the authors):

      Perhaps the advantage of the method described here is that you could narrow things down to the specific eigenvector that is doing the heavy lifting in terms of generalization... and then you could look at that eigenvector to see what aspect of the covariance structure persists across conditions of interest. For example, is it just the highest eigenvalue eigenvector that is likely picking up on correlations across the entire neural population? Or is there something more specific going on? One could start to get at this by looking at Figures 1A and 1C - for example, the primary difference for within/between condition generalization in 1C seems to emerge with the first component, and not much changes after that, perhaps suggesting that in this case, the analysis may be picking up on something like the overall level of correlations within different conditions, rather than a more specific pattern of correlations.

      By the nature of the analysis, the eigenvectors are ordered by their contribution to the variance, so the first eigenvector accounts for more variance than the others. We did not rigorously check whether the remaining variance is split equally among the remaining eigenvectors, but this does not seem to be the case.

      Why is variance explained above zero for fraction EVs = 0 for figure 1C (but not 1A) ? Is there some plotting convention that I'm missing here?

      There was a small bug in this plot and it was corrected - thank you very much!

      The authors say:

      "Interestingly, the difference in AUCs was also significantly smaller than chance for place cells (Figure 1a, compare dotted and solid green lines, p<0.05 using permutation tests, see statistics and further examples in supplementary material Figure S2), consistent with recent models predicting hippocampal remapping that is not fully random (Whittington et al. 2020)."

      But my read of the Whittington model is that it would predict slight positive relationships here, rather than the observed negative ones, akin to what one would expect if hippocampal neurons reflect a nonlinear summation of a broad swath of entorhinal inputs.

      Smaller differences than chance imply that the remapping of place cells is not completely random.

      Figure 2:

      I didn't see any description of where noise amplitude values came from - or any justification at all in that section. Clearly, the amount of noise will be critical for putting limits on what can and cannot be detected with the method - I think this is worthy of characterization and explanation. In general, more information about the simulations is necessary to understand what was done in the pseudovoxel simulations. I get the gist of what was done, but these methods should be clear enough that someone could repeat them, and they currently are not.

      Thanks, we added noise amplitude to the figure legend and Methods.

      What does flexible mean in the title? The analysis only worked for the hexagonal grid - doesn't that suggest that whatever representations are uncovered here are not flexible in the sense of being able to encode different things?

      Flexible here means flexible with respect to task characteristics that are unrelated to the structural form, such as the stimuli and the size of the graph.

      Reviewer #3 (Recommendations for the authors):

      I have noticed that the authors have updated the previous preprint version to include extensive simulations. I believe this addition helps address potential criticisms regarding the signal-to-noise ratio. If the authors could share the code for the fMRI data and the simulations in an open repository, it would enhance the study's impact by reaching a broader readership across various research fields. Except for that, I have nothing to ask for revision.

      Thanks, the code will be publicly available: (https://github.com/ShirleyMgit/subspace_generalization_paper_code/tree/main).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      The authors present exciting new experimental data on the antigenic recognition of 78 H3N2 strains (from the beginning of the 2023 Northern Hemisphere season) against a set of 150 serum samples. The authors compare protection profiles of individual sera and find that the antigenic effect of amino acid substitutions at specific sites depends on the immune class of the sera, differentiating between children and adults. Person-to-person heterogeneity in the measured titers is strong, specifically in the group of children's sera. The authors find that the fraction of sera with low titers correlates with the inferred growth rate using maximum likelihood regression (MLR), a correlation that does not hold for pooled sera. The authors then measure the protection profile of the sera against historical vaccine strains and find that it can be explained by birth cohort for children. Finally, the authors present data comparing pre- and post- vaccination protection profiles for 39 (USA) and 8 (Australia) adults. The data shows a cohort-specific vaccination effect as measured by the average titer increase, and also a virus-specific vaccination effect for the historical vaccine strains. The generated data is shared by the authors and they also note that these methods can be applied to inform the bi-annual vaccine composition meetings, which could be highly valuable.

      Thanks for this nice summary of our paper.

      The following points could be addressed in a revision:

      (1) The authors conclude that much of the person-to-person and strain-to-strain variation seems idiosyncratic to individual sera rather than age groups. This point is not yet fully convincing. While the mean titer of an individual may be idiosyncratic to the individual sera, the strain-to-strain variation still reveals some patterns that are consistent across individuals (the authors note the effects of substitutions at sites 145 and 275/276). A more detailed analysis, removing the individual-specific mean titer, could still show shared patterns in groups of individuals that are not necessarily defined by the birth cohort.

      As the reviewer suggests, we normalized the titers for all sera to the geometric mean titer for each individual in the US-based pre-vaccination adults and children. This is only for the 2023-circulating viral strains. We then faceted these normalized titers by the same age groups we used in Figure 6, and the resulting plot is shown. Although there are differences among virus strains (some are better neutralized than others), there are not obvious age group-specific patterns (eg, the trends in the two facets are similar). This observation suggests that at least for these relatively closely related recent H3N2 strains, the strain-to-strain variation does not obviously segregate by age group. Obviously, it is possible (we think likely) that there would be more obvious age-group specific trends if we looked at a larger swath of viral strains covering a longer time range (eg, over decades of influenza evolution). We have added the new plots shown as a Supplemental Figure 6 in the revised manuscript.
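
      The normalization described in this reply amounts to dividing each serum's titers by that serum's geometric mean titer across strains. A minimal sketch, using hypothetical serum and strain labels and made-up titer values:

```python
from statistics import geometric_mean

def normalize_to_geometric_mean(titers_by_serum):
    """Divide each serum's titers by that serum's geometric mean titer across strains."""
    normalized = {}
    for serum, titers in titers_by_serum.items():
        gmt = geometric_mean(list(titers.values()))
        normalized[serum] = {strain: t / gmt for strain, t in titers.items()}
    return normalized

# Hypothetical example: two sera with the same strain-to-strain pattern
# but a tenfold difference in overall titer.
titers_by_serum = {
    "serum_low": {"strain_1": 100.0, "strain_2": 400.0},
    "serum_high": {"strain_1": 1000.0, "strain_2": 4000.0},
}
normalized = normalize_to_geometric_mean(titers_by_serum)
```

      After normalization, both sera show the same relative profile (about 0.5 and 2.0), so any remaining differences between individuals reflect strain-to-strain variation rather than overall titer magnitude.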

      (2) The authors show that the fraction of sera with a titer <138 correlates strongly with the inferred growth rate using MLR. However, the authors also note that there exists a strong correlation between the MLR growth rate and the number of HA1 mutations. This analysis does not yet show that the titers provide substantially more information about the evolutionary success. The actual relation between the measured titers and fitness is certainly more subtle than suggested by the correlation plot in Figure 5. For example, the clades A/Massachusetts and A/Sydney both have a positive fitness at the beginning of 2023, but A/Massachusetts has substantially higher relative fitness than A/Sydney. The growth inference in Figure 5b does not appear to map that difference, and the antigenic data would give the opposite ranking. Similarly, the clades A/Massachusetts and A/Ontario have both positive relative fitness, as correctly identified by the antigenic ranking, but at quite different times (i.e., in different contexts of competing clades). Other clades, like A/St. Petersburg are assigned high growth and high escape but remain at low frequency throughout. Some mention of these effects not mapped by the analysis may be appropriate.

      Thanks for the nice summary of our findings in Figure 5. However, the reviewer is misreading the growth charts when they say that A/Massachusetts/18/2022 has a substantially higher fitness than A/Sydney/332/2023. Figure 5a (reprinted at left panel) shows the frequency trajectory of different variants over time. While A/Massachusetts/18/2022 reaches a higher frequency than A/Sydney/332/2023, the trajectory is similar and the reason that A/Massachusetts/18/2022 reached a higher max frequency is that it started at a higher frequency at the beginning of 2023. The MLR growth rate estimates differ from the maximum absolute frequency reached: instead, they reflect how rapidly each strain grows relative to others. In fact, A/Massachusetts/18/2022 and A/Sydney/332/2023 have similar growth rates, as shown in Supplemental Figure 6b (reprinted at right). Similarly, A/Saint-Petersburg/RII-166/2023 starts at a low initial frequency but then grows even as A/Massachusetts/18/2022 and A/Sydney/332/2023 are declining, and so has a higher growth rate than both of those. 

      In the revised manuscript, we have clarified how viral growth rates are estimated from frequency trajectories, and how growth rate differs from max frequency in the text below:

      “To estimate the evolutionary success of different human H3N2 influenza strains during 2023, we used multinomial logistic regression, which analyzes strain frequencies over time to calculate strain-specific relative growth rates [51–53]. There were sufficient sequencing counts to reliably estimate growth rates in 2023 for 12 of the HAs for which we measured titers using our sequencing-based neutralization assay libraries (Figure 5a,b and Supplemental Figure 9a,b). Note that these growth rates estimate how rapidly each strain grows relative to the other strains, rather than the absolute highest frequency reached by each strain.”
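
      The distinction between growth rate and maximum frequency can be illustrated with a toy calculation. The sketch below is not multinomial logistic regression and does not use the evofr package; it swaps in a simpler least-squares fit to the log ratio of a strain's counts to a reference strain's counts, on noise-free made-up data, purely to show that two strains with the same relative growth rate can peak at very different frequencies.

```python
import math

def log_ratio_slope(counts, ref_counts, days):
    """Least-squares slope of log(counts / ref_counts) against time, per day."""
    y = [math.log(c / r) for c, r in zip(counts, ref_counts)]
    n = len(days)
    mean_t = sum(days) / n
    mean_y = sum(y) / n
    num = sum((t - mean_t) * (v - mean_y) for t, v in zip(days, y))
    den = sum((t - mean_t) ** 2 for t in days)
    return num / den

days = list(range(0, 120, 10))
rate = 0.03                                             # assumed per-day advantage over the reference
ref = [1000.0 for _ in days]                            # reference strain, constant counts
strain_x = [200.0 * math.exp(rate * t) for t in days]   # starts at a high frequency
strain_y = [20.0 * math.exp(rate * t) for t in days]    # starts low, same growth rate

total = [r + x + y for r, x, y in zip(ref, strain_x, strain_y)]
max_freq_x = max(x / n for x, n in zip(strain_x, total))
max_freq_y = max(y / n for y, n in zip(strain_y, total))
slope_x = log_ratio_slope(strain_x, ref, days)
slope_y = log_ratio_slope(strain_y, ref, days)
```

      Here both slopes equal the assumed rate, while the strain that started at a higher frequency reaches a maximum frequency ten times higher.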

      (3) For the protection profile against the vaccine strains, the authors find for the adult cohort that the highest titer is always against the oldest vaccine strain tested, which is A/Texas/50/2012. However, the adult sera do not show an increase in titer towards older strains, but only a peak at A/Texas. Therefore, it could be that this is a virus-specific effect, rather than a property of the protection profile. Could the authors test with one older vaccine virus (A/Perth/16/2009?) whether this really can be a general property?

      We are interested in studying immune imprinting more thoroughly using sequencing-based neutralization assays, but we note that the adults in the cohorts we studied would have been imprinted with much older strains than included in this library. As this paper focuses on the relative fitness of contemporary strains with minor secondary points regarding imprinting, these experiments are beyond the scope of this study. We’re excited for future work (from our group or others) to explore these points by making a new virus library with strains from multiple decades of influenza evolution. 

      Reviewer #2 (Public review):

      This is an excellent paper. The ability to measure the immune response to multiple viruses in parallel is a major advancement for the field, which will be relevant across pathogens (assuming the assay can be appropriately adapted). I only have a few comments, focused on maximising the information provided by the sera.

      Thanks very much!

      Firstly, one of the major findings is that there is wide heterogeneity in responses across individuals. However, we could expect that individuals' responses should be at least correlated across the viruses considered, especially when individuals are of a similar age. It would be interesting to quantify the correlation in responses as a function of the difference in ages between pairs of individuals. I am also left wondering what the potential drivers of the differences in responses are, with age being presumably key. It would be interesting to explore individual factors associated with responses to specific viruses (beyond simply comparing adults versus children).

      We thank the reviewer for this interesting idea. We performed this analysis (and the related analyses described) and added this as a new Supplemental Figure 7, which is pasted after the response to the next related comment by the reviewer. 

      For 2023-circulating strains, we observed basically no correlation between the strength of correlation between pairs of sera and the difference in age between those pairs of sera (Supplemental Figure 7), which was unsurprising given the high degree of heterogeneity between individual sera (Figure 3, Supplemental Figure 6, and Supplemental Figure 8). For vaccine strains, there is a moderate negative correlation only in the children, but not in the adults or the combined group of adults and children. This could be because the children are younger with limited and potentially more similar vaccine and exposure histories than the adults. It could also be because the children are overall closer in age than the adults.

      Relatedly, is the phylogenetic distance between pairs of viruses associated with similarity in responses?

      For 2023-circulating strains, across sera cohorts we observed a weak-to-moderate correlation between the strength of correlation between the neutralizing titers across all sera to pairs of viruses and the Hamming distances between virus pairs. For the same comparison with vaccine strains, we observed moderate correlations, but this must be caveated with the slightly larger range of Hamming distances between vaccine strains. Notably, many of the points on the negative correlation slope are a mix of egg- and cell-produced vaccine strains from similar years, but there are some strain comparisons where the same year’s egg- and cell-produced vaccine strains correlate poorly.

      Figure 5C is also a really interesting result. To be able to predict growth rates based on titers in the sera is fascinating. As touched upon in the discussion, I suspect it is really dependent on the representativeness of the sera of the population (so, e.g., if only elderly individuals provided sera, it would be a different result than if only children provided samples). It may be interesting to compare different hypotheses - so e.g., see if a population-weighted titer is even better correlated with fitness - so the contribution from each individual's titer is linked to a number of individuals of that age in the population. Alternatively, maybe only the titers in younger individuals are most relevant to fitness, etc.

      We’re very interested in these analyses, but suggest they may be better explored in subsequent works that could sample more children, teenagers and adults across age groups. Our sera set, as the reviewer suggests, may be under-powered to perform the proposed analysis on subsetted age groups of our larger age cohorts. 

      In Figure 6, the authors lump together individuals within 10-year age categories - however, this is potentially throwing away the nuances of what is happening at individual ages, especially for the children, where the measured viruses cross different groups. I realise the numbers are small and the viruses only come from a small number of years; however, it may be preferable to order all the individuals by age (y-axis) and the viral responses in ascending order (x-axis) and plot the response as a heatmap. As currently plotted, it is difficult to compare across panels.

      This is a good suggestion. In the revised manuscript we have included a heatmap of the children and pre-vaccination adults, ordered by the year of birth of each individual, as Supplemental figure 8. That new figure is also pasted in this response.

      Reviewer #3 (Public review):

      The authors use high-throughput neutralisation data to explore how different summary statistics for population immune responses relate to strain success, as measured by growth rate during the 2023 season. The question of how serological measurements relate to epidemic growth is an important one, and I thought the authors present a thoughtful analysis tackling this question, with some clear figures. In particular, they found that stratifying the population based on the magnitude of their antibody titres correlates more with strain growth than using measurements derived from pooled serum data. However, there are some areas where I thought the work could be more strongly motivated and linked together. In particular, how the vaccine responses in US and Australia in Figures 6-7 relate to the earlier analysis around growth rates, and what we would expect the relationship between growth rate and population immunity to be based on epidemic theory.

      Thank you for this nice summary. The reviewer also notes that the text related to Figures 6 and 7 is more secondary to the main story presented in Figures 3-5. The main motivation for including Figures 6 and 7 was to demonstrate the wide-ranging applications of sequencing-based neutralization data. We have tried to clarify this with the following minor text revisions, which do not add new content but which we hope smooth the transition between results sections.

      While the preceding analyses demonstrated the utility of sequencing-based neutralization assays for measuring titers of currently circulating strains, our library also included viruses with HAs from each of the H3N2 influenza Northern Hemisphere vaccine strains from the last decade (2014 to 2024, see Supplemental Table 1). These historical vaccine strains cover a much wider span of evolutionary diversity than the 2023-circulating strains analyzed in the preceding sections (Figure 2a,b and Supplemental Figure 2b-e). For this analysis, we focused on the cell-passaged strains for each vaccine, as these are more antigenically similar to their contemporary circulating strains than the egg-passaged vaccine strains since they lack the mutations that arise during growth of viruses in eggs [55–57] (Supplemental Table 1).

      Our sequencing-based assay could also be used to assess the impact of vaccination on neutralization titers against the full set of strains in our H3N2 library. To do this, we analyzed matched 28-day post-vaccination samples for each of the above-described 39 pre-vaccination samples from the cohort of adults based in the USA (Table 1). We also analyzed a smaller set of matched pre- and post-vaccination sera samples from a cohort of eight adults based in Australia (Table 1). Note that there are several differences between these cohorts: the USA-based cohort received the 2023-2024 Northern Hemisphere egg-grown vaccine whereas the Australia-based cohort received the 2024 Southern Hemisphere cell-grown vaccine, and most individuals in the USA-based cohort had also been vaccinated in the prior season whereas most individuals in the Australia-based cohort had not. Therefore, multiple factors could contribute to observed differences in vaccine response between the cohorts.

      Reviewer #3 (Recommendations for the authors):

      Main comments:

      (1) The authors compare titres of the pooled sera with the median titres across individual sera, finding a weak correlation (Figure 4). I was therefore interested in the finding that geometric mean titre and median across a study population are well correlated with growth rate (Supplemental Figure 6c). It would be useful to have some more discussion on why estimates from a pool are so much worse than pooled estimates.

      We thank this reviewer for this point. We would clarify that pooling sera is the equivalent of taking the arithmetic mean of the individual sera, rather than the geometric mean or median, which tends to bias the measurements of the pool to the outliers within the pool. To address this reviewer’s point, we’ve added the following text to the manuscript:

      “To confirm that sera pools are not reflective of the full heterogeneity of their constituent sera, we created equal volume pools of the children and adult sera and measured the titers of these pools using the sequencing-based neutralization assay. As expected, neutralization titers of the pooled sera were always higher than the median across the individual constituent sera, and the pool titers against different viral strains were only modestly correlated with the median titers across individual sera (Figure 4). The differences in titers across strains were also compressed in the serum pools relative to the median across individual sera (Figure 4). The failure of the serum pools to capture the median titers of all the individual sera is especially dramatic for the children sera (Figure 4) because these sera are so heterogeneous in their individual titers (Figure 3b). Taken together, these results show that serum pools do not fully represent individual-level heterogeneity, and are similar to taking the arithmetic mean of the titers for a pool of individuals, which tends to be biased by the highest titer sera.”
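
      A small numerical example shows why an arithmetic mean is pulled towards outliers while the median and geometric mean are not. The titer values are made up, and treating an equal-volume pool as the arithmetic mean of its constituent titers is the approximation stated in the text above, not an exact model of the assay.

```python
from statistics import geometric_mean, mean, median

# Hypothetical individual titers against one strain: four low sera and one high outlier.
titers = [30, 40, 50, 60, 4000]

pool_like = mean(titers)              # arithmetic mean, the stated approximation of a pool
typical = median(titers)              # median across individual sera
gmt = geometric_mean(titers)          # geometric mean titer

print(pool_like, typical)             # prints "836 50"
```

      A single high-titer serum raises the arithmetic mean far above the median, and the geometric mean falls between the two.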

      (2) Perhaps I missed it, but are growth rates weekly growth rates? (I assume so?)

      The growth rates are relative exponential growth rates calculated assuming a serial interval of 3.6 days. We also added clarifying language and a citation for the serial interval to the methods section:

      The analysis performing H3 HA strain growth rate estimates using the evofr[51] package is at https://github.com/jbloomlab/flu_H3_2023_seqneut_vs_growth. Briefly, we sought to make growth rate estimates for the strains in 2023 since this was the same timeframe when the sera were collected. To achieve this, we downloaded all publicly-available H3N2 sequences from the GISAID[88] EpiFlu database, filtering to only those sequences that closely matched a library HA1 sequence (within one HA1 amino-acid mutation) and were collected between January 2023 and December 2023. If a sequence was within one HA1 amino-acid mutation of multiple library HA1 proteins then it was assigned to the closest one; if there were multiple equally close matches then it was assigned fractionally to each match. We only made growth rate estimates for library strains with at least 80 sequencing counts (Supplemental Figure 9a), and ignored counts for sequences that did not match a library strain (equivalent results were obtained if we instead fit a growth rate for these sequences as an “other” category). We then fit multinomial logistic regression models using the evofr[51] package assuming a serial interval of 3.6 days[101] to the strain counts. For the plot in Figure 5a the frequencies are averaged over a 14-day sliding window for visual clarity, but the fits were to the raw sequencing counts. For most of the analyses in this paper we used models based on requiring 80 sequencing counts to make an estimate for strain growth rates, and counting a sequence as a match if it was within one amino-acid mutation; see https://jbloomlab.github.io/flu_H3_2023_seqneut_vs_growth/ for comparable analyses using different reasonable sequence count cutoffs (e.g., 60, 50, 40 and 30, as depicted in Supplemental Figure 9). Across sequence cutoffs, we found that the fraction of individuals with low neutralization titers and number of HA1 mutations correlated strongly with these MLR-estimated strain growth rates.
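      The sequence-to-strain matching rule described above (match within one HA1 amino-acid mutation, assign to the closest library strain, split ties fractionally) can be sketched as follows. This is a minimal illustration and not the code in the linked repository: the function names, the input format, and the use of Hamming distance on pre-aligned, equal-length sequences are assumptions, and the toy sequences are made up.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_counts(sequences, library, max_mutations=1):
    """Return fractional sequence counts per library strain.

    sequences: iterable of aligned HA1 amino-acid strings (hypothetical format)
    library:   dict mapping strain name -> aligned HA1 amino-acid string
    """
    counts = defaultdict(float)
    for seq in sequences:
        distances = {name: hamming(seq, ref) for name, ref in library.items()}
        best = min(distances.values())
        if best > max_mutations:
            continue  # no library strain is close enough, so the sequence is ignored
        closest = [name for name, d in distances.items() if d == best]
        for name in closest:
            counts[name] += 1.0 / len(closest)  # equally close matches share the count
    return dict(counts)


# Toy example with made-up five-residue "HA1" sequences.
library = {"strainA": "MKTII", "strainB": "MKTIV", "strainC": "QQQQQ"}
sequences = ["MKTII", "MKTIL", "AAAAA"]
strain_counts = assign_counts(sequences, library)
print(strain_counts)
```

      Here the first sequence matches strainA exactly, the second is one mutation from both strainA and strainB and is split between them, and the third matches nothing and is dropped. Strains whose total falls below the chosen count cutoff (80 in the main analysis) would then be excluded before fitting.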

      (3)  I found Figure 3 useful in that it presents phylogenetic structure alongside titres, to make it clearer why certain clusters of strains have a lower response. In contrast, I found it harder to meaningfully interpret Figure 7a beyond the conclusion that vaccines lead to a fairly uniform rise in titre. Do the 275 or 276 mutations that seem important for adults in Figure 3 have any impact?

      We are certainly interested in the questions this reviewer raises, and in trying to understand how well a seasonal vaccine protects against the most successful influenza variants that season. However, these post-vaccination sera were taken when neutralizing titers peak ~30 days after vaccination. Because of this, in the larger cohort of US-based post-vaccination adults, the median titers across sera to most strains appear uniformly high. In the Australian-based post-vaccination adults, there was some strain-to-strain variation in median titers across sera, but of course this must be caveated with the much smaller sample size. It might be more relevant to answer this question with longitudinally sampled sera, when titers begin to wane in the following months.

      (4)  It could be useful to define a mechanistic relationship about how you would expect susceptibility (e.g. fraction with titre < X, where X is a good correlate) to relate to growth via the reproduction number: R = R0 x S. For example, under the assumption the generation interval G is the same for all, we have R = exp(r*G), which would make it possible to make a prediction about how much we would expect the growth rate to change between S = 0.45 and 0.6, as in Fig 5c. This sort of brief calculation (or at least some discussion) could add some more theoretical underpinning to the analysis, and help others build on the work in settings with different fractions with low titres. It would also provide some intuition into whether we would expect relationships to be linear.

      This is an interesting idea for future work! However, the scope of our current study is to provide these experimental data and show a correlation with growth; we hope this can be used to build more mechanistic models in future.
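      For readers who want to see the reviewer's back-of-the-envelope relationship concretely, the sketch below combines R = R0 x S with R = exp(r*G) to give r = ln(R0 x S) / G. It is only an illustration of the reviewer's suggestion, not an analysis from the study: the value of R0 is hypothetical (it cancels out of the difference), and using the 3.6-day serial interval as the generation interval G is an assumption.

```python
import math


def growth_rate(r0: float, susceptible_fraction: float, generation_interval_days: float) -> float:
    """Per-day growth rate r implied by R = R0 * S and R = exp(r * G)."""
    return math.log(r0 * susceptible_fraction) / generation_interval_days


G = 3.6   # days; serial interval quoted in the response, used here as a stand-in for G
R0 = 2.0  # hypothetical value; it cancels out of the difference computed below

# Expected change in growth rate when the susceptible fraction rises from 0.45 to 0.60.
delta_r = growth_rate(R0, 0.60, G) - growth_rate(R0, 0.45, G)
print(f"difference in growth rate: {delta_r:.3f} per day")
```

      Under these assumptions the difference depends only on ln(0.60 / 0.45) / G, so the growth rate is linear in the logarithm of the susceptible fraction rather than in the fraction itself.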

      (5) A key conclusion from the analysis is that the fraction above a threshold of ~140 is particularly informative for growth rate prediction, so would it be worth including this in Figure 6-7 to give a clearer indication of how much vaccination reduces contribution to strain growth among those who are vaccinated? This could also help link these figures more clearly with the main analysis and question.

      Although our data do find ~140 to be the threshold that gives the maximum correlation with growth rate, we are not comfortable strongly concluding that 140 is a correlate of protection, as titers could influence viral fitness without completely protecting against infection. In addition, inspection of Figure 5d shows that while ~140 does give the maximal correlation, a good correlation is observed for most cutoffs in the range from ~40 to 200, so we cannot be confident that ~140 is the optimal threshold.

      (6)  In Figure 5, the caption doesn't seem to include a description for (e).

      Thank you to the reviewer for catching this – this is fixed now.

      (7)  The US vs Australia comparison could have benefited from more motivation. The authors conclude, "Due to the multiple differences between cohorts we are unable to confidently ascribe a cause to these differences in magnitude of vaccine response" - given the small sample sizes, what hypotheses could have been tested with these data? The comparison isn't covered in the Discussion, so it seems a bit tangential currently.

      Thank you to the reviewer for this comment, but we should clarify our aim was not to directly compare US and Australian adults. We are interested in regional comparisons between serum cohorts, but did not have the numbers to adequately address those questions here. This section (and the preceding question) were indeed both intended to be tangential to the main finding, and hopefully this will be clarified with our text additions in response to Reviewer #3’s public reviews.

    1. Author response:

      We thank the reviewers for their time and their constructive comments.

      Reviewer 1 makes several incisive comments about the single-cell RNA-sequencing dataset used in this version of the manuscript, which was previously published in Colquitt, 2021. The Reviewer correctly notes that this dataset consists primarily of nuclei from zebra finches, with a relatively small proportion of the data coming from Bengalese finches. However, all other data presented here come from assays and experiments in Bengalese finches. This discrepancy could lead to two issues of interpretation. First, there could be substantive expression differences in the CRH signaling pathway between these two species, making it difficult to interpret its cellular expression profile. Second, the Reviewer describes that in their reanalysis of this dataset they determined that what had been described as distinct cell types – namely HVC-Glut-1 vs. HVC-Glut-4 (corresponding to the HVC(RA) projection neurons) and the three RA-Glut types – are likely to be single cell types. The Reviewer notes that inter-individual differences in gene expression, which were not analyzed in the original publication, could have generated this apparent cell type diversity.

      To the first point, we agree that the use of the published dataset that consists primarily of zebra finch data is not ideal when making claims of cell type-specific expression in Bengalese finches. To rectify this issue, we have generated additional sets of snRNA-seq from Bengalese finches that encompass multiple areas of the song system as well as adjacent comparator regions outside of the principal song areas. Our initial analysis of these datasets indicates that the cellular patterns of expression of the CRH system are consistent with what has been presented here. In our revision, we will include a reanalysis of neuropeptide expression using these more extensive datasets.

      To the second point, we also agree that some of the instances of glutamatergic neuron diversity could have been generated either by issues stemming from the integration of two species or through interindividual differences. In our analysis of our newer snRNA-seq data, we also identify a single HVC(RA) projection neuron type (not two) and find that RA projection neuron types fall into one or two classes (not three), similar to what Reviewer 1 described. We have deconvolved these datasets by genotype, as suggested by the Reviewer, and do not see substantial interindividual variation across the CRH system. However, our revision will explicitly address these issues.

      Reviewer 1 also brings up several important questions concerning the relationships between CRHBP and singing and the challenge of interpreting the influences of song acquisition and deafening on CRHBP expression, given the variation in singing that generally accompanies these changes to song. To address this issue in part, our regression analysis of deafening-associated gene expression differences includes a term for the number of songs sung on the day of euthanasia as well as an interaction term between song destabilization and singing amount. This design controls for the amount that a bird sang in the period before brain collection. This analysis was included in (Colquitt et al., 2023), and will be further elaborated and discussed in the revised version of this manuscript. Notably, CRHBP expression shows a significant interaction between song destabilization and singing amount, suggesting that reduction of CRHBP following deafening is greater than what would be expected from any reductions in singing alone. This specific analysis will be included in the revised manuscript as well.

      However, despite these statistical controls, we cannot fully rule out that singing is playing a fundamental role in driving the CRHBP expression differences we see across conditions. Indeed, a number of studies have described an association between the amount a bird sings and the variability of its song (Chen et al., 2013; Hayase et al., 2018; Hilliard et al., 2012; Miller et al., 2010; Ohgushi et al., 2015), with a general trend of higher amounts of singing correlated with a reduction in variability. This relationship is consistent with what we see for CRHBP expression in RA and HVC: high in unmanipulated adult males and decreased during states of high variability and plasticity (post-deafening and juveniles). A model that combines these observations, and that we will include in the Discussion of the revised manuscript, is one in which singing induces the expression of CRHBP in RA and HVC, limiting CRH binding to its receptors, thereby limiting this pathway’s proposed effects on the excitability and synaptic plasticity of projection neurons.

      Reviewer 2 suggests multiple interesting avenues to more fully characterize the role of the CRH pathway in song performance and learning. First, we agree that HVC is a compelling target to investigate CRH’s role in song, given the similarity of CRHBP expression in HVC and RA across deafening, song acquisition, and singing. As the Reviewer notes, a number of studies have demonstrated key roles for interneurons in shaping neuronal dynamics in HVC and regulating song structure. Here, we focused on RA due to the direct influence RA projection neurons have on the syringeal and respiratory motoneurons controlling song production, and the resulting expectation that manipulations of CRH signaling in this region would have particularly measurable effects on song. However, we agree with the reviewer that it would be of additional interest to investigate manipulations of CRH signalling in HVC. We are considering whether it will be feasible, given the usual constraints of time, personnel, and other competing demands, to carry out such experiments as an addition to the current manuscript. Depending on how that goes, we will either add new experimental data to the manuscript, or simply acknowledge the interest of such experiments in the Discussion and defer their pursuit to future study.

      Likewise, Reviewer 2 suggests other ways in which an understanding of the role of CRH signalling could be further enriched with additional experiments, including investigating the influence of CRH signaling on song acquisition, when song transitions from a variable and plastic state to a precise and stereotyped state, and pursuing direct evidence that CRH influences the neurophysiology of glutamatergic neurons in HVC or RA. These are both excellent suggestions for ways in which neuropeptide signalling could be further linked to alterations in behavior. As we proceed with revisions we will consider whether we can address some of these suggestions within the scope of the current manuscript, or instead note them in the Discussion as directions for future research.

      Chen Q, Heston JB, Burkett ZD, White SA. 2013. Expression analysis of the speech-related genes FoxP1 and FoxP2 and their relation to singing behavior in two songbird species. J Exp Biol 216:3682–3692. doi:10.1242/jeb.085886

      Colquitt BM, Li K, Green F, Veline R, Brainard MS. 2023. Neural circuit-wide analysis of changes to gene expression during deafening-induced birdsong destabilization. Elife 12:e85970. doi:10.7554/eLife.85970

      Hayase S, Wang H, Ohgushi E, Kobayashi M, Mori C, Horita H, Mineta K, Liu W-C, Wada K. 2018. Vocal practice regulates singing activity-dependent genes underlying age-independent vocal learning in songbirds. PLoS Biol 16:e2006537. doi:10.1371/journal.pbio.2006537

      Hilliard AT, Miller JE, Fraley ER, Horvath S, White SA. 2012. Molecular microcircuitry underlies functional specification in a basal ganglia circuit dedicated to vocal learning. Neuron 73:537–552. doi:10.1016/j.neuron.2012.01.005

      Miller JE, Hilliard AT, White SA. 2010. Song practice promotes acute vocal variability at a key stage of sensorimotor learning. PLoS One 5:e8592. doi:10.1371/journal.pone.0008592

      Ohgushi E, Mori C, Wada K. 2015. Diurnal oscillation of vocal development associated with clustered singing by juvenile songbirds. J Exp Biol 218:2260–2268. doi:10.1242/jeb.115105

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      This study presents an exploration of PPGL tumour bulk transcriptomics and identifies three clusters of samples (labeled as subtypes C1-C3). Each subtype is then investigated for the presence of somatic mutations, metabolism-associated pathways and inflammation correlates, and disease progression. The proposed subtype descriptions are presented as an exploratory study. The proposed potential biomarkers from this subtype are suitably caveated and will require further validation in PPGL cohorts together with a mechanistic study.  

      The first section uses WGCNA (a method to identify clusters of samples based on gene expression correlations) to discover three transcriptome-based clusters of PPGL tumours. The second section inspects a previously published snRNAseq dataset, and labels some of the published cells as subtypes C1, C2, C3 (Methods could be clarified here), among other cells labelled as immune cell types. Further details about how the previously reported single-nuclei were assigned to the newly described subtypes C1-C3 require clarification.

      Thank you for your valuable suggestion. In response to the reviewer’s request for further clarification on “how previously published single-nuclei data were assigned to the newly defined C1-C3 subtypes,” we have provided additional methodological details in the revised manuscript (lines 103-109). Specifically, we aggregated the single-nucleus RNA-seq data to the sample level by summing gene counts across nuclei to generate pseudo-bulk expression profiles. These profiles were then normalized for library size, log-transformed (log1p), and z-scaled across samples. Using gene-set scores derived from our earlier WGCNA analysis of PPGLs, we defined transcriptional subtypes within the Magnus cohort (Supplementary Figure 1C). We further analyzed the single-nucleus data by classifying malignant (chromaffin) nuclei as C1, C2, or C3 based on their subtype scores, while non-malignant nuclei (including immune, stromal, endothelial, and others) were annotated using canonical cell-type markers (Figure 4A).
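      The pseudo-bulk steps described above (sum counts per sample, normalize for library size, log1p, z-scale across samples) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name, the counts-per-million scaling factor, and the random toy matrix are all hypothetical, and the subsequent gene-set scoring and subtype assignment are not shown.

```python
import numpy as np


def pseudobulk_zscores(counts: np.ndarray, sample_ids: np.ndarray, scale: float = 1e6):
    """counts: nuclei x genes raw counts; sample_ids: one sample label per nucleus."""
    samples = np.unique(sample_ids)
    # 1) Sum gene counts across all nuclei that belong to each sample.
    bulk = np.vstack([counts[sample_ids == s].sum(axis=0) for s in samples]).astype(float)
    # 2) Normalize for library size (here: counts per `scale`), then log1p-transform.
    lib_size = bulk.sum(axis=1, keepdims=True)
    logged = np.log1p(bulk / lib_size * scale)
    # 3) z-scale each gene across samples; genes with zero variance are set to 0.
    mean = logged.mean(axis=0)
    std = logged.std(axis=0)
    z = np.divide(logged - mean, std, out=np.zeros_like(logged), where=std > 0)
    return samples, z


# Toy input: 12 nuclei from 3 samples, 4 genes (random counts, illustrative only).
rng = np.random.default_rng(0)
counts = rng.poisson(5, size=(12, 4))
sample_ids = np.repeat(np.array(["s1", "s2", "s3"]), 4)

samples, z = pseudobulk_zscores(counts, sample_ids)
print(samples)
print(z.round(2))
```

      Each row of the returned matrix is one sample's z-scaled pseudo-bulk profile, which is the form in which gene-set scores from the bulk analysis could then be applied.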

      The tumour samples are obtained from multiple locations in the body (Figure 1A). It will be important to see further investigation of how the sample origin is distributed among the C1-C3 clusters, and whether there is a sample-origin association with mutational drivers and disease progression.

      Thank you for your valuable suggestion. In the revised manuscript (lines 74-79, Figure 1A, Table S1 and Supplementary Figure 1A), we harmonized anatomic site annotations from our PPGL cohort and the TCGA cohort and analyzed the distribution of tumor origin (adrenal vs extra-adrenal) across subtypes. The site composition is essentially uniform across C1-C3, at approximately 75% pheochromocytoma (PC) and 25% paraganglioma (PG), with only minimal variation. Notably, the proportion of extra-adrenal (paraganglioma) origin is slightly higher in the C1 subtype (see Supplementary Figure 1A), which aligns with the typically more aggressive behavior of tumors from this anatomical site.

      Reviewer #2 (Public Review):

      A study that furthers the molecular definition of PPGL (where prognosis is variable) and provides a wide range of sub-experiments to back up the findings. One of the key premises of the study is that identification of driver mutations in PPGL is incomplete and that compromises characterisation for prognostic purposes. This is a reasonable starting point on which to base some characterisation based on different methods. The cohort is a reasonable size, and a useful validation cohort in the form of TCGA is used. Whilst it would be resource-intensive (though plausible given the rarity of the tumour type) to perform RNA-seq on all PPGL samples in clinical practice, some potential proxies are proposed.

      We sincerely thank the reviewer for their positive assessment of our study’s rationale. We fully agree that RNA sequencing for all PPGL samples remains resource-intensive in current clinical practice, and its widespread application still faces feasibility challenges. It is precisely for this reason that, after defining transcriptional subtypes, we further focused on identifying and validating practical molecular markers and exploring their detectability at the protein level.

      In this study, we validated key markers such as ANGPT2, PCSK1N, and GPX3 using immunohistochemistry (IHC), demonstrating their ability to effectively distinguish among molecular subtypes (see Figure 5). This provides a potential tool for the clinical translation of transcriptional subtyping, similar to the transcription factor-based subtyping in small cell lung cancer where IHC enables low-cost and rapid molecular classification.

      It should be noted that the subtyping performance of these markers has so far been preliminarily validated only in our internal cohort of 87 PPGL samples. We agree with the reviewer that larger-scale, multi-center prospective studies are needed in the future to further establish the reliability and prognostic value of these markers in clinical practice.

      The performance of some of the proxy markers for transcriptional subtype is not presented.

      We agree with your comment regarding the need to further evaluate the performance of proxy markers for transcriptional subtyping, and we took this point into consideration in our study. To translate the transcriptional subtypes into a clinically applicable classification tool, we employed a linear regression model to compare the effect values (β values) of candidate marker genes across subtypes (Supplementary Figure 1D-F). Genes with the most significant β values and statistical differences were selected as representative markers for each subtype.

      Ultimately, we identified ANGPT2, PCSK1N, and GPX3 as robust marker genes for subtypes C1, C2, and C3, respectively: each is significantly overexpressed in its subtype and exhibits the most pronounced β value (Figure 5A and Supplementary Figure 1D-F). These results support the utility of these markers in subtype classification and have been thoroughly validated in our analysis.

      There is limited prognostic information available.

      Thank you for your valuable suggestion. In this exploratory revision, we present the available prognostic signal in Figure 5C. Given the current event numbers and follow-up time, we intentionally limited inference. We are continuing longitudinal follow-up of the PPGL cohort and will periodically update and report mature time-to-event analyses in subsequent work.

      Reviewer #1 (Recommendations for the authors):

      There is no deposition reference for the RNAseq transcriptomics data. Have the data been deposited in a suitable data repository?

      Thank you for your valuable suggestion. We have updated the Data availability section (lines 508–511) to clarify that the bulk-tissue RNA-seq datasets generated in this study are available from the corresponding author upon reasonable request.

      In the snRNAseq analysis of existing published data, clarify how cells were labelled as "C1", "C2", "C3", alongside cells labelled by cell type (the latter is described briefly in the Methods).

      Thank you for your valuable suggestion. In response to the reviewer’s request for further clarification on “how previously published single-nuclei data were assigned to the newly defined C1-C3 subtypes,” we have provided additional methodological details in the revised manuscript (lines 103-109). Specifically, we aggregated the single-nucleus RNA-seq data to the sample level by summing gene counts across nuclei to generate pseudo-bulk expression profiles. These profiles were then normalized for library size, log-transformed (log1p), and z-scaled across samples. Using gene-set scores derived from our earlier WGCNA analysis of PPGLs, we defined transcriptional subtypes within the Magnus cohort (Supplementary Figure 1C). We further analyzed the single-nucleus data by classifying malignant (chromaffin) nuclei as C1, C2, or C3 based on their subtype scores, while non-malignant nuclei (including immune, stromal, endothelial, and others) were annotated using canonical cell-type markers (Figure 4A).

      Package versions should be included (e.g., CellChat, monocle2).

      We greatly appreciate your comments and have now added a dedicated “Software and versions” subsection in Methods. Specifically, we report Seurat (v4.4.0), sctransform (v0.4.2), CellChat (v2.2.0), monocle (v2.36.0; monocle2), pheatmap (v1.0.13), clusterProfiler (v4.16.0), survival (v3.8.3), and ggplot2 (v3.5.2) (lines 514-516). We also corrected a typographical error (“mafools” → “maftools”) (line 463).

      Reviewer #2 (Recommendations for the authors):

      It would be helpful to provide a little more detail on the clinical composition of the cohort (e.g., phaeo vs paraganglioma, age, etc.) in the text, acknowledging that this is done in Figure 1.

      Thank you for your valuable suggestion. In the revision, we added Table S1, which provides a detailed summary of the clinical composition of the PPGL cohort. Specifically, we report the numbers and proportions (Supplementary Figure 1A) of pheochromocytoma (PC) versus paraganglioma (PG), further subclassifying PG into head and neck (HN-PG), retroperitoneal (RPPG), and bladder (BC-PG).

      How many of each transcriptional subtype had driver mutations (germline or somatic)? This is included in the figures but would be worth mentioning in the text. Presumably, some of these may be present but not detected (e.g., non-coding variants), and this should be commented on. It is feasible that if methods to detect all the relevant genomic markers were improved, then the rate of tumours without driver mutations would be less and their prognostic utility would be more comprehensive.

      Thank you for your valuable suggestion. In the revision (lines 113–116), we now report the prevalence of driver mutations (germline or somatic) overall and by transcriptional subtype. We analyzed variant data across 84 PPGL-relevant genes from 179 tumors in the TCGA cohort and 30 tumors in Magnus’s cohort (Fig. 2A; Table S2). High-frequency genes were consistent with known biology—C1 enriched for [e.g., VHL/SDHB], C2 for [e.g., RET/HRAS], and C3 for [e.g., SDHA/SDHD]. We also note that a subset of tumors lacked an identifiable driver, which likely reflects current assay limitations (e.g., non-coding or structural variants, subclonality, and purity effects). Broader genomic profiling (deep WGS/long-read, RNA fusion, methylation) would be expected to reduce the “driver-negative” fraction and further enhance the prognostic utility of these classifiers.

      ANGPT2 provides a reasonable predictive capacity for the C1 subtype as defined by the ROC AUC. What was the performance of the PCSK1N and GPX3 as markers of the other subtypes?

      We agree with your comment regarding the need to further evaluate the performance of proxy markers for transcriptional subtyping, and we have supplemented the analysis with ROC curves and AUC values for the two additional markers (Author response image 1, see below). Furthermore, we took this point into consideration in our study. To translate the transcriptional subtypes into a clinically applicable classification tool, we employed a linear regression model to compare the effect values (β values) of candidate marker genes across subtypes (Supplementary Figure 1D-F). Genes with the most significant β values and statistical differences were selected as representative markers for each subtype.

      Ultimately, we identified ANGPT2, PCSK1N, and GPX3 as robust marker genes for subtypes C1, C2, and C3, respectively: each is significantly overexpressed in its subtype and exhibits the most pronounced β value (Figure 5A and Supplementary Figure 1D-F). These results support the utility of these markers in subtype classification and have been thoroughly validated in our analysis.

      Author response image 1.

      Extended Data Figure A-B. (A) ROC curve showing the ability of PCSK1N expression to distinguish subtype C2 from non-C2 PPGLs. The red dot indicates the point with the highest sensitivity (93.1%) and specificity (82.8%). AUC, the area under the curve. (B) ROC curve showing the ability of GPX3 expression to distinguish subtype C3 from non-C3 PPGLs. The red dot indicates the point with the highest sensitivity (83.0%) and specificity (58.8%). AUC, the area under the curve.
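      As a sketch of how such a curve and its highlighted operating point can be computed from a single marker's expression, the code below builds an ROC curve, integrates the AUC with the trapezoid rule, and picks the threshold that maximizes Youden's J (sensitivity + specificity - 1). Reading the "red dot" as the Youden optimum is an assumption, since the response does not state the criterion, and the scores and labels are made-up values rather than study data.

```python
import numpy as np


def roc_points(scores, labels):
    """Sensitivity and specificity for every rule 'score >= t' (label 1 = in subtype)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    thresholds = np.unique(scores)[::-1]  # descending, so the ROC runs left to right
    sens = np.array([(scores[labels] >= t).mean() for t in thresholds])
    spec = np.array([(scores[~labels] < t).mean() for t in thresholds])
    return thresholds, sens, spec


def auc_trapezoid(sens, spec):
    """Area under the ROC curve, anchored at (0, 0) and (1, 1)."""
    fpr = np.concatenate(([0.0], 1.0 - spec, [1.0]))
    tpr = np.concatenate(([0.0], sens, [1.0]))
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


def youden_optimum(thresholds, sens, spec):
    """Threshold maximizing Youden's J = sensitivity + specificity - 1."""
    i = int(np.argmax(sens + spec - 1.0))
    return float(thresholds[i]), float(sens[i]), float(spec[i])


# Made-up marker expression scores and subtype labels (illustrative only).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 1, 0, 0]

thresholds, sens, spec = roc_points(scores, labels)
auc = auc_trapezoid(sens, spec)
best_t, best_sens, best_spec = youden_optimum(thresholds, sens, spec)
print(f"AUC = {auc:.3f}; best threshold = {best_t} "
      f"(sensitivity {best_sens:.2f}, specificity {best_spec:.2f})")
```

      Other criteria for choosing the operating point (for example, the point closest to the top-left corner) can give a different threshold, so the selection rule is worth stating alongside the reported sensitivity and specificity.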

      In the discussion, I think it would be valuable to summarise existing clinical/molecular predictors in PPGL and, acknowledging that their performance may be limited, compare them to the potential of these novel classifiers.

      Thank you for your valuable suggestion. We have added a concise overview of established clinical and molecular predictors in PPGL and compared them with the potential of our transcriptional classifiers. The new paragraph (Discussion, lines 315–338) now reads:

      “Compared to existing clinical and molecular predictors, risk assessment in PPGL has long relied on the following indicators: clinicopathological features (e.g., tumor size, non-adrenal origin, specific secretory phenotype, Ki-67 index), histopathological scoring systems (such as PASS/GAPP), and certain genetic alterations (including high-risk markers like SDHB inactivation mutations, as well as susceptibility gene mutations in ATRX, TERT promoter, MAML3, VHL, NF1, among others). Although these metrics are highly actionable in clinical practice, they exhibit several limitations: first, current molecular markers only cover a subset of patients, and technical constraints hinder the detection of many potentially significant variants (e.g., non-coding mutations), thereby compromising the comprehensiveness of prognostic evaluation; second, histopathological scoring is susceptible to interobserver variability; furthermore, the lack of standardized detection and evaluation protocols across institutions limits the comparability and generalizability of results. Our transcriptomic classification system—comprising C1 (pseudohypoxic/angiogenic signature), C2 (kinase-signaling signature), and C3 (SDHx-related signature)—provides a complementary approach to PPGL risk assessment. These subtypes reflect distinct biological backgrounds tied to specific genetic alterations and can be approximated by measuring the expression of individual genes (e.g., ANGPT2, PCSK1N, or GPX3). This study demonstrates that the classifier offers three major advantages: first, it accurately distinguishes subtypes with coherent biological features; second, it retains significant predictive value even after adjusting for clinical covariates; third, it can be implemented using readily available assays such as immunohistochemistry. 
These findings suggest that integrating transcriptomic subtyping with conventional clinical markers may offer a more comprehensive and generalizable risk stratification framework. However, this strategy would require validation through multi-center prospective studies and standardization of detection protocols.”

      A little more explanation of the principles behind WGCNA would be useful in the methods.

      We are grateful for your comments. We have expanded the Methods to briefly explain the principles of WGCNA (lines 426-454). In short, WGCNA constructs a weighted co-expression network from normalized gene expression, identifies modules of tightly co-expressed genes, summarizes each module by its eigengene (the first principal component), and then correlates module eigengenes with phenotypes (e.g., transcriptional subtypes) to highlight biologically meaningful gene sets and candidate hub genes. We now specify our preprocessing, choice of soft-thresholding power to approximate scale-free topology, module detection/merging criteria, and the statistics used for module–trait association and downstream gene-set scoring.
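      Two of the ingredients named above, the soft-thresholded adjacency and the module eigengene, can be sketched in a few lines. This is a simplified illustration and not the WGCNA R package or the authors' analysis: the function names are hypothetical, the power of 6 is an arbitrary illustrative value, the data are synthetic, and module detection (topological overlap plus hierarchical clustering) is omitted, with the module membership simply given.

```python
import numpy as np


def soft_threshold_adjacency(expr: np.ndarray, power: int = 6) -> np.ndarray:
    """expr: samples x genes. Unsigned adjacency a_ij = |cor(gene_i, gene_j)| ** power."""
    cor = np.corrcoef(expr, rowvar=False)
    adj = np.abs(cor) ** power  # raising to a power suppresses weak correlations
    np.fill_diagonal(adj, 0.0)
    return adj


def module_eigengene(expr: np.ndarray, module_genes) -> np.ndarray:
    """First principal component of the standardized expression of one module."""
    sub = expr[:, module_genes]
    sub = (sub - sub.mean(axis=0)) / sub.std(axis=0)
    u, _, _ = np.linalg.svd(sub, full_matrices=False)
    eigengene = u[:, 0]
    # Orient the eigengene so it correlates positively with mean module expression.
    if np.corrcoef(eigengene, sub.mean(axis=1))[0, 1] < 0:
        eigengene = -eigengene
    return eigengene


# Synthetic data: 20 samples; genes 0-2 share a latent signal, genes 3-4 are noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=20)
module = np.column_stack([latent + 0.1 * rng.normal(size=20) for _ in range(3)])
noise = rng.normal(size=(20, 2))
expr = np.hstack([module, noise])

adj = soft_threshold_adjacency(expr, power=6)
eig = module_eigengene(expr, [0, 1, 2])
print(adj.round(2))
print(round(float(np.corrcoef(eig, latent)[0, 1]), 3))
```

      In this toy example the three co-expressed genes keep strong connections after soft thresholding while connections to the noise genes shrink toward zero, and the eigengene tracks the shared latent signal, which is the quantity that would then be correlated with a trait such as subtype.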

      On line 234, I think the figure should be 5C?

      We greatly appreciate your comment and have corrected this to Figure 5C.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Weakness:

      I wonder how task difficulty and linguistic labels interact with the current findings. Based on the behavioral data, shapes with more geometric regularities are easier to detect when surrounded by other shapes. Do shape labels that are readily available (e.g., "square") help in making accurate and speedy decisions? Can the sensitivity to geometric regularity in intraparietal and inferior temporal regions be attributed to differences in task difficulty? Similarly, are the MEG oddball detection effects that are modulated by geometric regularity also affected by task difficulty?

      We see two aspects to the reviewer’s remarks.

      (1) Names for shapes.

      On the one hand, there is the question of whether it matters, in our task, that certain shapes have names and others do not. The work presented here is not designed to specifically test the effect of formal western education; however, in previous work (Sablé-Meyer et al., 2021), we noted that the geometric regularity effect remains present even for shapes that do not have specific names, and even in participants who do not have names for them. Thus, we replicated our main effects with both preschoolers and adults who did not attend formal western education and found that our geometric feature model remained predictive of their behavior; we refer the reader to this previous paper for an extensive discussion of the possible role of linguistic labels, and the impact of the statistics of the environment on task performance.

      What is more, in our behavior experiments we can discard data from any shape that has a name in English and run our model comparison again. Doing so diminished the effect size of the geometric feature model, but it remained predictive of human behavior: indeed, if we removed all shapes but kite, rightKite, rustedHinge, hinge and random (i.e., more than half of our data, and shapes for which we came up with names but there are no established names), we nevertheless find that both models significantly correlate with human behavior—see plot in Author response image 1, equivalent of our Fig. 1E with the remaining shapes.

      Author response image 1.

      An identical analysis on the MEG data leads to noisy but significant clusters (CNN: 64.0ms to 172.0ms, then 192.0ms to 296.0ms, both p<.001; Geometric Features: 312.0ms to 364.0ms with p=.008). We have improved our manuscript thanks to the reviewer’s observation by adding a figure with the new behavior analysis to the supplementary figures and to the result section of the behavior task. We now refer to these analyses where appropriate:

      (intro) “The effect appeared as a human universal, present in preschoolers, first-graders, and adults without access to formal western math education (the Himba from Namibia), and thus seemingly independent of education and of the existence of linguistic labels for regular shapes.”

      (behavior results) “Finally, to separate the effect of name availability and geometric features on behavior, we replicated our analysis after removing the square, rectangle, trapezoids, rhombus and parallelogram from our data (Fig. S5D). This left us with five shapes, and an RDM with 10 entries. When regressing it in a GLM with our two models, we find that both models are still significant predictors (p<.001). The effect size of the geometric feature model is greatly reduced, yet remained significantly higher than that of the neural network model (p<.001).”

      (meg results) “This analysis yielded similar clusters when performed on a subset of shapes that do not have an obvious name in English, as was the case for the behavior analysis (CNN Encoding: 64.0ms to 172.0ms, then 192.0ms to 296.0ms, both p<.001; Geometric Features: 312.0ms to 364.0ms with p=.008).”

      (discussion, end of behavior section) “Previously, we only found such a significant mixture of predictors in uneducated humans (whether French preschoolers or adults from the Himba community, mitigating the possible impact of explicit western education, linguistic labels, and statistics of the environment on geometric shape representation) (Sablé-Meyer et al., 2021).”
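
      For readers unfamiliar with this kind of analysis, the regression of a five-shape RDM (10 pairwise entries) on two model RDMs, as described in the behavior results quoted above, can be sketched as follows. This is a minimal illustration under our own assumptions, not the code used for the manuscript: the function names are hypothetical, predictors are z-scored, and the sketch returns only the two regression coefficients for a single RDM, leaving out the statistics across participants.

```python
import numpy as np

def upper_triangle(rdm):
    """Vectorize the entries above the diagonal of a square RDM."""
    rdm = np.asarray(rdm, dtype=float)
    rows, cols = np.triu_indices(rdm.shape[0], k=1)
    return rdm[rows, cols]

def zscore(v):
    """Center a vector and scale it to unit standard deviation."""
    return (v - v.mean()) / v.std()

def fit_two_model_glm(empirical_rdm, model_a_rdm, model_b_rdm):
    """Regress the empirical RDM on two model RDMs (plus an intercept).

    Returns the standardized coefficients (beta_a, beta_b).
    """
    y = zscore(upper_triangle(empirical_rdm))
    X = np.column_stack([
        np.ones_like(y),                      # intercept
        zscore(upper_triangle(model_a_rdm)),  # e.g. a geometric feature model
        zscore(upper_triangle(model_b_rdm)),  # e.g. a CNN model
    ])
    betas, *_ = np.linalg.lstsq(X, y, rcond=None)
    return betas[1], betas[2]
```

      With five shapes, `upper_triangle` returns exactly the 10 entries mentioned in the text, which is why the regression remains feasible but noisier after the named shapes are removed.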

      Perhaps the referee’s point can also be reversed: we provide a normative theory of geometric shape complexity which has the potential to explain why certain shapes have names: instead of seeing shape names as the cause of their simpler mental representation, we suggest that the converse could occur, i.e. the simpler shapes are the ones that are given names.

      (2) Task difficulty

      On the other hand, there is the question of whether our effect is driven by task difficulty. First, we would like to point out that this point could apply to the fMRI task, which asks for an explicit detection of deviants, but does not apply to the MEG experiment. In MEG, participants passively looked at sequences of shapes which, for a given block, comprised many instances of a fixed standard shape and rare deviants–even if they notice deviants, they have no task related to them. Yet two independent findings validated the geometric features model: there was a large effect of geometric regularity on the MEG response to deviants, and the MEG dissimilarity matrix between standard shapes correlated with a model based on geometric features, better than with a model based on CNNs. While the response to rare deviants might perhaps be attributed to “difficulty” (assuming that, in spite of the absence of an explicit task, participants try to spot the deviants and find this self-imposed task more difficult in runs with less regular shapes), it seems very hard to explain the representational similarity analysis (RSA) findings based on difficulty. Indeed, what motivated us to use RSA in both fMRI and MEG was to stop relying on the response to deviants, and use solely the data from standard or “reference” shapes, and model their neural response with theory-derived regressors.

      We have updated the manuscript in several places to make our view on these points clearer:

      (experiment 4) “This design allowed us to study the neural mechanisms of the geometric regularity effect without confounding effects of task, task difficulty, or eye movements.”

      (figure 4, legend) “(A) Task structure: participants passively watch a constant stream of geometric shapes, one per second (presentation time 800ms). The stimuli are presented in blocks of 30 identical shapes up to scaling and rotation, with 4 occasional deviant shapes. Participants do not have a task to perform besides fixating.”

      Reviewer #2 (Public review):

      Weakness:

      Given that the primary takeaway from this study is that geometric shape information is found in the dorsal stream, rather than the ventral stream, there is very little discussion of prior work in this area (for reviews, see Freud et al., 2016; Orban, 2011; Xu, 2018). Indeed, there is extensive evidence of shape processing in the dorsal pathway in human adults (Freud, Culham, et al., 2017; Konen & Kastner, 2008; Romei et al., 2011), children (Freud et al., 2019), patients (Freud, Ganel, et al., 2017), and monkeys (Janssen et al., 2008; Sereno & Maunsell, 1998; Van Dromme et al., 2016), as well as the similarity between models and dorsal shape representations (Ayzenberg & Behrmann, 2022; Han & Sereno, 2022).

      We thank the reviewer for this opportunity to clarify our writing. We want to use this opportunity to highlight that our primary finding is not about whether the shapes of objects or animals (in general) are processed in the ventral versus the dorsal pathway, but rather about the much more restricted domain of geometric shapes such as squares and triangles. We propose that simple geometric shapes afford additional levels of mental representation that rely on their geometric features – on top of the typical visual processing. To the best of our knowledge, this point has not been made in the above papers.

      Still, we agree that it is useful to better link our proposal to previous ones. We have updated the discussion section titled “Two Visual Pathways” to include more specific references to the literature that have reported visual object representations in the dorsal pathway. Following another reviewer’s observation, we have also updated our analysis to better demonstrate the overlap in activation evoked by math and by geometry in the IPS, as well as include a novel comparison with independently published results.

      Overall, to address this point, we (i) show the overlap between our “geometry” contrast (shape > word+tools+houses) and our “math” contrast (number > words); (ii) we display these ROIs side by side with ROIs found in previous work (Amalric and Dehaene, 2016), and (iii) in each of the math-related ROIs reported in that article, we test our “geometry” (shape > word+tools+houses) contrast and find almost all of them to be significant in both populations; see Fig. S5.

      Finally, within the ROIs identified with our geometry localizer, we also performed similarity analyses: for each region we extracted the betas of every voxel for every visual category, and estimated the distance (cross-validated Mahalanobis) between different visual categories. In both ventral ROIs, in both populations, numbers were closer to shapes than to the other visual categories including text and Chinese characters (all p<.001). In adults, this result also holds for the right ITG (p=.021) and the left IPS (p=.014) but not the right IPS (p=.17). In children, this result did not hold in these areas.
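
      To make the distance measure concrete, a cross-validated Mahalanobis distance between two categories can be sketched as below. This is a generic illustration rather than the implementation used for the manuscript: the function name is hypothetical, the noise covariance is assumed to be given (in practice it is estimated from residuals and usually regularized), and no normalization by the number of voxels is applied.

```python
import numpy as np

def crossvalidated_mahalanobis(patterns_a, patterns_b, noise_cov):
    """Cross-validated Mahalanobis distance between two conditions.

    patterns_a, patterns_b: runs x voxels arrays of beta estimates.
    noise_cov: voxels x voxels noise covariance (assumed to be provided).
    The pattern difference from one run is multiplied with the difference
    from a *different* run, so independent noise averages out and the
    estimate is unbiased (it can be negative for indistinguishable patterns).
    """
    precision = np.linalg.inv(noise_cov)
    diffs = np.asarray(patterns_a, dtype=float) - np.asarray(patterns_b, dtype=float)
    n_runs = diffs.shape[0]
    total, n_pairs = 0.0, 0
    for i in range(n_runs):
        for j in range(n_runs):
            if i != j:  # only combine estimates from independent runs
                total += diffs[i] @ precision @ diffs[j]
                n_pairs += 1
    return total / n_pairs
```

      Because the product is taken across independent runs, two categories with identical underlying patterns yield a distance that fluctuates around zero, which makes comparisons such as "numbers are closer to shapes than to text" interpretable.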

      Naturally, overlap in brain activation does not suffice to conclude that the same computational processes are involved. We have added an explicit caveat about this point. Indeed, throughout the article, we have been careful to frame our results in a way that is appropriate given our evidence, e.g. saying “Those areas are similar to those active during number perception, arithmetic, geometric sequences, and the processing of high-level math concepts” and “The IPS areas activated by geometric shapes overlap with those active during the comprehension of elementary as well as advanced mathematical concepts”. We have rephrased the possibly ambiguous “geometric shapes activated math- and number-related areas, particular the right aIPS.” into “geometric shapes activated areas independently found to be activated by math- and number-related tasks, in particular the right aIPS”.

      Reviewer #3 (Public review):

      Weakness:

      Perhaps the manuscript could emphasize that the areas recruited by geometric figures but not objects are spatial, with reduced processing in visual areas. It also seems important to say that the images of real objects are interpreted as representations of 3D objects, as they activate the same visual areas as real objects. By contrast, the images of geometric forms are not interpreted as representations of real objects but rather perhaps as 2D abstractions.

      This is an interesting possibility. Geometric shapes are likely to draw attention to spatial dimensions (e.g. length) and to do so in a 2D spatial frame of reference rather than the 3D representations evoked by most other objects or images. However, this possibility would require further work to be thoroughly evaluated, for instance by comparing usual 3D objects with rare instances of 2D ones (e.g. a sheet of paper, a sticker etc). In the absence of such a test, we refrained from further speculation on this point.

      The authors use the term "symbolic." That use of that term could usefully be expanded here.  

      The reviewer is right in pointing out that “symbolic” should have been more clearly defined. We have now added in the introduction:

      (introduction) “[…] we sometimes refer to this model as “symbolic” because it relies on discrete, exact, rule-based features rather than continuous representations (Sablé-Meyer et al., 2022). In this representational format, geometric shapes are postulated to be represented by symbolic expressions in a “language-of-thought”, e.g. “a square is a four-sided figure with four equal sides and four right angles” or equivalently by a computer-like program for drawing them in a Logo-like language (Sablé-Meyer et al., 2022).”

      However, the present experiments do not directly probe this representational format. We have therefore simplified our wording and removed many of our uses of the word “symbolic” in favor of the more specific “geometric features”.

      Pigeons have remarkable visual systems. According to my fallible memory, Herrnstein investigated visual categories in pigeons. They can recognize individual people from fragments of photos, among other feats. I believe pigeons failed at geometric figures and also at cartoon drawings of things they could recognize in photos. This suggests they did not interpret line drawings of objects as representations of objects.

      The comparison of geometric abilities across species is an interesting line of research. In the discussion, we briefly mention several lines of research that indicate that non-human primates do not perceive geometric shapes in the same way as we do – but for space reasons, we are reluctant to expand this section to a broader review of other more distant species. The referee is right that there is evidence of pigeons being able to perceive an invariant abstract 3D geometric shape in spite of much variation in viewpoint (Peissig et al., 2019) – but there does not seem to be evidence that they attend to geometric regularities specifically (e.g. squares versus non-squares). Also, the referee’s point bears on the somewhat different issue of whether humans and other animals may recognize the object depicted by a symbolic drawing (e.g. a sketch of a tree). Again, humans seem to be vastly superior in this domain, and research on this topic is currently ongoing in the lab. However, the point that we are making in the present work is specifically about the neural correlates of the representation of simple geometric shapes which by design were not intended to be interpretable as representations of objects.

      Categories are established in part by contrast categories; are quadrilaterals, triangles, and circles different categories?

      We are not sure how to interpret the referee’s question, since it bears on the definition of “category” (Spontaneous? After training? With what criterion?). While we are not aware of data that can unambiguously answer the reviewer’s question, categorical perception in geometric shapes can be inferred from early work investigating pop-out effects in visual search, e.g. (Treisman and Gormican, 1988): curvature appears to generate strong pop-out effects, and therefore we would expect e.g. circles to indeed be a different category than, say, triangles. Similarly, right angles, as well as parallel lines, have been found to be perceived categorically (Dillon et al., 2019).

      This suggests that indeed squares would be perceived as categorically different from triangles and circles. On the other hand, in our own previous work (Sablé-Meyer et al., 2021) we have found that the deviants that we generated from our quadrilaterals did not pop out from displays of reference quadrilaterals. Pop-out is probably not the proper criterion for defining what a “category” is, but this is the extent to which we can provide an answer to the reviewer’s question.

      It would be instructive to investigate stimuli that are on a continuum from representational to geometric, e.g., table tops or cartons under various projections, or balls or buildings that are rectangular or triangular. Building parts, inside and out. like corners. Objects differ from geometric forms in many ways: 3D rather than 2D, more complicated shapes, and internal texture. The geometric figures used are flat, 2-D, but much geometry is 3-D (e. g. cubes) with similar abstract features.

      We agree that there is a whole line of potential research here. We decided to start by focusing on the simplest set of geometric shapes that would give us enough variation in geometric regularity while being easy to match on other visual features. We agree with the reviewer that our results should hold both for more complex 2-D shapes and for 3-D shapes. Indeed, generative theories of shapes in higher dimensions following similar principles as ours have been devised (I. Biederman, 1987; Leyton, 2003). We now mention this in the discussion:

      “Finally, this research should ultimately be extended to the representation of 3-dimensional geometric shapes, for which similar symbolic generative models have indeed been proposed (Irving Biederman, 1987; Leyton, 2003).”

      The feature space of geometry is more than parallelism and symmetry; angles are important, for example. Listing and testing features would be fascinating. Similarly, looking at younger or preferably non-Western children, as Western children are exposed to shapes in play at early ages.

      We agree with the reviewer on all points. While we do not list and test the different properties separately in this work, we would like to highlight that angles are part of our geometric feature model, which includes features of “right-angle” and “equal-angles” as suggested by the reviewer.

      We also agree about the importance of testing populations with limited exposure to formal training with geometric shapes. This was in fact a core aspect of a previous article of ours, which tested both preschoolers and adults with no access to formal western education – though no non-Western children (Sablé-Meyer et al., 2021). It remains a challenge to perform brain-imaging studies in non-Western populations (although see Dehaene et al., 2010; Pegado et al., 2014).

      What in human experience but not the experience of close primates would drive the abstraction of these geometric properties? It's easy to make a case for elaborate brain processes for recognizing and distinguishing things in the world, shared by many species, but the case for brain areas sensitive to processing geometric figures is harder. The fact that these areas are active in blind mathematicians and that they are parietal areas suggests that what is important is spatial far more than visual. Could these geometric figures and their abstract properties be connected in some way to behavior, perhaps with fabrication and construction as well as use? Or with other interactions with complex objects and environments where symmetry and parallelism (and angles and curvature--and weight and size) would be important? Manual dexterity and fabrication also distinguish humans from great apes (quantitatively, not qualitatively), and action drives both visual and spatial representations of objects and spaces in the brain. I certainly wouldn't expect the authors to add research to this already packed paper, but raising some of the conceptual issues would contribute to the significance of the paper.

      We refrained from speculating about this point in the previous version of the article, but share some of the reviewers’ intuitions about the underlying drive for geometric abstraction. As described in (Dehaene, 2026; Sablé-Meyer et al., 2022), our hypothesis, which isn’t tested in the present article, is that the emergence of a pervasive ability to represent aspects of the world as compact expressions in a mental “language-of-thought” is what underlies many domains of specific human competence, including some listed by the reviewer (tool construction, scene understanding) and our domain of study here, geometric shapes.

      Recommendations for the Authors:

      Reviewer #1 (Recommendations for the authors):

      Overall, I enjoyed reading this paper. It is clearly written and nicely showcases the amount of work that has gone into conducting all these experiments and analyzing the data in sophisticated ways. I also thought the figures were great, and I liked the level of organization in the GitHub repository and am looking forward to seeing the shared data on OpenNeuro. I have some specific questions I hope the authors can address.

      (1) Behavior

      - Looking at Figure 1, it seemed like most shapes are clustering together, whereas square, rectangle, and maybe rhombus and parallelogram are slightly more unique. I was wondering whether the authors could comment on the potential influence of linguistic labels. Is it possible that it is easier to discard the intruder when the shapes are readily nameable versus not?

      This is an interesting observation, but the existence of names for shapes does not suffice to explain all of our findings; see our reply to the public comment.

      (2) fMRI

      - As mentioned in the public review, I was surprised that the authors went with an intruder task because I would imagine that performance depends on the specific combination of geometric shapes used within a trial. I assume it is much harder to find, for example, a "Right Hinge" embedded within "Hinge" stimuli than a "Right Hinge" amongst "Squares". In addition, the rotation and scaling of each individual item should affect regular shapes less than irregular shapes, creating visual dissimilarities that would presumably make the task harder. Can the authors comment on how we can be sure that the differences we pick up in the parietal areas are not related to task difficulty but are truly related to geometric shape regularities?

      Again, please see our public review response for a larger discussion of the impact of task difficulty. There are two aspects to answering this question.

      First, the task is not as the reviewer describes: the intruder task is to find a deviant shape within several slightly rotated and scaled versions of the regular shape it came from. During brain imaging, we did not ask participants to find an exemplar of one of our reference shapes amidst copies of another, but rather a deviant version of one shape against copies of its reference version. We only used this intruder task with all pairs of shapes to generate the behavioral RSA matrix.

      Second, we agree that some of the fMRI effect may stem from task difficulty, and this motivated our use of RSA in fMRI, and a passive MEG task. RSA results cannot be explained by task difficulty.

      Overall, we have tried to make the limitations of the fMRI design, and the motivation for turning to passive presentation in MEG, clearer by stating the issues more clearly when we introduce experiment 4:

      “The temporal resolution of fMRI does not allow to track the dynamic of mental representations over time. Furthermore, the previous fMRI experiment suffered from several limitations. First, we studied six quadrilaterals only, compared to 11 in our previous behavioral work. Second, we used an explicit intruder detection, which implies that the geometric regularity effect was correlated with task difficulty, and we cannot exclude that this factor alone explains some of the activations in figure 3C (although it is much less clear how task difficulty alone would explain the RSA results in figure 3D). Third, the long display duration, which was necessary for good task performance especially in children, afforded the possibility of eye movements, which were not monitored inside the 3T scanner and again could have affected the activations in figure 3C.”

      - How far in the periphery were the stimuli presented? Was eye-tracking data collected for the intruder task? Similar to the point above, I would imagine that a harder trial would result in more eye movements to find the intruder, which could drive some of the differences observed here.

      A 1-degree bar was added to Figure 3A, which faithfully illustrates how the stimuli were presented in fMRI. Eye-tracking data was not collected during fMRI. Although the participants were explicitly instructed to fixate at the center of the screen and avoid eye movements, we fully agree with the referee that we cannot exclude that eye movements were present, perhaps more so for more difficult displays, and would therefore have contributed to the observed fMRI activations in experiment 3 (figure 3C). We now mention this limitation explicitly at the end of experiment 3. However, crucially, this potential problem cannot apply to the MEG data. During the MEG task, the stimuli were presented one by one at the center of the screen, without any explicit task, thus avoiding issues of eye movements. We therefore consider the MEG geometrical regularity effect, which comes at a relatively early latency (starting at ~160 ms) and even in a passive task, to provide the strongest evidence of geometric coding, unaffected by potential eye movement artefacts.

      - I was wondering whether the authors would consider showing some un-thresholded maps just to see how widespread the activation of the geometric shapes is across all of the cortex.

      We share the uncorrected threshold maps in Fig. S3 for both adults and children in the category localizer, copied here as well. For the geometry task, most of the clusters identified are fairly big and survive cluster-corrected permutations; the uncorrected statistical maps look nearly identical to those presented in Fig. 3 (p<.001 map).

      - I'm missing some discussion on the role of early visual areas that goes beyond the RSA-CNN comparison. I would imagine that early visual areas are not only engaged due to top-down feedback (line 258) but may actually also encode some of the geometric features, such as parallel lines and symmetry. Is it feasible to look at early visual areas and examine what the similarity structure between different shapes looks like?

      If early visual areas encoded the geometric features that we propose, then even early sensor-level RSA matrices should show a strong impact of geometric feature similarity, which is not what we find (figure 4D). We do, however, appreciate the referee’s request to examine more closely what this similarity structure looks like. We now provide a movie showing the significant correlation between neural activity and our two models (uncorrected participants); indeed, while the early occipital activity (around 110ms) is dominated by a significant correlation with the CNN model, there are also scattered significant sources associated with the symbolic model around these timepoints already.

      To test this further, we used beamformers to reconstruct the source-localized activity in calcarine cortex and performed an RSA analysis across that ROI. We find that indeed the CNN model is strongly significant at t=110ms (t=3.43, df=18, p=.003) while the geometric feature model is not (t=1.04, df=18, p=.31), and the CNN is significantly above the geometric feature model (t=4.25, df=18, p<.001). However, this result is not very stable across time, and there are significant temporal clusters around these timepoints associated with each model, with no significant cluster for the CNN > geometric features contrast (CNN: significant cluster from 88ms to 140ms, p<.001 in a permutation-based test with 10000 permutations; geometric features: significant cluster from 80ms to 104ms, p=.0475; no significant cluster for the difference between the two).
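
      For reference, the form of the group-level statistics reported above (a t value with its degrees of freedom) can be sketched with the Python standard library. We assume here, for illustration only, that each model is tested with a one-sample t-test on per-participant RSA coefficients and that the two models are compared with a paired t-test; the function names are hypothetical and the sketch does not compute p-values or the cluster-based permutation test.

```python
import math
import statistics

def one_sample_t(values, popmean=0.0):
    """Return (t, df) for a one-sample t-test of `values` against `popmean`."""
    n = len(values)
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    return (mean - popmean) / sem, n - 1

def paired_t(values_a, values_b):
    """Return (t, df) for a paired t-test, i.e. a one-sample test on differences."""
    differences = [a - b for a, b in zip(values_a, values_b)]
    return one_sample_t(differences)
```

      Under this assumption, degrees of freedom of 18 would correspond to 19 per-participant coefficients entering each test.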

      (3) MEG

      - Similar to the fMRI set, I am a little worried that task difficulty has an effect on the decoding results, as the oddball should pop out more in more geometric shapes, making it easier to detect and easier to decode. Can the authors comment on whether it would matter for the conclusions whether they are decoding varying task difficulty or differences in geometric regularity, or whether they think this can be considered similarly?

      See above for an extensive discussion of the task difficulty effect. We point out that there is no task in the MEG data collection part. We have clarified the task design by updating our Fig. 4. Additionally, the fact that oddballs are perceived more or less easily as a function of their geometric regularity is, in part, exactly the point that we are making – but, in MEG, even in the absence of a task of looking for them.

      - The authors discuss that the inflated baseline/onset decoding/regression estimates may occur because the shapes are being repeated within a mini-block, which I think is unlikely given the long ISIs and the fact that the geometric features model is not >0 at onset. I think their second possible explanation, that this may have to do with smoothing, is very possible. In the text, it said that for the non-smoothed result, the CNN encoding correlates with the data from 60ms, which makes a lot more sense. I would like to encourage the authors to provide readers with the unsmoothed beta values instead of the 100-ms smoothed version in the main plot to preserve the reason they chose to use MEG - for high temporal resolution!

      We fully agree with the reviewer and have accordingly updated the figures to show the unsmoothed data (see below). Indeed, there is now no significant CNN effect before ~60 ms (up to the accuracy of identifying onsets with our method).

      - In Figure 4C, I think it would be useful to either provide error bars or show variability across participants by plotting each participant's beta values. I think it would also be nice to plot the dissimilarity matrices based on the MEG data at select timepoints, just to see what the similarity structure is like.

      Following the reviewer’s recommendation, we plot the timeseries with SEM as shaded area, and thicker lines for statistically significant clusters, and we provide the unsmoothed version in Fig. 4. As for the dissimilarity matrices at select timepoints, these have now been added to Fig. 4.

      - To evaluate the source model reconstruction, I think the reader would need a little more detail on how it was done in the main text. How were the lead fields calculated? Which data was used to estimate the sources? How are the models correlated with the source data?

      We have imported some of the details in the main text as follows (as well as expanding the methods section a little):

      “To understand which brain areas generated these distinct patterns of activations, and probe whether they fit with our previous fMRI results, we performed a source reconstruction of our data. We projected the sensor activity onto each participant's cortical surfaces estimated from T1-images. The projection was performed using eLORETA and empty-room recordings acquired on the same day to estimate noise covariance, with the default parameters of mne-bids-pipeline. Sources were spaced using a recursively subdivided octahedron (oct5). Group statistics were performed after alignment to fsaverage. We then replicated the RSA analysis […]”

      - In addition to fitting the CNN, which is used here to model differences in early visual cortex, have the authors considered looking at their fMRI results and localizing early visual regions, extracting a similarity matrix, and correlating that with the MEG and/or comparing it with the CNN model?

      We had ultimately decided against comparing the empirical similarity matrices from the MEG and fMRI experiments, first because the stimuli and tasks are different, and second because this would not be directly relevant to our goal, which is to evaluate whether a geometric-feature model accounts for the data. Thus, we systematically model empirical similarity matrices from fMRI and from MEG with our two models derived from different theories of shape perception in order to test predictions about their spatial and temporal dynamics. As for comparing the similarity matrix from early visual regions in fMRI with that predicted by the CNN model, this is effectively visible from our Fig. 3D where we perform searchlight RSA analysis and modeling with both the CNN and the geometric feature model; bilaterally, we find a correlation with the CNN model, although it sometimes overlaps with predictions from the geometric feature model as well. We now include a section explaining this reasoning in the appendix:

      “Representational similarity analysis also offers a way to directly compare similarity matrices measured in MEG and fMRI, thus allowing for fusion of those two modalities and tentatively assigning a “time stamp” to distinct MRI clusters. However, we did not attempt such an analysis here for several reasons. First, distinct tasks and block structures were used in MEG and fMRI. Second, a smaller list of shapes was used in fMRI, as imposed by the slower modality of acquisition. Third, our study was designed as an attempt to sort out between two models of geometric shape recognition. We therefore focused all analyses on this goal, which could not have been achieved by direct MEG-fMRI fusion, but required correlation with independently obtained model predictions.”

      Minor comments

      - It's a little unclear from the abstract that there is children's data for fMRI only.

      We have reworded the abstract to make this unambiguous.

      - Figures 4a & b are missing y-labels.

      We can see how our labels could be confused with (sub-)plot titles and have moved them to make the interpretation clearer.

      - MEG: are the stimuli always shown in the same orientation and size?

      They are not: each shape has a random orientation and scaling. In addition to the task example at the top of Fig. 4, we have now included a clearer mention of this in the main text where we introduce the task:

      “shapes were presented serially, one at a time, with small random changes in rotation and scaling parameters, in miniblocks with a fixed quadrilateral shape and with rare intruders with the bottom right corner shifted by a fixed amount (Sablé-Meyer et al., 2021)”

      - To me, the discussion section felt a little lengthy, and I wonder whether it would benefit from being a little more streamlined, focused, and targeted. I found that the structure was a little difficult to follow as it went from describing the result by modality (behavior, fMRI, MEG) back to discussing mostly aspects of the fMRI findings.

      We have tried to re-organize and streamline the discussion following these comments.

      Then, later on, I found that especially the section on "neurophysiological implementation of geometry" went beyond the focus of the data presented in the paper and was comparatively long and speculative.

      We have reexamined the discussion, but the citation of papers emphasizing a representation of non-accidental geometric properties in non-human animals was requested by other commentators on our article. Indeed, we think that they are relevant in the context of our prior suggestion that the composition of geometric features might be a uniquely human feature: these papers suggest that individual features may not be uniquely human, and that it is therefore compositionality which might be special to the human brain. We have nevertheless shortened it.

      Furthermore, we think that this section is important because symbolic models are often criticized for lack of a plausible neurophysiological implementation. It is therefore important to discuss whether and how the postulated symbolic geometric code could be realized in neural circuits. We have added this justification to the introduction of this section.

      Reviewer #2 (Recommendations for the authors):

      (1) If the authors want to specifically claim that their findings align with mathematical reasoning, they could at least show the overlap between the activation maps of the current study and those from prior work.

      This was added to the fMRI results. See our answers to the public review.

      (2) I wonder if the reason the authors only found aIPS in their first analysis (Figure 2) is because they are contrasting geometric shapes with figures that also have geometric properties. In other words, faces, objects, and houses also contain geometric shape information, and so the authors may have essentially contrasted out other areas that are sensitive to these features. One indication that this may be the case is that the geometric regularity effect and searchlight RSA (Figure 3) contains both anterior and posterior IPS regions (but crucially, little ventral activity). It might be interesting to discuss the implications of these differences.

      Indeed, we cannot exclude that the few symmetry, perpendicularity and parallelism cues that are present in faces, objects or houses were processed as such, perhaps within the ventral pathway, and that these representations would have been subtracted out. We emphasize that our subtraction isolates the geometrical features that are present in simple regular geometric shapes, over and above those that might exist in other categories. We have added this point to the discussion:

      “[… ] For instance, faces possess a plane of quasi-symmetry, and so do many other man-made tools and houses. Thus, our subtraction isolated the geometrical features that are present in simple regular geometric shapes (e.g. parallels, right angles, equality of length) over and above those that might already exist, in a less pure form, in other categories.”

      (3) I had a few questions regarding the MEG results.

      a. I didn't quite understand the task. What is a regular or oddball shape in this context? It's not clear what is being decoded. Perhaps a small example of the MEG task in Figure 4 would help?

      We now include an additional sub-figure in Fig. 4 to explain the paradigm. In brief: there is no explicit task, participants are simply asked to fixate. The shapes come in miniblocks of 30 identical reference shapes (up to rotation and scaling), among which some occasional deviant shapes randomly appear (created by moving the corner of the reference shape by some amount).
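
The structure of a miniblock described above can be sketched as follows. Only the miniblock length (30) and the presence of random rotation, random scaling and occasional deviants come from the text; the deviant probability, the jitter ranges and all names are hypothetical placeholders, and whether deviants count toward the 30 items is not specified in the response.

```python
import random

def make_miniblock(shape, n_trials=30, p_deviant=0.1, seed=None):
    """Build one passive-viewing miniblock for a single reference shape.

    p_deviant and the jitter ranges below are illustrative assumptions.
    """
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        trials.append({
            "shape": shape,                             # fixed within a miniblock
            "deviant": rng.random() < p_deviant,        # corner-shifted intruder
            "rotation_deg": rng.uniform(-15.0, 15.0),   # hypothetical range
            "scale": rng.uniform(0.9, 1.1),             # hypothetical range
        })
    return trials

block = make_miniblock("square", seed=1)
n_deviants = sum(trial["deviant"] for trial in block)
```

Passing a seed makes a miniblock reproducible, which is convenient when the same stimulus sequence has to be regenerated for analysis.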

      b. In Figure 4A/B they describe the correlation with a 'symbolic model'. Is this the same as the geometric model in 4C?

      It is. We have removed this ambiguity by calling it “geometric model” and setting its color to the one associated with this model throughout the article.

      c. The authors' explanation for why geometric feature coding was slower than CNN encoding doesn't quite make sense to me. As an explanation, they suggest that previous studies computed "elementary features of location or motor affordance", whereas their study examines "high-level mathematical information of an abstract nature." However, looking at the studies the authors cite in this section, it seems that these studies also examined the time course of shape processing in the dorsal pathway, not "elementary features of location or motor affordance." Second, it's not clear how the geometric feature model reflects high-level mathematical information (see point above about claiming this is related to math).

      We thank the referee for pointing out this inappropriate phrase, which we removed. We rephrased the rest of the paragraph to clarify our hypothesis in the following way:

      “However, in this work, we specifically probed the processing of geometric shapes that, if our hypothesis is correct, are represented as mental expressions that combine geometrical and arithmetic features of an abstract categorical nature, for instance representing “four equal sides” or “four right angles”. It seems logical that such expressions, combining number, angle and length information, take more time to be computed than the first wave of feedforward processing within the occipito-temporal visual pathway, and therefore only activate thereafter.”

      One explanation may be that the authors' geometric shapes require finer-grained discrimination than the object categories used in prior studies. i.e., the odd-ball task may be more of a fine-grained visual discrimination task. Indeed, it may not be a surprise that one can decode the difference between, say, a hammer and a butterfly faster than two kinds of quadrilaterals.

      We do not disagree with this intuition, although note that we do not have data on this point (we are reporting and modelling the MEG RSA matrix across geometric shapes only; in this part, no other shapes such as tools or faces are involved). Still, the difference between squares, rectangles, parallelograms and other geometric shapes in our stimuli is not so subtle. Furthermore, CNNs do make very fine-grained distinctions, for instance between many different breeds of dogs in the ImageNet corpus. Still, those sorts of distinctions capture the initial part of the MEG response, while the geometric model is needed only for the later part. Thus, we think that it is a genuine finding that geometric computations associated with the dorsal parietal pathway are slower than the image analysis performed by the ventral occipito-temporal pathway.

      d. CNN encoding at time 0 is a little weird, but the authors' explanation, that this reflects temporal smoothing with a 100 ms window, makes sense. However, smoothing by 100 ms is quite a lot, and it doesn't seem accurate to present continuous time course data when the decoding or RSA result at each time point reflects a 100 ms bin. It may be more accurate to simply show unsmoothed data. I'm less convinced by the explanation about shape prediction.

      We agree. Following the reviewer’s advice, as well as the recommendation from reviewer 1, we now display unsmoothed plots, and the effects now exhibit a more reasonable timing (Figure 4D), with effects starting around 60 ms for CNN encoding.
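
To illustrate why a 100 ms smoothing window can make an effect appear earlier than it really is, here is a small self-contained sketch. It assumes a centered moving average, a 10 ms sampling step, and a toy effect with a true onset at 60 ms; none of these values are taken from the authors' pipeline.

```python
def moving_average(signal, window):
    """Centered moving average; the window is truncated at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed

def onset_index(signal, threshold=0.0):
    """Index of the first sample strictly above threshold, or None."""
    for i, value in enumerate(signal):
        if value > threshold:
            return i
    return None

# Toy time course sampled every 10 ms from -100 ms to 290 ms.
times_ms = [-100 + 10 * i for i in range(40)]
# Hypothetical effect that truly starts at 60 ms.
raw = [1.0 if t >= 60 else 0.0 for t in times_ms]
# 11 samples span 100 ms at 10 ms per sample.
smoothed = moving_average(raw, window=11)

raw_onset = times_ms[onset_index(raw)]            # 60 ms
smoothed_onset = times_ms[onset_index(smoothed)]  # 10 ms, half a window early
```

In this toy case the smoothed curve departs from zero half a window (50 ms) before the true onset, so an effect that begins shortly after stimulus onset can appear to start at or before time 0 once smoothed.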

      (4) I appreciate the authors' use of multiple models and their explanation for why DINOv2 explains more variance than the geometric and CNN models (that it represents both types of features). A variance partitioning analysis may help strengthen this conclusion (Bonner & Epstein, 2018; Lescroart et al., 2015).

      However, one difference between DINOv2 and the CNN used here is that it is trained on a dataset of 142 million images vs. the 1.5 million images used in ImageNet. Thus, DINOv2 is more likely to have been exposed to simple geometric shapes during training, whereas standard ImageNet trained models are not. Indeed, prior work has shown that lesioning line drawing-like images from such datasets drastically impairs the performance of large models (Mayilvahanan et al., 2024). Thus, it is unlikely that the use of a transformer architecture explains the performance of DINOv2. The authors could include an ImageNet-trained transformer (e.g., ViT) and a CNN trained on large datasets (e.g., ResNet trained on the Open Clip dataset) to test these possibilities. However, I think it's also sufficient to discuss visual experience as a possible explanation for the CNN and DINOv2 results. Indeed, young children are exposed to geometric shapes, whereas ImageNet-trained CNNs are not.

      We agree with the reviewer’s observation. In fact, new and ongoing work from the lab is also exploring this; we have included in the supplementary materials exactly what the reviewer is suggesting, namely the time course of the correlation with ViT and with ConvNeXT. In line with the reviewer’s prediction, these networks, trained on much larger datasets and with many more parameters, can fit the human data as well as DINOv2. We ran additional analyses of the MEG data with ViT and ConvNeXT, which we now report in Fig. S6 as well as in an additional sentence in that section:

      “[…] similar results were obtained by performing the same analysis, not only with another vision transformer network, ViT, but crucially using a much larger convolutional neural network, ConvNeXT, which comprises ~800M parameters and has been trained on 2B images, likely including many geometric shapes and human drawings. For the sake of completeness, RSA analysis in sensor space of the MEG data with these two models is provided in Fig. S6.”

      We conclude that the size and nature of the training set could be as important as the architecture – but also note that humans do not rely on such a huge training set. We have updated the text, as well as Fig. S6, accordingly by updating the section now entitled “Vision Transformers and Larger Neural Networks”, and the discussion section on theoretical models.
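
The variance partitioning analysis suggested in point (4) can be sketched for the two-model case as follows. This is only an illustration under stated assumptions: it treats the empirical dissimilarities as a vector regressed on two model vectors with an intercept, uses the standard closed form for the R² of a two-predictor regression, and the data and names are invented.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def partition_variance(y, model_a, model_b):
    """Split the variance of y explained by two predictors into unique and shared parts.

    Requires that model_a and model_b are not perfectly collinear.
    """
    r_ya, r_yb = pearson(y, model_a), pearson(y, model_b)
    r_ab = pearson(model_a, model_b)
    r2_a, r2_b = r_ya ** 2, r_yb ** 2
    # R^2 of the regression of y on both predictors plus an intercept.
    r2_full = (r2_a + r2_b - 2 * r_ya * r_yb * r_ab) / (1 - r_ab ** 2)
    return {
        "full": r2_full,
        "unique_a": r2_full - r2_b,       # explained by A beyond B
        "unique_b": r2_full - r2_a,       # explained by B beyond A
        "shared": r2_a + r2_b - r2_full,  # explained by either model
    }

# Toy vectors standing in for flattened dissimilarity matrices (invented numbers).
cnn_like = [1, 2, 3, 4, 5, 6]
geometric_like = [2, 1, 4, 3, 6, 5]
data = [a + b for a, b in zip(cnn_like, geometric_like)]

parts = partition_variance(data, cnn_like, geometric_like)
```

In this toy example the two predictors are correlated, so most of the explained variance is shared and only a small part is unique to each model. That is the pattern one would look for if a third model such as DINOv2 were to capture both kinds of features.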

      (5) The authors may be interested in a recent paper from Arcaro and colleagues that showed that the parietal cortex is greatly expanded in humans (including infants) compared to non-human primates (Meyer et al., 2025), which may explain the stronger geometric reasoning abilities of humans.

      A very interesting article indeed! We have updated our article to incorporate this reference in the discussion, in the section on visual pathways, as follows:

      “Finally, recent work shows that within the visual cortex, the strongest relative difference in growth between human and non-human primates is localized in parietal areas (Meyer et al., 2025). If this expansion reflected the acquisition of new processing abilities in these regions, it might explain the observed differences in geometric abilities between human and non-human primates (Sablé-Meyer et al., 2021).”

      Also, the authors may want to include this paper, which uses a similar oddity task and compellingly shows that crows are sensitive to geometric regularity:

      Schmidbauer, P., Hahn, M., & Nieder, A. (2025). Crows recognize geometric regularity. Science Advances, 11(15), eadt3718. https://doi.org/10.1126/sciadv.adt3718

      We have ongoing discussions with the authors of this work and have prepared a response to their findings (Sablé-Meyer and Dehaene, 2025). Ultimately, we think that this discussion, which we agree is important, does not have its place in the present article. They used a reduced version of our design, with amplified differences in the intruders. While they did not test the fit of their data with CNN or geometric feature models, we did and found that a simple CNN suffices to account for crow behavior. Thus, we disagree that their conclusions follow from their results. But the present article does not seem to be the right platform to engage in this discussion.

      References

      Ayzenberg, V., & Behrmann, M. (2022). The Dorsal Visual Pathway Represents Object-Centered Spatial Relations for Object Recognition. The Journal of Neuroscience, 42(23), 4693-4710. https://doi.org/10.1523/jneurosci.2257-21.2022

      Bonner, M. F., & Epstein, R. A. (2018). Computational mechanisms underlying cortical responses to the affordance properties of visual scenes. PLoS Computational Biology, 14(4), e1006111. https://doi.org/10.1371/journal.pcbi.1006111

      Bueti, D., & Walsh, V. (2009). The parietal cortex and the representation of time, space, number and other magnitudes. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1525), 1831-1840.

      Dehaene, S., & Brannon, E. (2011). Space, time and number in the brain: Searching for the foundations of mathematical thought. Academic Press.

      Freud, E., Culham, J. C., Plaut, D. C., & Behrmann, M. (2017). The large-scale organization of shape processing in the ventral and dorsal pathways. eLife, 6, e27576.

      Freud, E., Ganel, T., Shelef, I., Hammer, M. D., Avidan, G., & Behrmann, M. (2017). Three-dimensional representations of objects in dorsal cortex are dissociable from those in ventral cortex. Cerebral Cortex, 27(1), 422-434.

      Freud, E., Plaut, D. C., & Behrmann, M. (2016). 'What' is happening in the dorsal visual pathway. Trends in Cognitive Sciences, 20(10), 773-784.

      Freud, E., Plaut, D. C., & Behrmann, M. (2019). Protracted developmental trajectory of shape processing along the two visual pathways. Journal of Cognitive Neuroscience, 31(10), 1589-1597.

      Han, Z., & Sereno, A. (2022). Modeling the Ventral and Dorsal Cortical Visual Pathways Using Artificial Neural Networks. Neural Computation, 34(1), 138-171. https://doi.org/10.1162/neco_a_01456

      Janssen, P., Srivastava, S., Ombelet, S., & Orban, G. A. (2008). Coding of shape and position in macaque lateral intraparietal area. Journal of Neuroscience, 28(26), 6679-6690.

      Konen, C. S., & Kastner, S. (2008). Two hierarchically organized neural systems for object information in human visual cortex. Nature Neuroscience, 11(2), 224-231.

      Lescroart, M. D., Stansbury, D. E., & Gallant, J. L. (2015). Fourier power, subjective distance, and object categories all provide plausible models of BOLD responses in scene-selective visual areas. Frontiers in Computational Neuroscience, 9(135), 1-20. https://doi.org/10.3389/fncom.2015.00135

      Mayilvahanan, P., Zimmermann, R. S., Wiedemer, T., Rusak, E., Juhos, A., Bethge, M., & Brendel, W. (2024). In search of forgotten domain generalization. arXiv Preprint arXiv:2410.08258.

      Meyer, E. E., Martynek, M., Kastner, S., Livingstone, M. S., & Arcaro, M. J. (2025). Expansion of a conserved architecture drives the evolution of the primate visual cortex. Proceedings of the National Academy of Sciences, 122(3), e2421585122. https://doi.org/10.1073/pnas.2421585122

      Orban, G. A. (2011). The extraction of 3D shape in the visual system of human and nonhuman primates. Annual Review of Neuroscience, 34, 361-388.

      Romei, V., Driver, J., Schyns, P. G., & Thut, G. (2011). Rhythmic TMS over Parietal Cortex Links Distinct Brain Frequencies to Global versus Local Visual Processing. Current Biology, 21(4), 334-337. https://doi.org/10.1016/j.cub.2011.01.035

      Sereno, A. B., & Maunsell, J. H. R. (1998). Shape selectivity in primate lateral intraparietal cortex. Nature, 395(6701), 500-503. https://doi.org/10.1038/26752

      Summerfield, C., Luyckx, F., & Sheahan, H. (2020). Structure learning and the posterior parietal cortex. Progress in Neurobiology, 184, 101717. https://doi.org/10.1016/j.pneurobio.2019.101717

      Van Dromme, I. C., Premereur, E., Verhoef, B.-E., Vanduffel, W., & Janssen, P. (2016). Posterior Parietal Cortex Drives Inferotemporal Activations During Three-Dimensional Object Vision. PLoS Biology, 14(4), e1002445. https://doi.org/10.1371/journal.pbio.1002445

      Xu, Y. (2018). A tale of two visual systems: Invariant and adaptive visual information representations in the primate brain. Annual Review of Vision Science, 4, 311-336.

      Reviewer #3 (Recommendations for the authors):

      Bring into the discussion some of the issues outlined above, especially a) the spatial rather than visual nature of the geometric figures and b) the non-representational aspects of geometric form.

      We thank the reviewer for their recommendations – see our response to the public review for more details.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This paper presents two experiments, both of which use a target detection paradigm to investigate the speed of statistical learning. The first experiment is a replication of Batterink, 2017, in which participants are presented with streams of uniform-length, trisyllabic nonsense words and asked to detect a target syllable. The results replicate previous findings, showing that learning (in the form of response time facilitation to later-occurring syllables within a nonsense word) occurs after a single exposure to a word. In the second experiment, participants are presented with streams of variable-length nonsense words (two trisyllabic words and two disyllabic words) and perform the same task. A similar facilitation effect was observed as in Experiment 1. The authors interpret these findings as evidence that target detection requires mechanisms different from segmentation. They present results of a computational model to simulate results from the target detection task and find that an "anticipation mechanism" can produce facilitation effects, without performing segmentation. The authors conclude that the mechanisms involved in the target detection task are different from those involved in the word segmentation task.

      Strengths:

      The paper presents multiple experiments that provide internal replication of a key experimental finding, in which response times are facilitated after a single exposure to an embedded pseudoword. Both experimental data and results from a computational model are presented, providing converging approaches for understanding and interpreting the main results. The data are analyzed very thoroughly using mixed effects models with multiple explanatory factors.

      Weaknesses:

      In my view, the main weaknesses of this study relate to the theoretical interpretation of the results.

      (1) The key conclusion from these findings is that the facilitation effect observed in the target detection paradigm is driven by a different mechanism (or mechanisms) than those involved in word segmentation. The argument here I think is somewhat unclear and weak, for several reasons:

      First, there appears to be some blurring in what exactly is meant by the term "segmentation" with some confusion between segmentation as a concept and segmentation as a paradigm.

      Conceptually, segmentation refers to the segmenting of continuous speech into words. However, this conceptual understanding of segmentation (as a theoretical mechanism) is not necessarily what is directly measured by "traditional" studies of statistical learning, which typically (at least in adults) involve exposure to a continuous speech stream followed by a forced-choice recognition task of words versus recombined foil items (part-words or nonwords). To take the example provided by the authors, a participant presented with the sequence GHIABCDEFABCGHI may endorse ABC as being more familiar than BCG, because ABC is presented more frequently together and the learned association between A and B is stronger than between C and G. However, endorsement of ABC over BCG does not necessarily mean that the participant has "segmented" ABC from the speech stream, just as faster reaction times in responding to syllable C versus A do not necessarily indicate successful segmentation. As the authors argue on page 7, "an encounter to a sequence in which two elements co-occur (say, AB) would theoretically allow the learner to use the predictive relationship during a subsequent encounter (that A predicts B)." By the same logic, encoding the relationship between A and B could also allow for the above-chance endorsement of items that contain AB over items containing a weaker relationship.

      Both recognition performance and facilitation through target detection reflect different outcomes of statistical learning. While they may reflect different aspects of the learning process and/or dissociable forms of memory, they may best be viewed as measures of statistical learning, rather than mechanisms in and of themselves.

      Thanks for this nuanced discussion; this is an important point that R2 also raised. We agree that segmentation can refer both to an experimental paradigm and to a mechanism that accounts for learning in that paradigm. In the segmentation paradigm, participants are asked to identify which items they believe to be (whole) words from the continuous syllable stream. In the target-detection paradigm, participants are not asked to identify words from continuous streams; instead, they respond to the occurrences of a certain syllable. It’s possible that learners employ one mechanism in these two tasks, or that they employ separate mechanisms. It’s also the case that, if all we had were positive evidence from both experimental paradigms, i.e., learners succeeding in segmentation tasks as well as in target detection tasks with different types of sequences, we would have no way of talking about different mechanisms: as you correctly suggested, evidence for segmenting AB and for processing B faster following A is not evidence for different mechanisms.

      However, that is not the case. When the syllable sequences contain same-length subsequences (i.e., words), learning is indeed successful in both segmentation and target detection tasks. In contrast, studies such as Hoch et al. (2013) suggest that words from mixed-length sequences are harder to segment than words from uniform-length sequences. This finding exists in adult work (e.g., Hoch et al., 2013) as well as infant work (Johnson & Tyler, 2010), and is replicated here in the newly included Experiment 3. It stands in contrast to the facilitation effect that we observe with mixed-length sequences in the target detection paradigm (one of the main findings of the paper). Thus, if the learning mechanisms were the same, it would be difficult to explain why humans succeed with mixed-length sequences in target detection (as shown in Experiment 2) but have difficulty with mixed-length sequences in segmentation (as shown in Hoch et al. and Experiment 3).

      In our paper, we have clarified these points and describe the separate mechanisms in more detail, in both the Introduction and General Discussion sections.

      (2) The key manipulation between experiments 1 and 2 is the length of the words in the syllable sequences, with words either constant in length (experiment 1) or mixed in length (experiment 2). The authors show that similar facilitation levels are observed across this manipulation in the current experiments. By contrast, they argue that previous findings have found that performance is impaired for mixed-length conditions compared to fixed-length conditions. Thus, a central aspect of the theoretical interpretation of the results rests on prior evidence suggesting that statistical learning is impaired in mixed-length conditions. However, it is not clear how strong this prior evidence is. There is only one published paper cited by the authors - the paper by Hoch and colleagues - that supports this conclusion in adults (other mentioned studies are all in infants, which use very different measures of learning). Other papers not cited by the authors do suggest that statistical learning can occur to stimuli of mixed lengths (Thiessen et al., 2005, using infant-directed speech; Frank et al., 2010 in adults). I think this theoretical argument would be much stronger if the dissociation between recognition and facilitation through RTs as a function of word length variability was demonstrated within the same experiment and ideally within the same group of participants.

      To summarize the evidence of learning uniform-length and mixed-length sequences (which we discussed in the Introduction section), “even though infants and adults alike have shown success segmenting syllable sequences consisting of words that were uniform in length (i.e., all words were either disyllabic; Graf Estes et al., 2007; or trisyllabic, Aslin et al., 1998), both infants and adults have shown difficulty with syllable sequences consisting of words of mixed length (Johnson & Tyler, 2010; Johnson & Jusczyk, 2003a; 2003b; Hoch et al., 2013).” The newly added Experiment 3 also provided evidence for the difference in uniform-length and mixed-length sequences. Notably, we do not agree with the idea that infant work should be disregarded as evidence just because infants were tested with habituation methods; not only were the original findings (Saffran et al. 1996) based on infant work, so were many other studies on statistical learning.

      There are other segmentation studies in the literature that have used mixed-length sequences, which are worth discussing. In short, these studies differ from the Saffran et al. (1996) studies in many important ways, and in our view, these differences explain why the learning was successful. Of interest, the Thiessen et al. (2005) study that you mentioned was based on infant work with infant methods, and demonstrated the very point we argued for: in their study, infants failed to learn when mixed-length sequences were pronounced as adult-directed speech, and succeeded in learning given infant-directed speech, which contained much more pronounced prosodic cues. The fact that infants failed to segment mixed-length sequences without certain prosodic cues is consistent with our claim that mixed-length sequences are difficult to segment in a segmentation paradigm. Another such study is Frank et al. (2010), where continuous sequences were presented in “sentences”. Different numbers of words were concatenated into sentences, with a 500 ms break between sentences in the training sequence. The shortest sentences contained one or two words, and the longest contained 24 words. The results showed that participants are sensitive to sentence boundaries, which coincide with word boundaries. In the extreme, the one-word-per-sentence condition simply presents learners with segmented word forms. In the 24-word-per-sentence condition, there are nevertheless sentence boundaries that are word boundaries, and knowing these word boundaries alone should allow learners to perform above chance in the test phase. Thus, in our view, this demonstrates that learners can use sentence boundaries to infer word boundaries, which is an interesting finding in its own right, but it does not show that a continuous syllable sequence with mixed word lengths is learnable without additional information.

      In summary, to our knowledge, syllable sequences containing mixed word lengths are better learned when additional cues to word boundaries are present, and there is strong evidence that syllable sequences containing uniform word lengths are learned better than mixed-length ones.

      Frank, M. C., Goldwater, S., Griffiths, T. L., & Tenenbaum, J. B. (2010). Modeling human performance in statistical word segmentation. Cognition, 117(2), 107-125.

      To address your proposal of running more experiments to provide stronger evidence for our theory: we had planned to run another study in which the same group of participants would complete both the segmentation and the target detection paradigm, as suggested, but we were unable to do so because we encountered difficulties recruiting English-speaking participants. Instead, we have included a previously unpublished experiment (now Experiment 3) showing the difference between the learning of uniform-length and mixed-length sequences in the segmentation paradigm. This experiment provides further evidence for adults’ difficulties in segmenting mixed-length sequences.

      (3) The authors argue for an "anticipation" mechanism in explaining the facilitation effect observed in the experiments. The term anticipation would generally be understood to imply some kind of active prediction process, related to generating the representation of an upcoming stimulus prior to its occurrence. However, the computational model proposed by the authors (page 24) does not encode anything related to anticipation per se. While it demonstrates facilitation based on prior occurrences of a stimulus, that facilitation does not necessarily depend on active anticipation of the stimulus. It is not clear that it is necessary to invoke the concept of anticipation to explain the results, or indeed that there is any evidence in the current study for anticipation, as opposed to just general facilitation due to associative learning.

      Thanks for raising this point. Indeed, in the reported experiments, the anticipation effect we described is indistinguishable from a general facilitation effect. We have dropped this framing.

      In addition, related to the model, given that only bigrams are stored in the model, could the authors clarify how the model is able to account for the additional facilitation at the 3rd position of a trigram compared to the 2nd position?

      Thanks for the question. We believe it is an empirical question whether there is additional facilitation at the 3rd position of a trigram compared to the 2nd position. To investigate this issue, we conducted the following analysis with data from Experiment 1. First, we combined the data from the two conditions (exact/conceptual) of Experiment 1 to increase statistical power. Next, we ran a mixed-effects regression on data from syllable positions 2 and 3 only (i.e., data from syllable position 1 were not included). The fixed effects included the two-way interaction between syllable position and presentation, as well as stream position; the random effects were a by-subject random intercept and a random slope for stream position. The interaction was significant (χ²(3) = 11.73, p = 0.008), suggesting that there is additional facilitation at the 3rd position compared to the 2nd position.
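
As a quick arithmetic check of the reported test, the p-value for a chi-square statistic of 11.73 with 3 degrees of freedom can be recomputed from the closed-form survival function for that distribution. This sketch assumes that the statistic is an ordinary chi-square (for example from a likelihood-ratio comparison of nested models); the function name is ours.

```python
import math

def chi2_sf_df3(x):
    """Survival function P(X > x) of a chi-square variable with 3 degrees of freedom."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

p_value = chi2_sf_df3(11.73)
print(round(p_value, 3))  # → 0.008
```

The closed form is specific to 3 degrees of freedom; for other values a general chi-square survival function from a statistics library would be needed.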

      For the model, here is an explanation of why it predicts additional facilitation at the 3rd position. In our model, we proposed a simple recursive relation between the RT of a syllable occurring for the nth time and for the (n+1)th time, which is:

      and

      RT(1) = RT0 + stream_pos * stream_inc, where RT(n) is the RT for the n<sup>th</sup> presentation of the target syllable, stream_pos is the position (3-46) in the stream, and occurrence is the number of times the syllable has occurred so far in the stream.

      What this means is that the model provides an RT value for every syllable in the stream. Thus, for a target at syllable position 1, there is an RT value as an unpredictable target, and for targets at syllable position 2, there is a facilitation effect. For targets at syllable position 3, it is facilitated the same amount. As such, there is an additional facilitation effect for syllable position 3 because effects of prediction are recursive.

      (4) In the discussion of transitional probabilities (page 31), the authors suggest that "a single exposure does provide information about the transitions within the single exposure, and the probability of B given A can indeed be calculated from a single occurrence of AB." Although this may be technically true in that a calculation for a single exposure is possible from this formula, it is not consistent with the conceptual framework for calculating transitional probabilities, as first introduced by Saffran and colleagues. For example, Saffran et al. (1996, Science) describe that "over a corpus of speech there are measurable statistical regularities that distinguish recurring sound sequences that comprise words from the more accidental sound sequences that occur across word boundaries. Within a language, the transitional probability from one sound to the next will generally be highest when the two sounds follow one another within a word, whereas transitional probabilities spanning a word boundary will be relatively low." This makes it clear that the computation of transitional probabilities (i.e., Y | X) is conceptualized to reflect the frequency of XY / frequency of X, over a given language inventory, not just a single pair. Phrased another way, a single exposure to pair AB would not provide a reliable estimate of the raw frequencies with which A and AB occur across a given sample of language.

      Thanks for the discussion. We understand your argument, but we respectfully disagree that computing transitional probabilities must be conducted under a certain theoretical framework. In our humble opinion, computing transitional probabilities is a mathematical operation, and as such, it is possible to do so with the least amount of data that enables the mathematical operation, which concretely is a single exposure during learning. While it is true that a single exposure may not provide a reliable estimate of frequencies or probabilities, it does provide information with which the learner can make decisions.

      This is particularly true for topics under discussion regarding the minimal amount of exposure that can enable learning. It is important to distinguish the following two questions: whether learners can learn from a short exposure period (from a single exposure, in fact), and how long an exposure period the learner requires to produce a reliable estimate of frequencies. Incidentally, given that learners can learn from a single exposure based on Batterink (2017) and the current study, it does not appear that learners require a long exposure period to learn about transitional probabilities.
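
To make the arithmetic concrete, the following is a minimal sketch of the frequency-ratio calculation discussed above, P(Y | X) = frequency of XY / frequency of X. The function name and the choice to count X only where it has a successor are assumptions made for illustration; this is not code from the manuscript.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def transitional_probabilities(stream: Sequence[str]) -> Dict[Tuple[str, str], float]:
    """Return P(Y | X) = freq(XY) / freq(X) for every adjacent pair in the stream."""
    pairs = list(zip(stream, stream[1:]))
    pair_counts = Counter(pairs)
    # Assumption: X is counted only where it has a successor,
    # so the final item of the stream never acts as a context.
    context_counts = Counter(x for x, _ in pairs)
    return {(x, y): count / context_counts[x] for (x, y), count in pair_counts.items()}


# A single exposure to the pair AB already yields a defined value.
single = transitional_probabilities(["A", "B"])

# A longer (hypothetical) stream gives estimates based on more observations.
longer = transitional_probabilities(["A", "B", "A", "C", "A", "B"])
```

With a single AB exposure the ratio is defined but rests on one observation, which is the distinction drawn above between computability and reliability.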

      (5) In experiment 2, the authors argue that there is robust facilitation for trisyllabic and disyllabic words alike. I am not sure about the strength of the evidence for this claim, as it appears that there are some conflicting results relevant to this conclusion. Notably, in the regression model for disyllabic words, the omnibus interaction between word presentation and syllable position did not reach significance (p= 0.089). At face value, this result indicates that there was no significant facilitation for disyllabic words. The additional pairwise comparisons are thus not justified given the lack of omnibus interaction. The finding that there is no significant interaction between word presentation, word position, and word length is taken to support the idea that there is no difference between the two types of words, but could also be due to a lack of power, especially given the p-value (p = 0.010).

      Thanks for the comment. Firstly, we believe there is a typo in your comment, where in the last sentence, we believe you were referring to the p-value of 0.103 (source: “The interaction was not significant (χ2(3) = 6.19, p= 0.103”). Yes, a null result with a frequentist approach cannot support a null claim, but Bayesian analyses could potentially provide evidence for the null.

      To this end, we conducted a Bayes factor analysis using the approach outlined in Harms and Lakens (2018), which generates a Bayes factor by computing a Bayesian information criterion for a null model and an alternative model. The alternative model contained a three-way interaction of word length, word presentation, and word position, whereas the null model contained a two-way interaction between word presentation and word position as well as a main effect of word length. Thus, the two models only differ in terms of whether there is a three-way interaction. The Bayes factor is then computed as exp[(BIC<sub>alt</sub> − BIC<sub>null</sub>)/2]. This analysis showed that there is strong evidence for the null, where the Bayes factor was found to be exp(25.65), which is more than 10<sup>11</sup>. Thus, there is no power issue here, and there is strong evidence for the null claim that word length did not interact with other factors in Experiment 2.

      You also raised the issue of whether we should conduct pairwise comparisons if the omnibus interaction did not reach significance. This would be true given the original analysis plan, but we believe that a revised analysis plan makes more sense. In the revised analysis plan for Experiment 2, we start with the three-way interaction (as described in the previous paragraph). The three-way interaction was not significant, and after dropping the three-way interaction term, the two-way interaction and the main effect of word length are both significant, so we use this as the overall model. Testing the omnibus interaction between presentation and syllable position, we found that it was significant (χ<sup>2</sup>(3) =49.77, p<0.001). This shows, within a single model, that the interaction between presentation and syllable position is significant when using data from both disyllabic and trisyllabic words. This was in addition to a significant fixed effect of word length (β=0.018, z=6.19, p<0.001). This should motivate the rest of the planned analysis, which concerns pairwise comparisons in different word length conditions.
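
The BIC-based calculation can be sketched in a few lines. The function name and the two BIC values below are hypothetical, chosen only so that their difference reproduces the exponent 25.65 quoted above; the actual BIC values are not given in this response.

```python
import math


def bf01_from_bic(bic_alt: float, bic_null: float) -> float:
    """BIC approximation to the Bayes factor in favour of the null model."""
    return math.exp((bic_alt - bic_null) / 2.0)


# Hypothetical BIC values: a difference of 51.3 gives the exponent 25.65.
bf01 = bf01_from_bic(bic_alt=1051.3, bic_null=1000.0)
```

Because the alternative model carries extra parameters, its BIC is penalised more heavily, so a larger BIC for the alternative translates into a Bayes factor above 1 in favour of the null.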

      (6) The results plotted in Figure 2 seem to suggest that RTs to the first syllable of a trisyllabic item slow down with additional word presentations, while RTs to the final position speed up. If anything, in this figure, the magnitude of the effect seems to be greater for 1st syllable positions (e.g., the RT difference between presentation 1 and 4 for syllable position 1 seems to be numerically larger than for syllable position 3, Figure 2D). Thus, it was quite surprising to see in the results (p. 16) that RTs for syllable position 1 were not significantly different for presentation 1 vs. the later presentations (but that they were significant for positions 2 and 3 given the same comparison). Is this possibly a power issue? Would there be a significant slowdown to 1st syllables if results from both the exact replication and conceptual replication conditions were combined in the same analysis?

      Thanks for the suggestion and your careful visual inspection of the data. After combining the data, the slowdown to 1st syllables is indeed significant. We have reported this in the results of Experiment 1 (with an acknowledgement to this review):

      Results showed that later presentations took significantly longer to respond to compared to the first presentation (χ<sup>2</sup>(3) = 10.70, p=0.014), where the effect grew larger with each presentation (second presentation: β=0.011, z=1.82, p=0.069; third presentation: β=0.019, z=2.40, p=0.016; fourth presentation: β=0.034, z=3.23, p=0.001).

      (7) It is difficult to evaluate the description of the PARSER simulation on page 36. Perhaps this simulation should be introduced earlier in the methods and results rather than in the discussion only.

      Thanks for the suggestions. We have added two separate simulations in the paper, which should describe the PARSER simulations sufficiently, as well as provide further information on the correspondence between the simulations and the experiments. Thanks again for the great review! We believe our paper has improved significantly as a result.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review):

      This study investigates how ant group demographics influence nest structures and group behaviors of Camponotus fellah ants, a ground-dwelling carpenter ant species (found locally in Israel) that build subterranean nest structures. Using a quasi-2D cell filled with artificial sand, the authors perform two complementary sets of experiments to try to link group behavior and nest structure: first, the authors place a mated queen and several pupae into their cell and observe the structures that emerge both before and after the pupae eclose (i.e., "colony maturation" experiments); second, the authors create small groups (of 5,10, or 15 ants, each including a queen) within a narrow age range (i.e., "fixed demographic" experiments) to explore the dependence of age on construction. Some of the fixed demographic instantiations included a manually induced catastrophic collapse event; the authors then compared emergency repair behavior to natural nest creation. Finally, the authors introduce a modified logistic growth model to describe the time-dependent nest area. The modification introduced parameters that allow for age-dependent behavior, and the authors use their fixed demographic experiments to set these parameters, and then apply the model to interpret the behavior of the colony maturation experiments. The main results of this paper are that for natural nest construction, nest areas, and morphologies depend on the age demographics of ants in the experiments: younger ants create larger nests and angled tunnels, while older ants tend to dig less and build predominantly vertical tunnels; in contrast, emergency response seems to elicit digging in ants of all ages to repair the nest.

      The experimental results are solid, providing new information and important insights into nest and colony growth in a social insect species. As presented, I still have some reservations about the model's contribution to a deeper understanding of the system. Additional context and explanation of the model, implications, and limitations would be helpful for readers.

      We sincerely thank Reviewer #1 for the time and effort dedicated to our manuscript's detailed review and assessment. The new revision suggestions were constructive, and we have provided a point-by-point response to address them.

      Reviewer #2 (Public review):

      I enjoyed this paper and its examination of the relationship between overall density and age polyethism to reduce the computational complexity required to match nest size with population. I had some questions about the requirement that growth is infinite in such a solution, but these have been addressed by the authors in the responses and the updated manuscript. I also enjoyed the discussion of whether collective behaviour is an appropriate framework in systems in which agents (or individuals) differ in the behavioural rules they employ, according to age, location, or information state. This is especially important in a system like social insects, typically held as a classic example of individual-as-subservient to whole, and therefore most likely to employ universal rules of behaviour. The current paper demonstrates a potentially continuous age-related change in target behaviour (excavation), and suggests an elegant and minimal solution to the requirement for building according to need in ants, avoiding the invocation of potentially complex cognitive mechanisms, or information states that all individuals must have access to in order to have an adaptive excavation output.

      The authors have addressed questions I had in the review process and the manuscript is now clear in its communication and conclusions.

      The modelling approach is compelling, also allowing extrapolation to other group sizes and even other species. This to me is the main strength of the paper, as the answer to the question of whether it is younger or older ants that primarily excavate nests could have been answered by an individual tracking approach (albeit there are practical limitations to this, especially in the observation nest setup, as the authors point out). The analysis of the tunnel structure is also an important piece of the puzzle, and I really like the overall study.

      We sincerely thank Reviewer #2 for the time and effort dedicated to our manuscript's detailed review and assessment.  

      Reviewer #1 (Recommendations for the authors):

      Thank you for the modifications. I found much of the additional information very helpful. I do still have a few comments, which I will include below.

      We thank the reviewer for this comment.

      The authors provide some additional citations for the model, however, the ODE in refs 24 and 30 is different from what the authors present here, and different from what is presented in ref 29. Specifically, the additional "volume" term that multiplies the entire equation. Can the authors provide some additional context for their model in comparison to these models as well as how their model relates to other work?

      We thank the reviewer for this question. The primary difference between the logistic model (references 24, 30) and the saturation model (reference 29) lies in their assumptions about how the number of ants actively participating in nest excavation scales with the nest volume.

      The logistic growth model (𝑑𝑉/𝑑𝑡 = α𝑉(1-V/Vs)) describes the excavation in fixed-sized colonies (50, 100, 200) through a balance of two key processes: (1) positive feedback (α𝑉), where the digging efficiency increases with the nest size, and (2) negative feedback (1-V/Vs), where growth slows as the nest approaches a saturation volume (Vs). The model assumes that the number of actively excavating ants is linearly proportional to the nest volume (V). This represents a scenario where a large nest contains or can support more workers, which in turn increases the digging rates. This does not require explicit communication between individuals: ants indirectly sense the global nest volume through stigmergic cues, such as pheromone deposition and encounter rates. The model ignores individual differences in age.

      In contrast, the saturation model (𝑑𝑉/𝑑𝑡 = α(1-V/Vs)) assumes that a constant number of ants is working throughout the excavation. The digging rate is therefore independent of the nest volume, slowing only due to the saturation term (1-V/Vs) as the nest approaches its target size. However, this model imposes a different cognitive requirement: ants must somehow assess the global nest volume (V) and the overall number of ants in the nest. Thus, rather than relying on local cues, ants need more explicit communication or a sophisticated global perception mechanism that allows them to sense the nest volume and the nest population and adjust the digging rates accordingly. Therefore, this model requires a more complex and less biologically plausible mechanism than the logistic model.

      In our age-dependent digging model in the manuscript, we explicitly sum the contribution of each ant towards the nest area expansion based on its age-dependent digging threshold (quantified from the fixed demographics experiments). Thus, the term ‘V’ in ‘𝑉(1-V/Vs)’ plays the same role as the sum over all ants in equation (2) of our manuscript; both describe how the total excavation rate scales with the number of individuals. Under the simplifying assumption that the number of ants is proportional to the nest volume ‘V’, and that all ants dig at a constant rate, our equation (2) in the manuscript reduces to the logistic equation ‘𝑉(1-V/Vs)’. This implies that each ant individually assesses the nest volume and then digs at a rate ‘(1-V/Vs)’.
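
The contrast between the two growth laws can be illustrated numerically. The sketch below integrates a rate with positive feedback, αV(1-V/Vs), and a rate without it, α(1-V/Vs), using a simple forward-Euler scheme. All parameter values, the helper names, and the integration scheme are assumptions chosen for illustration, not values or methods from the cited references.

```python
from typing import Callable, List

V_S = 100.0  # hypothetical saturation volume (arbitrary units)


def integrate(rate: Callable[[float], float], v0: float, t_end: float,
              dt: float = 0.01) -> List[float]:
    """Forward-Euler integration of dV/dt = rate(V); returns the trajectory."""
    n_steps = int(round(t_end / dt))
    trajectory = [v0]
    for _ in range(n_steps):
        v = trajectory[-1]
        trajectory.append(v + dt * rate(v))
    return trajectory


def logistic_rate(v: float, alpha: float = 0.5) -> float:
    """Digging rate with positive feedback: proportional to current volume."""
    return alpha * v * (1.0 - v / V_S)


def saturation_rate(v: float, alpha: float = 5.0) -> float:
    """Digging rate from a constant workforce: only the saturation term varies."""
    return alpha * (1.0 - v / V_S)


logistic = integrate(logistic_rate, v0=1.0, t_end=200.0)
saturation = integrate(saturation_rate, v0=1.0, t_end=200.0)
```

Under these toy settings both curves level off at the same saturation volume, but the logistic curve starts slowly and accelerates, whereas the constant-workforce curve grows fastest at the start. Note that α carries different units in the two rate functions.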

      Thus, we adopted the simpler model from the previously published ones, in which ants individually react to the local density cues and regulate their digging. This approach does not require a global assessment of the nest volume or the number of ants; a local perception of density triggers each ant’s decision to dig, likely modulated by the frequency of social contacts or chemical concentration, which serves as an indicator of the global nest area. The ant compares this locally perceived density to an innate, age-specific threshold. If the perceived local density exceeds its threshold (indicating insufficient area), it digs; otherwise, there is no digging. Thus, excavation dynamics in maturing colonies emerge from this collective response to local density cues, without any individual need to directly assess the global nest volume (V) or having explicit knowledge of the colony size (N).
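
The local threshold rule described above can be sketched as a toy simulation. Everything here is hypothetical: the function names, the step-wise digging increment, the age cut-off of 40 days, and the target areas per ant are invented for illustration and are not the fitted quantities or equation (2) of the manuscript. Ages are held fixed, as in a fixed-demographics group.

```python
from typing import Callable, Sequence


def excavate(ages: Sequence[float],
             target_area_per_ant: Callable[[float], float],
             area0: float = 1.0,
             dig_increment: float = 0.05,
             max_steps: int = 100_000) -> float:
    """Grow the nest area until no ant perceives the density as too high."""
    area = area0
    n_ants = len(ages)
    for _ in range(max_steps):
        density = n_ants / area  # ants per unit area, shared by all ants
        # An ant digs only if density exceeds its own age-specific threshold,
        # taken here as the inverse of its (hypothetical) target area per ant.
        diggers = sum(1 for age in ages
                      if density > 1.0 / target_area_per_ant(age))
        if diggers == 0:
            break
        area += dig_increment * diggers
    return area


def toy_target(age_days: float) -> float:
    """Hypothetical target area per ant: larger for young ants than for old."""
    return 10.0 if age_days < 40 else 5.0


young_area = excavate([10] * 5, toy_target)
old_area = excavate([60] * 5, toy_target)
```

In this toy setting a group of five young ants stops digging at roughly twice the area of five old ants, without any ant needing to know the total nest volume or the colony size separately; only their ratio enters the rule.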

      As suggested by the reviewer, we have added these points to the discussion, contrasting the previously published models with our age-dependent excavation models (line numbers: 283-290) “In our study, we adopted the simpler version of previously published age-independent excavation models, where individuals respond to local stigmergic cues such as encounter rates or pheromone concentrations, which serve as a proxy for the global nest volume (24,30). We minimally modified this model to include age-dependent density targets. According to our age-dependent digging model, each ant compares this perceived local density to its own innate age-specific digging threshold as quantified from the fixed demographics experiments. If the perceived local density exceeds its age-dependent area threshold (indicating insufficient area), it digs; otherwise, there is no digging. This mechanism eliminates the need for cognitively demanding global assessment of the total nest volume or the overall colony population, a requirement for the saturation model (29)”. 

      I still find it a little concerning that the age-independent model, though it cannot be correct, fits the data better than the age-dependent modification. It seems to me the models presented in refs 24, 29, and 30, which served as inspiration for the one presented here, do not have any deep theoretical origin, but were chosen for "being consistent with" the observed overall excavated volumes. Is this correct, and if so, how much can/should be gleaned about behavior from these models? Please provide some discussion of what is reasonable to expect from such a model as well as what the limitations might be.

      We thank the reviewer for the comment. 

      In our study, we make an important assumption, described in lines 161-164 of the manuscript, that ants rely on local cues during nest excavation, and that individuals cannot distinguish between the fixed demographics and colony maturation conditions. This implies that the age-dependent target area identified in the fixed demographics experiments should also account for the excavation dynamics seen in the colony maturation experiments.

      From the fixed demographics young and old experiments, we directly quantified that the younger ants excavate a significantly larger area than the older ants for the same group size. This age-dependent digging propensity is an experimental result, and not a model output. 

      We agree that the age-independent model fits the colony maturation experiments well, even though it is not a statistically better fit than the age-dependent model. However, the age-independent models in references 24, 29, and 30 fail to explain the empirically obtained excavation dynamics in the fixed demographics young and old colonies. If these models were true, we would have observed similar excavated areas in the colony maturation, fixed demographics young, and fixed demographics old colonies of the same size. Thus, the inconsistency of these models confirms that age-independent assumptions are biologically inadequate. These details are explicitly mentioned in lines 304-309.

      We believe that our model’s value is in providing a plausible explanation for the observed excavation dynamics in the colony maturation experiments, and in generating testable predictions (Figures 4C and 4D, described in lines 199-216) about the percentage contribution of different age cohorts and queens to the excavated area in the colony maturation experiments. This prediction would not be possible with an age-independent model.

      Minor comments:

      Figure 2A: Please use a color other than white for the model... this curve is still very hard to see

      We thank the reviewer for the comment. The colour is changed to yellow. 

      Figure 4A: Should quoted confidence intervals for slope and intercept be swapped?

      Yes, we thank the reviewer for pointing this out. The labels for the slope and intercept were swapped. We corrected this in the current revised version 2. 

      Figure 5 D-F: Can the authors show data points and confidence intervals instead of bar graphs? The error bars dipping below zero do not clearly represent the data.

      We thank the reviewer for the comment. We now show the individual data points from each treatment with the 95% Confidence Interval of the mean.

    1. Author response:

      We sincerely thank the reviewers and editors for their thoughtful evaluations of our work. We are grateful for the careful reading, constructive critiques, and encouraging comments regarding the electrophysiological analyses, mutagenesis, and vascular experiments. The suggestions provided have been very helpful, and we are working to address these points in our revision to strengthen the manuscript and improve its clarity.

      In revising the manuscript, we plan to clarify several text passages as recommended by the reviewers, and review and refine the discussion for improved precision. Following the suggestions of the reviewers, we plan to perform a number of additional experiments to provide more data for the binding region and for further mechanistic and physiological insight. We will prepare a point-by-point response addressing all issues raised in a detailed rebuttal. Additionally, we will include improvements in the Methods section as suggested by the SciScore core report.

      We appreciate the opportunity to revise our work and thank the reviewers again for their valuable feedback.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      In this manuscript, the authors used a coarse-grained DNA model (cgNA+) to explore how DNA sequences and CpG methylation/hydroxymethylation influence nucleosome wrapping energy and the probability density of optimal nucleosomal configuration. Their findings indicate that both methylated and hydroxymethylated cytosines lead to increased nucleosome wrapping energy. Additionally, the study demonstrates that methylation of CpG islands increases the probability of nucleosome formation.

      Strengths:

      The major strength of this method is that the model explicitly includes phosphate groups as DNA-histone binding site constraints, enhancing CG model accuracy and computational efficiency and allowing comprehensive calculations of DNA mechanical properties and deformation energies.

      Weaknesses:

      A significant limitation of this study is that the parameter sets for the methylated and hydroxymethylated CpG steps in the cgNA+ model are derived from all-atom molecular dynamics (MD) simulations that use previously established force field parameters for modified cytosines (Pérez A, et al. Biophys J. 2012; Battistini, et al. PLOS Comput Biol. 2021). These parameters suggest that both methylated and hydroxymethylated cytosines increase DNA stiffness and nucleosome wrapping energy, which could predispose the coarse-grained model to replicate these findings. Notably, conflicting results from other all-atom MD simulations, such as those by Ngo T in Nat. Commun. 2016, show that hydroxymethylated cytosines increase DNA flexibility, contrary to methylated cytosines. If the cgNA+ model were trained on these latter parameters or other all-atom MD force fields, different conclusions might be obtained regarding the effects of methylation and hydroxymethylation on nucleosome formation.

      Despite the training parameters of the cgNA+ model, the results presented in the manuscript indicate that methylated cytosines increase both DNA stiffness and nucleosome wrapping energy. However, when comparing nucleosome occupancy scores with predicted nucleosome wrapping energies and optimal configurations, the authors find that methylated CGIs exhibit higher nucleosome occupancies than unmethylated ones, which seems to contradict the expected relationship where increased stiffness should reduce nucleosome formation affinity. In the manuscript, the authors also admit that these conclusions “apparently runs counter to the (perhaps naive) intuition that high nucleosome forming affinity should arise for fragments with low wrapping energy”. Previous all-atom MD simulations (Pérez A, et al. Biophys J. 2012; Battistini, et al. PLOS Comput Biol. 2021; Ngo T, et al. Nat. Commun. 2016) show that the stiffer DNA upon CpG methylation reduces the affinity of DNA to assemble into nucleosomes or destabilizes nucleosomes. Given these findings, the authors need to address and reconcile these seemingly contradictory results, as the influence of epigenetic modifications on DNA mechanical properties and nucleosome formation are critical aspects of their study.

      Understanding the influence of sequence-dependent and epigenetic modifications of DNA on mechanical properties and nucleosome formation is crucial for comprehending various cellular processes. The authors’ study, focusing on these aspects, definitely will garner interest from the DNA methylation research community.

      Training the cgNA+ model on alternative MD simulation datasets is certainly of interest to us. However, due to the significant computational cost, this remains a goal for future work. The relationship between nucleosome occupancy scores and nucleosome wrapping energy is still debated, as noted in our Discussion section. The conflicting results may reflect differences in experimental conditions and the contribution of cellular factors other than DNA mechanics to nucleosome formation in vivo. For instance, Pérez et al. (2012), Battistini et al. (2021), and Ngo et al. (2016) concluded that DNA methylation reduces nucleosome formation based on experiments with modified Widom 601 sequences. In contrast, the genome-wide methylation study by Collings and Anderson (2017) found the opposite effect. In our work, we also use whole-genome nucleosome occupancy data.

      Comments on revised version:

      The authors have addressed most of my comments and concerns regarding this manuscript.

      Reviewer #2 (Public Review):

      Summary:

      This study uses a coarse-grained model for double stranded DNA, cgNA+, to assess nucleosome sequence affinity. cgNA+ coarse-grains DNA on the level of bases and accounts also explicitly for the positions of the backbone phosphates. It has been proven to reproduce all-atom MD data very accurately. It is also ideally suited to be incorporated into a nucleosome model because it is known that DNA is bound to the protein core of the nucleosome via the phosphates.

      It is still unclear whether this harmonic model parametrized for unbound DNA is accurate enough to describe DNA inside the nucleosome. Previous models by other authors, using more coarse-grained models of DNA, have been rather successful in predicting base pair sequence dependent nucleosome behavior. This is at least the case as long as DNA shape is concerned whereas assessing the role of DNA bendability (something this paper focuses on) has been consistently challenging in all nucleosome models to my knowledge.

      It is thus of major interest whether this more sophisticated model is also more successful in handling this issue. As far as I can tell the work is technically sound and properly accounts for not only the energy required in wrapping DNA but also entropic effects, namely the change in entropy that DNA experiences when going from the free state to the bound state. The authors make an approximation here which seems to me to be a reasonable first step.

      Of interest is also that the authors have the parameters at hand to study the effect of methylation of CpG-steps. This is especially interesting as this allows to study a scenario where changes in the physical properties of base pair steps via methylation might influence nucleosome positioning and stability in a cell-type specific way.

      Overall, this is an important contribution to the questions of how sequence affects nucleosome positioning and affinity. The findings suggest that cgNA+ has something new to offer. But the problem is complex, also on the experimental side, so many questions remain open. Despite this, I highly recommend publication of this manuscript.

      Strengths:

      The authors use their state-of-the-art coarse grained DNA model which seems ideally suited to be applied to nucleosomes as it accounts explicitly for the backbone phosphates.

      Weaknesses:

      The authors introduce penalty coefficients c<sub>i</sub> to avoid steric clashes between the two DNA turns in the nucleosome. This requires c<sub>i</sub>-values that are so high that standard deviations in the fluctuations of the simulation are smaller than in the experiments.

      Indeed, smaller c<sub>i</sub> values lead to steric clashes between the two turns of DNA. A possible improvement of our optimisation method and a direction of future work would be adding a penalty which prevents steric clashes to the objective function. Then the c<sub>i</sub> values could be reduced to have bigger fluctuations that are even closer to the experimental structures.

      Reviewer #3 (Public Review):

      Summary:

      In this study, authors utilize biophysical modeling to investigate differences in free energies and nucleosomal configuration probability density of CpG islands and nonmethylated regions in the genome. Toward this goal, they develop and apply the cgNA+ coarse-grained model, an extension of their prior molecular modeling framework.

      Strengths:

      The study utilizes biophysical modeling to gain mechanistic insight into nucleosomal occupancy differences in CpG and nonmethylated regions in the genome.

      Weaknesses:

      Although the overall study is interesting, the manuscript needs more clarity in places. Moreover, the rationale and conclusion for some of the analyses are not well described.

      We have revised the manuscript in accordance with the reviewer’s latest suggestions.

      Comments on revised version:

      Authors have attempted to address previously raised concerns.

      Reviewer #1 (Recommendations for the authors):

      The authors have addressed most of my comments and concerns regarding this manuscript. Among them, the most significant pertains to fitting the coarse-grained model using a different all-atom force field to verify the conclusions. The authors acknowledged this point but noted the computational cost involved and proposed it as a direction for future work. Overall, I recommend the revised version for publication.

      Reviewer #2 (Recommendations for the authors):

      My previous comments were addressed satisfactorily.

      Reviewer #3 (Recommendations for the authors):

      Authors have attempted to address previously raised concerns. However, some concerns listed below remain that need to be addressed.

      (1) The first reviewer makes a valid point regarding the reconciliation of conflicting observations related to nucleosome-forming affinity and wrapping energy. Unfortunately, the authors don’t seem to address this and state that this will be the goal of a future study.

      Training the cgNA+ model on alternative MD simulation datasets remains future work. However, we revised the Discussion section to more clearly address the conflicting experimental findings in the literature on how DNA methylation influences nucleosome formation.

      (2) Please report the effect size and statistical significance value for Figures 7 and 8, as this information is currently not provided, despite the authors’ claim that these observations are statistically significant.

      This information is now presented in Supplementary Tables S1-S4.

      (3) In response to the discrepancy in cell lines for correlating nucleosome occupancy and methylation analyses, the authors claim that there is no publicly available nucleosome occupancy and methylation data for a human cell type within the human genome. This claim is confusing, as the GM12878 cell line has been extensively characterized with MNase-seq and WGBS.

      We thank the reviewer for this remark. We have removed the statement regarding the lack of data from the manuscript; we intend to examine the suggested cell line in future research.

      (4) In response to my question, the authors claimed that they selected regions from chromosome 1 exclusively; however, the observation remains unchanged when considering sequence samples from different genomic regions. They should provide examples from different chromosomes as part of the supplementary information to further support this.

      The examples of corresponding plots for other nucleosomes are now shown in Supplementary Figure S9.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      In this manuscript, Domingo et al. present a novel perturbation-based approach to experimentally modulate the dosage of genes in cell lines. Their approach is capable of gradually increasing and decreasing gene expression. The authors then use their approach to perturb three key transcription factors and measure the downstream effects on gene expression. Their analysis of the dosage response curve of downstream genes reveals marked non-linearity.

      One of the strengths of this study is that many of the perturbations fall within the physiological range for each cis gene. This range is presumably between a single-copy state of heterozygous loss-of-function (log fold change of -1) and a three-copy state (log fold change of ~0.6). This is in contrast with CRISPRi or CRISPRa studies that attempt to maximize the effect of the perturbation, which may result in downstream effects that are not representative of physiological responses.
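
      The two bounds quoted above follow from simple arithmetic. The sketch below is illustrative only: it assumes a two-copy baseline, expression that scales linearly with copy number, and log2 units, and the function name is hypothetical.

```python
import math


def log2_fold_change(copies, baseline_copies=2):
    """log2 expression change relative to the two-copy state.

    Assumes expression is proportional to gene copy number.
    """
    return math.log2(copies / baseline_copies)


# Heterozygous loss of function: one copy instead of two.
one_copy = log2_fold_change(1)
# Single-copy gain: three copies instead of two.
three_copies = log2_fold_change(3)

print(f"one copy:     {one_copy:+.3f}")
print(f"three copies: {three_copies:+.3f}")
```

      Under these assumptions the one-copy state sits at exactly -1 and the three-copy state at log2(1.5), which is the value of roughly 0.6 cited in the review.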

      Another strength of the study is that various points along the dosage-response curve were assayed for each perturbed gene. This allowed the authors to effectively characterize the degree of linearity and monotonicity of each dosage-response relationship. Ultimately, the study revealed that many of these relationships are non-linear, and that the response to activation can be dramatically different than the response to inhibition.

      To test their ability to gradually modulate dosage, the authors chose to measure three transcription factors and around 80 known downstream targets. As the authors themselves point out in their discussion about MYB, this biased sample of genes makes it unclear how this approach would generalize genome-wide. In addition, the data generated from this small sample of genes may not represent genome-wide patterns of dosage response. Nevertheless, this unique data set and approach represent a first step in understanding dosage-response relationships between genes.

      Another point of general concern in such screens is the use of the immortalized K562 cell line. It is unclear how the biology of these cell lines translates to the in vivo biology of primary cells. However, the authors do follow up with cell-type-specific analyses (Figures 4B, 4C, and 5A) to draw a correspondence between their perturbation results and the relevant biology in primary cells and complex diseases.

      The conclusions of the study are generally well supported with statistical analysis throughout the manuscript. As an example, the authors utilize well-known model selection methods to identify when there was evidence for non-linear dosage response relationships.
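
      As an illustration of the kind of model selection mentioned here, the sketch below compares a straight line with a curved alternative using the Akaike information criterion (AIC). It is not the authors' pipeline: the quadratic alternative, the least-squares form of AIC, the synthetic saturating data, and all function names are assumptions made for the example.

```python
import math


def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients ordered from the constant term upward.
    """
    n_coef = degree + 1
    # Augmented matrix [X^T X | X^T y] for the polynomial design matrix.
    aug = [
        [sum(xi ** (r + c) for xi in x) for c in range(n_coef)]
        + [sum((xi ** r) * yi for xi, yi in zip(x, y))]
        for r in range(n_coef)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(n_coef):
        pivot = max(range(col, n_coef), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n_coef):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n_coef + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coef = [0.0] * n_coef
    for r in reversed(range(n_coef)):
        tail = sum(aug[r][c] * coef[c] for c in range(r + 1, n_coef))
        coef[r] = (aug[r][n_coef] - tail) / aug[r][r]
    return coef


def aic(x, y, coef):
    """AIC for a least-squares fit with Gaussian errors: n*ln(RSS/n) + 2k."""
    n = len(x)
    rss = sum(
        (yi - sum(c * xi ** p for p, c in enumerate(coef))) ** 2
        for xi, yi in zip(x, y)
    )
    k = len(coef) + 1  # regression coefficients plus the error variance
    return n * math.log(rss / n) + 2 * k


# Synthetic example: a trans gene response that saturates as dosage rises.
cis_lfc = [-1.0, -0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6]
trans_lfc = [1.0 - math.exp(-3.0 * (x + 1.0)) for x in cis_lfc]

aic_linear = aic(cis_lfc, trans_lfc, fit_polynomial(cis_lfc, trans_lfc, 1))
aic_curved = aic(cis_lfc, trans_lfc, fit_polynomial(cis_lfc, trans_lfc, 2))
delta_aic = aic_linear - aic_curved  # positive values favour the curved model

print(f"delta AIC (linear - quadratic): {delta_aic:.1f}")
```

      A positive difference means the extra parameter of the curved model is justified by the improvement in fit. With real data the x-values also carry estimation error, which this sketch ignores.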

      Gradual modulation of gene dosage is a useful approach to model physiological variation in dosage. Experimental perturbation screens that use CRISPR inhibition or activation often use guide RNAs targeting the transcription start site to maximize their effect on gene expression. Generating a physiological range of variation will allow others to better model physiological conditions.

      There is broad interest in the field to identify gene regulatory networks using experimental perturbation approaches. The data from this study provides a good resource for such analytical approaches, especially since both inhibition and activation were tested. In addition, these data provide a nuanced, continuous representation of the relationship between effectors and downstream targets, which may play a role in the development of more rigorous regulatory networks.

      Human geneticists often focus on loss-of-function variants, which represent natural knock-down experiments, to determine the role of a gene in the biology of a trait. This study demonstrates that dosage response relationships are often non-linear, meaning that the effect of a loss-of-function variant may not necessarily carry information about increases in gene dosage. For the field, this implies that others should continue to focus on both inhibition and activation to fully characterize the relationship between gene and trait.

      We thank the reviewer for their thoughtful and thorough evaluation of our study. We appreciate their recognition of the strengths of our approach, particularly the ability to modulate gene dosage within a physiological range and to capture non-linear dosage-response relationships. We also agree with the reviewer’s points regarding the limitations of gene selection and the use of K562 cells, and we are encouraged that the reviewer found our follow-up analyses and statistical framework to be well-supported. We believe this work provides a valuable foundation for future genome-wide applications and more physiologically relevant perturbation studies.

      Reviewer #2 (Public review):

      Summary:

      This work investigates transcriptional responses to varying levels of transcription factors (TFs). The authors aim for gradual up- and down-regulation of three transcription factors GFI1B, NFE2, and MYB in K562 cells, by using a CRISPRa- and a CRISPRi line, together with sgRNAs of varying potency. Targeted single-cell RNA sequencing is then used to measure gene expression of a set of 90 genes, which were previously shown to be downstream of GFI1B and NFE2 regulation. This is followed by an extensive computational analysis of the scRNA-seq dataset. By grouping cells with the same perturbations, the authors can obtain groups of cells with varying average TF expression levels. The achieved perturbations are generally subtle, not reaching half or double doses for most samples, and up-regulation is generally weak, below 1.5-fold in most cases. Even in this small range, many target genes exhibit a non-linear response. Since this is rather unexpected, it is crucial to rule out technical reasons for these observations.

      We thank the reviewer for their detailed and thoughtful assessment of our work. We are encouraged by their recognition of the strengths of our study, including the value of quantitative CRISPR-based perturbation coupled with single-cell transcriptomics, and its potential to inform gene regulatory network inference. Below, we address each of the concerns raised:

      Strengths:

      The work showcases how a single dataset of CRISPRi/a perturbations with scRNA-seq readout and an extended computational analysis can be used to estimate transcriptome dose responses, a general approach that likely can be built upon in the future.

      Weaknesses:

      (1) The experiment was only performed in a single replicate. In the absence of an independent validation of the main findings, the robustness of the observations remains unclear.

      We acknowledge that our study was performed in a single pooled experiment. While additional replicates would certainly strengthen the findings, in high-throughput single-cell CRISPR screens, individual cells with the same perturbation serve as effective internal replicates. This is a common practice in the field. Nevertheless, we agree that biological replicates would help control for broader technical or environmental effects.

      (2) The analysis is based on the calculation of log-fold changes between groups of single cells with non-targeting controls and those carrying a guide RNA driving a specific knockdown. How the fold changes were calculated exactly remains unclear, since it is only stated that the FindMarkers function from the Seurat package was used, which is likely not optimal for quantitative estimates. Furthermore, differential gene expression analysis of scRNA-seq data can suffer from data distortion and mis-estimations (Heumos et al. 2023 (https://doi.org/10.1038/s41576-023-00586-w), Nguyen et al. 2023 (https://doi.org/10.1038/s41467-023-37126-3)). In general, the pseudo-bulk approach used is suitable, but the correct treatment of drop-outs in the scRNA-seq analysis is essential.

      We thank the reviewer for highlighting recent concerns in the field. A study benchmarking association testing methods for perturb-seq data found that among existing methods, Seurat’s FindMarkers function performed the best (T. Barry et al. 2024).

      In the revised Methods, we now specify the formula used to calculate fold change and clarify that the estimates are derived from the Wilcoxon test implemented in Seurat’s FindMarkers function. We also employed pseudo-bulk grouping to mitigate single-cell noise and dropout effects.

      (3) Two different cell lines are used to construct dose-response curves, where a CRISPRi line allows gene down-regulation and the CRISPRa line allows gene upregulation. Although both lines are derived from the same parental line (K562) the expression analysis of Tet2, which is absent in the CRISPRi line, but expressed in the CRISPRa line (Figure S3A) suggests substantial clonal differences between the two lines. Similarly, the PCA in S4A suggests strong batch effects between the two lines. These might confound this analysis.

      We agree that baseline differences between CRISPRi and CRISPRa lines could introduce confounding effects if not appropriately controlled for. We emphasize that all comparisons are made as fold changes relative to non-targeting control (NTC) cells within each line, thereby controlling for batch- and clone-specific baseline expression. See Figures S4A and S4B.
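
      The within-line normalization described in this response can be sketched with toy numbers. The expression values and the function name below are hypothetical; the point is only that dividing by the NTC mean of the same line removes a line-specific baseline offset, whereas a cross-line comparison would not.

```python
import math


def log2fc_vs_ntc(perturbed_mean, ntc_mean):
    """log2 fold change of a perturbed group relative to its NTC group."""
    return math.log2(perturbed_mean / ntc_mean)


# Hypothetical mean expression of one gene; the two lines differ 4-fold at baseline.
lines = {
    "CRISPRi": {"ntc": 10.0, "sgRNA": 5.0},
    "CRISPRa": {"ntc": 40.0, "sgRNA": 60.0},
}

# Each line is compared only with its own NTC cells.
within_line = {
    name: log2fc_vs_ntc(values["sgRNA"], values["ntc"])
    for name, values in lines.items()
}

# Comparing across lines mixes the perturbation effect with the baseline offset.
cross_line = log2fc_vs_ntc(lines["CRISPRa"]["sgRNA"], lines["CRISPRi"]["ntc"])

print(within_line)
print(f"cross-line comparison: {cross_line:.3f}")
```

      This removes a constant baseline difference between the lines. It does not, by itself, address the possibility that the two clones respond differently to the same change in dosage.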

      (4) The study uses pseudo-bulk analysis to estimate the relationship between TF dose and target gene expression. This requires a system that allows quantitative changes in TF expression. The data provided does not convincingly show that this condition is met, which however is an essential prerequisite for the presented conclusions. Specifically, the data shown in Figure S3A shows that upon stronger knock-down, a subpopulation of cells appears, where the targeted TF is not detected anymore (drop-outs). Also Figure 3B (top) suggests that the knock-down is either subtle (similar to NTCs) or strong, but intermediate knock-down (log2-FC of 0.5-1) does not occur. Although the authors argue that this is a technical effect of the scRNA-seq protocol, it is also possible that this represents a binary behavior of the CRISPRi system. Previous work has shown that CRISPRi systems with the KRAB domain largely result in binary repression and not in gradual down-regulation as suggested in this study (Bintu et al. 2016 (https://doi.org/10.1126/science.aab2956), Noviello et al. 2023 (https://doi.org/10.1038/s41467-023-38909-4)).

      Figure S3A shows normalized expression values, not fold changes. A pseudobulk approach reduces single-cell noise and dropout effects. To test whether dropout events reflect true binary repression or technical effects, we compared trans-effects across cells with zero versus low-but-detectable target gene expression (Figure S3B). These effects were highly concordant, supporting the interpretation that dropout is largely technical in origin. We agree that KRAB-based repression can exhibit binary behavior in some contexts, but our data suggest that cells with intermediate repression exist and are biologically meaningful. In ongoing unpublished work, we pursue further analysis of these data at the single cell level, and show that for nearly all guides the dosage effects are indeed gradual rather than driven by binary effects across cells.

      (5) One of the major conclusions of the study is that non-linear behavior is common. This is not surprising for gene up-regulation, since gene expression will reach a plateau at some point, but it is surprising to be observed for many genes upon TF down-regulation. Specifically, here the target gene responds to a small reduction of TF dose but shows the same response to a stronger knock-down. It would be essential to show that this observation does not arise from the technical concerns described in the previous point and it would require independent experimental validations.

      This phenomenon, in which relatively small changes in cis gene dosage produce trans gene responses that exceed the magnitude of the cis gene perturbation, is not unique to our study. It also makes biological sense, since transcription factors are known to be highly dosage sensitive and generally show a smaller range of variation than many of the genes they regulate. Empirically, these effects have been observed in previous CRISPR perturbation screens conducted in K562 cells, including those by Morris et al. (2023), Gasperini et al. (2019), and Replogle et al. (2022), to name only a few studies whose data our lab has examined directly.

      (6) One of the conclusions of the study is that guide tiling is superior to other methods such as sgRNA mismatches. However, the comparison is unfair, since different numbers of guides are used in the different approaches. Relatedly, the authors point out that tiling sometimes surpassed the effects of TSS-targeting sgRNAs, however, this was the least fair comparison (2 TSS vs 10 tiling guides) and additionally depends on the accurate annotation of TSS in the relevant cell line.

      We do not draw this conclusion simply from the range achieved but from a more holistic assessment. We would like to clarify that the number of sgRNAs used in each approach is proportional to the number of base pairs that can be targeted in each region: while the TSS-targeting strategy is typically constrained to a small window of a few dozen base pairs, tiling covers multiple kilobases upstream and downstream, resulting in more guides by design rather than by experimental bias. The guides with mismatches do not perform well for gradual upregulation.

      We would also like to point out that the observation that the strongest effects can arise from regions outside the annotated TSS is not unique to our study and has been demonstrated in prior work (referenced in the text).

      To address this concern, we have revised the text to clarify that we do not consider guide tiling to be inherently superior to other approaches such as sgRNA mismatches. Rather, we now describe tiling as a practical and straightforward strategy to obtain a wide range of gene dosage effects without requiring prior knowledge beyond the approximate location of the TSS. We believe this rephrasing more accurately reflects the intent and scope of our comparison.

      (7) Did the authors achieve their aims? Do the results support the conclusions?: Some of the most important conclusions are not well supported because they rely on accurately determining the quantitative responses of trans genes, which suffers from the previously mentioned concerns.

      We appreciate the reviewer’s concern, but we would have wished for a more detailed characterization of which conclusions are not supported, given that we believe our approach actually accounts for the major concerns raised above. We believe that the observation of non-linear effects is a robust conclusion that is also consistent with known biology, with this paper introducing new ways to analyze this phenomenon.

      (8) Discussion of the likely impact of the work on the field, and the utility of the methods and data to the community:

      Together with other recent publications, this work emphasizes the need to study transcription factor function with quantitative perturbations. Missing documentation of the computational code repository reduces the utility of the methods and data significantly.

      Documentation is included as inline comments within the R code files to guide users through the analysis workflow.

      Reviewer #1 (Recommendations for the authors):

      In Figure 3C (and similar plots of dosage response curves throughout the manuscript), we initially misinterpreted the plots because we assumed that the zero log fold change on the horizontal axis was in the middle of the plot. This gives the incorrect interpretation that the trans genes are insensitive to loss of GFI1B in Figure 3C, for instance. We think it may be helpful to add a line to mark the zero log fold change point, as was done in Figure 3A.

      We thank the reviewer for this helpful suggestion. To improve clarity, we have added a vertical line marking the zero log fold change point in Figure 3C and all similar dosage-response plots. We agree this makes the plots easier to interpret at a glance.

      Similarly, for heatmaps in the style of Figure 3B, it may be nice to have a column for the non-targeting controls, which should be a white column between the perturbations that increase versus decrease GFI1B.

      We appreciate the suggestion. However, because all perturbation effects are computed relative to the non-targeting control (NTC) cells, explicitly including a separate column for NTC in the heatmap would add limited interpretive value and could unnecessarily clutter the figure. For clarity, we have emphasized in the figure legend that the fold changes are relative to the NTC baseline.

      We found it challenging to assess the degree of uncertainty in the estimation of log fold changes throughout the paper. For example, the authors state the following on line 190: "We observed substantial differences in the effects of the same guide on the CRISPRi and CRISPRa backgrounds, with no significant correlation between cis gene fold-changes." This claim was challenging to assess because there are no horizontal or vertical error bars on any of the points in Figure 2A. If the log fold change estimates are very noisy, the data could be consistent with noisy observations of a correlated underlying process. Similarly, to our understanding, the dosage response curves are fit assuming that the cis log fold changes are fixed. If there is excessive noise in the estimation of these log fold changes, it may bias the estimated curves. It may be helpful to give an idea of the amount of estimation error in the cis log fold changes.

      We agree that assessing the uncertainty in log fold change estimates is important for interpreting both the lack of correlation between CRISPRi and CRISPRa effects (Figure 2A) and the robustness of the dosage-response modeling.

      In response, we have now updated Figure 2A to include both vertical and horizontal error bars, representing the standard errors of the log2 fold-change estimates for each guide in the CRISPRi and CRISPRa conditions. These error estimates were computed based on the differential expression analysis performed using the FindMarkers function in Seurat, which models gene expression differences between perturbed and control cells. We also now clarify this in the figure legend and methods.

      The authors mention hierarchical clustering on line 313, which identified six clusters. Although a dendrogram is provided, these clusters are not displayed in Figure 4A. We recommend displaying these clusters alongside the dendrogram.

      We have added colored bars indicating the clusters to improve the clarity. Thank you for the suggestion.

      In Figures 4B and 4C, it was not immediately clear what some of the gene annotations meant. For example, neither the text nor the figure legend discusses what "WBCs", "Platelets", "RBCs", or "Reticulocytes" mean. It would be helpful to include this somewhere other than only the methods to make the figure more clear.

      To improve clarity, we have updated the figure legends for Figures 4B and 4C to explicitly define these abbreviations.

      We struggled to interpret Figure 4E. Although the authors focus on the association of MYB with pHaplo, we would have appreciated some general discussion about the pattern of associations seen in the figure and what the authors expected to observe.

      We have changed the paragraph to add more exposition and clarification:

      “The link between selective constraint and response properties is most apparent in the MYB trans network. Specifically, the probability of haploinsufficiency (pHaplo) shows a significant negative correlation with the dynamic range of transcriptional responses (Figure 4G): genes under stronger constraint (higher pHaplo) display smaller dynamic ranges, indicating that dosage-sensitive genes are more tightly buffered against changes in MYB levels. This pattern was not reproduced in the other trans networks (Figure 4E)”.

      Line 71: potentially incorrect use of "rending" and incorrect sentence grammar.

      Fixed.

      Line 123: "co-expression correlation across co-expression clusters" - authors may not have intended to use "co-expression" twice.

      Original sentence was correct.

      Line 246: "correlations" is used twice in "correlations gene-specific correlations."

      Fixed.

      Reviewer #2 (Recommendations for the authors):

      (1) To show that the approach indeed allows gradual down-regulation it would be important to quantify the knock-down strength with a single-cell readout for a subset of sgRNAs individually (e.g. flowFISH/protein staining flow cytometry).

      We agree that single-cell validation of knockdown strength using orthogonal approaches such as flowFISH or protein staining would provide additional support. However, such experiments fall outside the scope of the current study and are not feasible at this stage. We note that the observed transcriptomic changes and dosage responses across multiple perturbations are consistent with effective and graded modulation of gene expression.

      (2) Similarly, an independent validation of the observed dose-response relationships, e.g. with individual sgRNAs, can be helpful to support the conclusions about non-linear responses.

      Fig. S4C includes replication of trans-effects for a handful of guides used both in this study and in Morris et al. While further orthogonal validation of dose-response relationships would be valuable, such extensive additional work is not currently feasible within the scope of this study. Nonetheless, the high degree of replication in Fig. S4C as well as consistency of patterns observed across multiple sgRNAs and target genes provides strong support for the conclusions drawn from our high-throughput screen.

      (3) The calculation of the log2 fold changes should be documented more precisely. To perform a pseudo-bulk analysis, the raw UMI counts should be summed up in each group (NTC, individual targeting sgRNAs), including zero counts, then the data should be normalized and the fold change should be calculated. The DESeq package for example would be useful here.

      We have updated the methods in the manuscript to provide more exposition of how the logFC was calculated:

      “In our differential expression (DE) analysis, we used Seurat’s FindMarkers() function, which computes the log fold change as the difference between the average normalized gene expression in each group on the natural log scale:

      logFC = log_e(mean(expression in group 1)) - log_e(mean(expression in group 2))

      This is calculated in pseudobulk, where cells with the same sgRNA are grouped together and their mean expression is compared to the mean expression of cells harbouring NTC guides. To calculate the per-gene differential expression p-value between the two cell groups (cells with sgRNA vs cells with NTC), the Wilcoxon rank-sum test was used”.
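
      For readers comparing the two calculations discussed in this exchange, the sketch below places the mean-based formula quoted above next to the summed-count pseudobulk the reviewer describes. It is a plain-Python illustration, not Seurat or DESeq code: the function names, the toy values, and the scale factor are assumptions, and no pseudocount is applied, so a group with zero mean expression would raise an error.

```python
import math
from statistics import fmean


def mean_based_logfc(group_expr, ntc_expr):
    """Natural-log fold change between group means, as in the quoted formula."""
    return math.log(fmean(group_expr)) - math.log(fmean(ntc_expr))


def summed_count_logfc(group_counts, group_depths, ntc_counts, ntc_depths, scale=1e4):
    """Pseudobulk as the reviewer describes it.

    Raw UMI counts for the gene are summed per group (zeros included),
    normalized by the summed sequencing depth of that group, and the
    natural log of the ratio is returned.
    """
    group_rate = scale * sum(group_counts) / sum(group_depths)
    ntc_rate = scale * sum(ntc_counts) / sum(ntc_depths)
    return math.log(group_rate / ntc_rate)


# Toy data for one gene: four cells carrying an sgRNA, four NTC cells.
sgrna_expr = [0.0, 1.0, 2.0, 1.0]  # one dropout cell is kept in the mean
ntc_expr = [2.0, 2.0, 2.0, 2.0]
depths = [100, 100, 100, 100]      # hypothetical total UMIs per cell

logfc_mean = mean_based_logfc(sgrna_expr, ntc_expr)
logfc_sum = summed_count_logfc([0, 1, 2, 1], depths, [2, 2, 2, 2], depths)

print(f"mean-based logFC (ln):   {logfc_mean:.3f}")
print(f"summed-count logFC (ln): {logfc_sum:.3f}")
print(f"mean-based logFC (log2): {logfc_mean / math.log(2):.3f}")
```

      With equal depth per cell the two estimates coincide. They diverge when sequencing depth differs between cells or when expression values are normalized per cell before averaging. Dividing a natural-log fold change by ln(2) converts it to log2 units.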

      (4) A more careful characterization of the cell lines used would be helpful. First, it would be useful to include the quality controls performed when the clonal lines were selected, in the manuscript. Moreover, a transcriptome analysis in comparison to the parental cell line could be performed to show that the cell lines are comparable. In addition, it could be helpful to perform the analysis of the samples separately to see how many of the response behaviors would still be observed.

      Details of the quality control steps used during the selection of the CRISPRa clonal line are already included in the Methods section, and Fig. S4A shows the transcriptome comparison of CRISPRi and CRISPRa lines also for non-targeting guides. Regarding the transcriptomic comparison with the parental cell line, we agree that such an analysis would be informative; however, this would require additional experiments that are not feasible within the scope of the current study. Finally, while analyzing the samples separately could provide further insight into response heterogeneity, we focused on identifying robust patterns across perturbations that are reproducible in our pooled screening framework. We believe these aggregate analyses capture the major response behaviors and support the conclusions drawn.

      (5) In general we were surprised to see such strong responses in some of the trans genes, in some cases exceeding the fold changes of the cis gene perturbation more than 2x, even at the relatively modest cis gene perturbations (Figures S5-S8). How can this be explained?

      This phenomenon—where trans gene responses can exceed the magnitude of cis gene perturbations—is not unique to our study. Similar effects have been observed in previous CRISPR perturbation screens conducted in K562 cells, including those by Morris et al. (2023), Gasperini et al. (2019), and Replogle et al. (2022).

      Several factors may contribute to this pattern. One possibility is that certain trans genes are highly sensitive to transcription factor dosage, and therefore exhibit amplified expression changes in response to relatively modest upstream perturbations. Transcription factors are known to be highly dosage sensitive and generally show a smaller range of variation than many other genes (that are regulated by TFs). Mechanistically, this may involve non-linear signal propagation through regulatory networks, in which intermediate regulators or feedback loops amplify the downstream transcriptional response. While our dataset cannot fully disentangle these indirect effects, the consistency of this observation across multiple studies suggests it is a common feature of transcriptional regulation in K562 cells.

      (6) In the analysis shown in Figure S3B, the correlation between cells with zero count and >0 counts for the cis gene is calculated. For comparison, this analysis should also show the correlation between the cells with similar cis-gene expression and between truly different populations (e.g. NTC vs strong sgRNA).

      The intent of Figure S3B was not to compare biologically distinct populations or perform differential expression analyses—which we have already conducted and reported elsewhere in the manuscript—but rather to assess whether fold change estimates could be biased by differences in the baseline expression of the target gene across individual cells. Specifically, we sought to determine whether cells with zero versus non-zero expression (as can result from dropouts or binary on/off repression from the KRAB-based CRISPRi system) exhibit systematic differences that could distort fold change estimation. As such, the comparisons suggested by the reviewer do not directly relate to the goal of the analysis which Figure S3B was intended to show.

      (7) It is unclear why the correlation between different lanes is assessed as quality control metrics in Figure S1C. This does not substitute for replicates.

      The intent of Figure S1C was not to serve as a general quality control metric, but rather to illustrate that the targeted transcript capture approach yielded consistent and specific signal across lanes. We acknowledge that this may have been unclear and have revised the relevant sentence in the text to avoid misinterpretation.

      “We used the protein hashes and the dCas9 cDNA (indicating the presence or absence of the KRAB domain) to demultiplex and determine the cell line—CRISPRi or CRISPRa. Cells containing a single sgRNA were identified using a Gaussian mixture model (see Methods). Standard quality control procedures were applied to the scRNA-seq data (see Methods). To confirm that the targeted transcript capture approach worked as intended, we assessed concordance across capture lanes (Figure S1C)”.

      (8) Figures and legends often miss important information. Figure 3B and S5-S8: what do the transparent bars represent? Figure S1A: color bar label missing. Figure S4D: what are the lines? Figure S9A: what is the red line? In Figure S8 some of the fitted curves do not overlap with the data points, e.g. PKM. Fig. 2C: why are there more than 96 guide RNAs (see y-axis)?

      We have addressed each point as follows:

      Figure 3B: The figure legend has been updated to clarify the meaning of the transparent bars.

      Figures S5–S8: There are no transparent bars in these figures; we confirmed this in the source plots.

      Figure S1A: The color bar label is already described in the figure legend, but we have reformulated the caption text to make this clearer.

      Figure S4D: The dashed line represents a linear regression between the x and y variables. The figure caption has been updated accordingly.

      Figure S9A: We clarified that the red line shows the median ∆AIC across all genes and conditions.

      Figure S8: We agree that some fitted curves (e.g., PKM) do not closely follow the data points. This reflects high noise in these specific measurements; as noted in the text, TET2 is not expected to exert strong trans effects in this context.

      Figure 2C: Thank you for catching this. The y-axis numbers were incorrect because the figure displays the proportion of guides (summing to 100%), not raw counts. We have corrected the y-axis label and updated the numbers in the figure to resolve this inconsistency.

      (9) The code is deposited on Github, but documentation is missing.

      Documentation is included as inline comments within the R code files to guide users through the analysis workflow.

      (10) The methods miss a list of sgRNA target sequences.

      We thank the reviewer for this observation. A complete table containing all processed data, including the sequences of the sgRNAs used in this study, is available at the following GEO link:

      https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE257547&format=file&file=GSE257547%5Fd2n%5Fprocessed%5Fdata%2Etxt%2Egz

      (11) In some parts, the language could be more specific and/or the readability improved, for example:

      Line 88: "quantitative landscape".

      Changed to “quantitative patterns”.

      Lines 88-91: long sentence hard to read.

      This complex sentence was broken up into two simpler ones:

      “We uncovered quantitative patterns of how gradual changes in transcription dosage lead to linear and non-linear responses in downstream genes. Many downstream genes are associated with rare and complex diseases, with potential effects on cellular phenotypes”.

      Line 110: "tiling sgRNAs +/- 1000 bp from the TSS", could maybe be specified by adding that the average distance was around 100 or 110 bps?

      Lines 244-246: hard to understand.

      We struggle to see the issue here and are not sure how it can be reworded.

      Lines 339-342: hard to understand.

      These sentences have been reworded to provide more clarity.

      (12) A number of typos, and errors are found in the manuscript:

      Line 71: "SOX2" -> "SOX9".

      FIXED

      Line 73: "rending" -> maybe "raising" or "posing"?

      FIXED

      Line 157: "biassed".

      FIXED

      Line 245: "exhibited correlations gene-specific correlations with".

      FIXED

      Multiple instances, e.g. 261: "transgene" -> "trans gene".

      FIXED

      Line 332: "not reproduced with among the other".

      FIXED

      Figure S11: betweenness.

      This is the correct spelling

      There are more typos that we didn't list here.

      We went through the manuscript and corrected all the spelling errors and typos.

    1. Author response:

      Reviewer #1:

      We appreciate the reviewer’s positive assessment of TSvelo and their helpful technical comments. In the revised manuscript, we will:

      (1) Provide a clearer discussion of TF–target annotations, their limitations, and potential integration of additional databases.

      (2) Clarify the rationale for example-gene selection (e.g., in Fig. 2d).

      (3) Re-evaluate and temper the interpretation regarding ANXA4 and early-stage cell-cycle transitions.

      (4) Add appropriate references supporting neuronal inside-out migration.

      (5) Include additional analysis comparing TF-based transcription rate estimation with ATAC-based estimates from MultiVelo.

      (6) Clarify how lineages were determined in Fig. 6g and incorporate barcode-based validation where applicable.

      (7) Correct all typographical errors noted.

      Reviewer #2:

      We appreciate the reviewer’s careful examination of modeling, benchmarking, and interpretation. To address these concerns, we will:

      (1) Expand the methodological justification for initial-state selection, add simulations with ground truth, and evaluate U-to-S delay more broadly across genes.

      (2) Clarify matrix formulations and ensure consistency in notation (e.g., Eq. 8).

      (3) Assess robustness to prior-knowledge graphs and evaluate alternatives beyond ENCODE/ChEA.

      (4) Add methodological details on parameter search.

      (5) Improve benchmarking on pancreatic endocrine datasets by including clear definitions of velocity pseudotime, ARI for cell-type separation, quantitative evaluation of phase-portrait fits, and appropriate interpretation of consistency metrics for multi-lineage systems.

      (6) Reframe claims about “accurate” or “correct” predictions where evidence is qualitative and strengthen quantitative support where possible.

      (7) Clarify lineage segmentation and merging when applying PAGA-guided multi-lineage modeling.

      Reviewer #3:

      We thank the reviewer for highlighting the need for more rigorous benchmarking and conceptual clarity. In response, we will:

      (1) Conduct an expanded simulation study incorporating different data-generating models.

      (2) Revise all strong claims to more cautious, evidence-based language.

      (3) Add a concise table summarizing conceptual and computational differences among RNA-velocity frameworks.

      (4) More clearly articulate the conceptual novelty of TSvelo relative to existing approaches.

      (5) Include runtime and memory benchmarks across representative datasets.

      (6) Explore additional methods in conceptual comparisons and benchmarking analyses.

      We appreciate the reviewers’ thoughtful input and agree that the suggested analyses and clarifications will significantly improve the rigor and clarity of the manuscript. We will incorporate all recommended revisions in the resubmission and provide a full, detailed, point-by-point response at that time.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) The methods section is overly brief. Even if techniques are cited, more experimental details should be included. For example, since the study focuses heavily on methodology, details such as the number of PCR cycles in RT-PCR or the rationale for choosing HA and PB2 as representative in vitro transcripts should be provided.

      We thank the reviewer for this important suggestion. We have now expanded the Methods section to include the number of PCR cycles used in RT-PCR (line 407) and have explained the rationale for choosing HA and PB2 as representative transcripts (line 388).

      (2) Information on library preparation and sequencing metrics should be included. For example, the total number of reads, any filtering steps, and quality score distributions/cutoff for the analyzed reads.

      We agree and have added detailed information on library preparation, filtering criteria, quality score thresholds, and sequencing statistics for each sample (line 422, Figure S2).

      (3) In the Results section (line 115, "Quantification of error rate caused by RT"), the mutation rate attributed to viral replication is calculated. However, in line 138, it is unclear whether the reported value reflects PB2, HA, or both, and whether the comparison is based on the error rate of the same viral RNA or the mean of multiple values (as shown in Figure 3A). Please clarify whether this number applies universally to all influenza RNAs or provide the observed range.

      We appreciate this point. We have clarified in the Results (line 140) that the reported value corresponds to PB2.

      (4) Since the T7 polymerase introduced errors are only applied to the in vitro transcription control, how were these accounted for when comparing mutation rates between transcribed RNA and cell-culture-derived virus?

      We agree that errors introduced by T7 RNA polymerase are present only in the in vitro–transcribed RNA control. However, even when taking this into account, the error rate detected in the in vitro transcripts remained substantially lower than that observed in the viral RNA extracted from replicated virus (line 140, Fig. 3a). Thus, the difference cannot be explained by T7-derived errors, and our conclusion regarding the elevated mutation rate in cell-culture–derived viral populations remains valid.

      (5) Figure 2 shows that a UMI group size of 4 has an error rate of zero, but this group size is not mentioned in the text. Please clarify.

      We have revised the Results (line 98) to describe the UMI group size of 4.

      Reviewer #2 (Public review):

      (1) The application of UMI-based error correction to viral population sequencing has been established in previous studies (e.g., HIV), and this manuscript does not introduce a substantial methodological or conceptual advance beyond its use in the context of influenza.

      We appreciate the reviewer’s comment and agree that UMI-based error correction has been applied previously to viral population sequencing, including HIV. However, to our knowledge, relatively few studies have quantitatively evaluated both the performance of this method and the resulting within-quasi-species mutation distributions in detail. In our manuscript, we not only validate the accuracy of UMI-based error correction in the context of influenza virus sequencing, but also quantitatively characterize the features of intra-quasi-species distributions, which provides new insights into the mutational landscape and evolutionary dynamics specific to influenza. We therefore believe that our work goes beyond a simple application of an established method.

      (2) The study lacks independent biological replicates or additional viral systems that would strengthen the generalizability of the conclusions.

      We agree with the reviewer that the lack of independent biological replicates and additional viral systems limits the generalizability of our findings. In this study, we intentionally focused on single-particle–derived populations of influenza virus to establish a proof-of-principle for our sequencing and analytical framework. While this design provided a clear demonstration of the method’s ability to capture mutation distributions at the single-particle level, we acknowledge that additional biological replicates and testing across diverse viral systems would be necessary to confirm the broader applicability of our observations. Importantly, even within this limited framework, our analysis enabled us to draw conclusions at the level of individual viral populations and to suggest the possibility of comparing their mutation distributions with known evolvability. This highlights the potential of our approach to bridge observations from single particles with broader patterns of viral evolution. In future work, we plan to expand the number of populations analyzed and include additional viral systems, which will allow us to more rigorously assess reproducibility and to establish systematic links between mutation accumulation at the single-particle level and evolutionary dynamics across viruses.

      (3) Potential sources of technical error are not explored or explicitly controlled. Key methodological details are missing, including the number of PCR cycles, the input number of molecules, and UMI family size distributions.

      We thank the reviewer for this important suggestion. We have now expanded the Methods section to include the number of PCR cycles used in RT-PCR (line 407). In addition, we have added information on the estimated number of input molecules. Regarding the UMI family size distributions, we have added the data as Figure S2 and referred to it in the revised manuscript.

      Finally, with respect to potential sources of technical error, we note that this point is already addressed in the manuscript by direct comparison with in vitro transcribed RNA controls, which encompass errors introduced throughout the entire experimental process. This comparison demonstrates that the error-correction strategy employed here effectively reduces the impact of PCR or sequencing artifacts.

      (4) The assertion that variants at ≥0.1% frequency can be reliably detected is based on total read count rather than the number of unique input molecules. Without information on UMI diversity and family sizes, the detection limit cannot be reliably assessed.

      We thank the reviewer for raising this important issue. We agree that our original description was misleading, as the reliable detection limit should not be defined solely by total read count. In the revised version, we have added information on UMI distribution and family sizes (Figure S2), and we now state the detection limit in terms of consensus reads. Specifically, we define that variants can be reliably detected when ≥10,000 consensus reads are obtained with a group size of ≥3 (line 173). 
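
The role of the group-size threshold can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `umi_consensus`, the majority-vote rule, and the assumption that reads sharing a UMI are already aligned to equal length are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict


def umi_consensus(reads, min_group_size=3):
    """Collapse reads that share a UMI into one consensus read.

    reads: iterable of (umi, sequence) pairs; sequences within a UMI group
    are assumed to be aligned and of equal length (illustrative assumption).
    Groups smaller than min_group_size are discarded.
    """
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)

    consensus = {}
    for umi, seqs in groups.items():
        if len(seqs) < min_group_size:
            continue  # too few reads to out-vote a PCR or sequencing error
        # Majority vote at each position across the reads of this UMI group.
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus


example_reads = [
    ("AAAC", "ACGT"),
    ("AAAC", "ACGT"),
    ("AAAC", "ACTT"),  # one read carries an isolated error at position 3
    ("GGGT", "ACGT"),  # singleton group, dropped by the size threshold
]
print(umi_consensus(example_reads))
```

Under this sketch, the number of consensus reads (the dictionary size) rather than the number of raw reads is the quantity that bounds the detection limit.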

      (5)  Although genetic variation is described, the functional relevance of observed mutations in HA and NA is not addressed or discussed.

      We appreciate the reviewer’s suggestion. In our study, we did not apply drug or immune selection pressure; therefore, we did not expect to detect mutations that are already known to cause major antigenic changes in HA or NA, and we think it is difficult to discuss such functional implications in this context. However, as noted in the Discussion, we did identify drug resistance–associated mutations. This observation suggests that the quasi-species pool may provide functional variation, including resistance, even in the absence of explicit selective pressure. We have clarified this point in the text to better address the reviewer’s concern (line 330).

      (6) The experimental scale is small, with only four viral populations derived from single particles analyzed. This limited sample size restricts the ability to draw broader conclusions.

      We thank the reviewer for pointing out the limitation of analyzing only four viral populations derived from single particles. We fully acknowledge that the small sample size restricts the generalizability of our conclusions. Nevertheless, we would like to emphasize that even within this limited dataset, our results consistently revealed a slight but reproducible deviation of the mutation distribution from the Poisson expectation, as well as a weak correlation with inter-strain conservation. These recurring patterns highlight the robustness of our observations despite the sample size.

      In future work, we plan to expand the number of viral populations analyzed and to monitor mutation distributions during serial passage under defined selective pressures. We believe that such expanded analyses will enable us to more reliably assess how mutations accumulate and to develop predictive frameworks for viral evolution.

      Reviewer #1 (Recommendations for the authors):

      (1)  Please mention Figure 1 and S2 in the text.

      Done. We now explicitly reference Figures 1 and S2 (renamed to S1 according to appearance order) in the appropriate sections (lines 74, 124).

      (2)  In Figure 4A, please specify which graph corresponds to PB2 and which to PB2-like sequences.

      Corrected. The Figure 4A legend now specifies PB2 vs. PB2-like sequences.

      (3)  Consider reducing redundancy in lines 74, 149, 170, 214, and 215.

      We thank the reviewer for this stylistic suggestion. We have revised the text to reduce redundancy in these lines.

      Reviewer #2 (Recommendations for the authors):

      (1)  The manuscript states that "with 10,000 sequencing reads per gene ...variants at ≥0.1% frequency can be reliably detected." However, this interpretation conflates raw read counts with independent input molecules.

      We have revised this statement throughout the text to clarify that sensitivity depends on the number of unique UMIs rather than raw read counts (line 173). To support this, we calculated the probability of detecting a true variant present at a frequency of 0.1% within a population. When sequencing ≥10,000 unique molecules, such a variant would be observed at least twice with a probability of approximately 99.95%. In contrast, the error rate of in vitro–transcribed RNA, reflecting errors introduced during the experimental process, was estimated to be on the order of 10⁻⁶ (line 140, Fig. 3a). Under this condition, the probability that the same artificial error would arise independently at the same position in two out of 10,000 molecules is <0.5%. Therefore, variants present at ≥0.1% can be reliably distinguished from technical artifacts and are confidently detected under our sequencing conditions.
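
The two probabilities quoted above can be checked with a short binomial calculation. This is a sketch under stated assumptions: each of the 10,000 unique molecules is treated as an independent Bernoulli trial, and the 10⁻⁶ error rate is treated as the per-molecule probability of one specific artificial error at one position. The helper name `prob_at_least_k` is introduced here for illustration only.

```python
from math import comb


def prob_at_least_k(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p), computed as 1 minus the lower tail."""
    lower_tail = sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k))
    return 1.0 - lower_tail


n_molecules = 10_000  # unique (UMI-collapsed) molecules

# True variant present at 0.1% frequency, observed at least twice.
p_true_variant = prob_at_least_k(n_molecules, 1e-3, 2)

# Same artificial error arising independently in at least two molecules,
# assuming a per-molecule error probability of 1e-6 at that position.
p_artifact = prob_at_least_k(n_molecules, 1e-6, 2)

print(f"P(true variant seen >= 2 times) = {p_true_variant:.4%}")
print(f"P(same artifact seen >= 2 times) = {p_artifact:.4%}")
```

The first value comes out at approximately 99.95%, and the second falls well below the 0.5% bound stated in the response.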

      (2) To support the claimed sensitivity, please provide for each gene and population: (a) UMI family size distributions, (b) number of PCR cycles and input molecule counts, and (c) recalculation of the detection limit based on unique molecules.

      If possible, I encourage experimental validation of sensitivity claims, such as spike-in controls at known variant frequencies, dilution series, or technical replicates to demonstrate reproducibility at the 0.1% detection level.

      We have added (a) histograms of UMI family size distributions for each gene and population (Figure S2), (b) a detailed RT-PCR protocol and estimated input counts (line 407), and (c) recalculated detection limits (line 173).

      We appreciate the reviewer’s suggestion and fully recognize the value of spike-in experiments. However, given the observed mutation rate of T7-derived RNA and the sufficient sequencing depth in our dataset, it is evident that variants above the 0.1% threshold can be robustly detected without additional spike-in controls.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      Summary:

      The aim of this paper is to develop a simple method to quantify fluctuations in the partitioning of cellular elements. In particular, they propose a flow-cytometry based method coupled with a simple mathematical theory as an alternative to conventional imaging-based approaches.

      Strengths:

      The approach they develop is simple to understand and its use with flow-cytometry measurements is clearly explained. Understanding how the fluctuations in the cytoplasm partition vary for different kinds of cells is particularly interesting.

      Weaknesses:

      The theory only considers fluctuations due to cellular division events. Fluctuations in cellular components are largely affected by various intrinsic and extrinsic sources of noise and only under particular conditions does partitioning noise become the dominant source of noise. In the revised version of the manuscript, they argue that in their setup, noise due to production and degradation processes is negligible but noise due to extrinsic sources such as those stemming from cell-cycle length variability may still be important. To investigate the robustness of their modelling approach to such noise, they simulated cells following a sizer-like division strategy, a scenario that maximizes the coupling between fluctuations in cell-division time and partitioning noise. They find that estimates remain within the pre-established experimental error margin.

      We thank the Reviewer for her/his work in revising our manuscript.

      Reviewer #2 (Public review):

      Summary:

      The authors present a combined experimental and theoretical workflow to study partitioning noise arising during cell division. Such quantifications usually require time-lapse experiments, which are limited in throughput. To bypass these limitations, the authors propose to use flow-cytometry measurements instead and analyse them using a theoretical model of partitioning noise. The problem considered by the authors is relevant and the idea to use statistical models in combination with flow cytometry to boost statistical power is elegant. The authors demonstrate their approach using experimental flow cytometry measurements and validate their results using time-lapse microscopy. The approach focuses on a particular case, where the dynamics of the labelled component depends predominantly on partitioning, while turnover of components is not taken into account. The description of the methods is significantly clearer than in the previous version of the manuscript.

      We thank the Reviewer for her/his work in revising our manuscript. In the following, we address the remaining raised points.

      I have only two comments left:

      • In eq. (1) the notation has been changed/corrected, but the text immediately after it still refers to the old notation.

      We have fixed the notation.

      • Maybe I don't fully understand the reasoning provided by the authors, but it is still not entirely clear to me why microscopy-based estimates are expected to be larger. Fewer samples will increase the estimation uncertainty, but this can go either way in terms of the inferred variability.

      We thank the Reviewer for giving us the opportunity to clarify this point. In the previous answer, we focused on the role of the gating strategy, highlighting how the limited statistics available with microscopy prevent a stricter selection of events. Several factors explain why noise biases the estimate of division asymmetry upward. First, the measured fluorescence intensity of single cells fluctuates because of the multiple sources of noise affecting the fluorescence signal, the experimental procedure, and the segmentation protocol. This variability adds to the inherent stochasticity of the partitioning process, thereby increasing the overall variance of the distribution.

      To illustrate this effect, we simulated the microscopy data. We drew a fraction f from a Gaussian distribution with mean µ = 𝑝 and standard deviation σ = σ<sub>𝑡𝑟𝑢𝑒</sub>, i.e., 𝑁(𝑝, σ<sub>𝑡𝑟𝑢𝑒</sub>). We then simulated different time frames by adding to f noise drawn from a Gaussian distribution with mean µ = 0 and standard deviation σ = σ<sub>𝑛𝑜𝑖𝑠𝑒</sub>, i.e., 𝑁(0, σ<sub>𝑛𝑜𝑖𝑠𝑒</sub>). The same procedure was applied to 1 − f, with the noise resampled so that the two measurements remained independent. Figure 6 shows a sample dynamic in which the empty gray circles represent the true fractions. We then fitted the two dynamics to a linear equation with a common slope and obtained an estimate of the partitioning noise.

      By repeating this process a number of times consistent with the experiment, we measured the resulting standard deviation of the new partitioning distribution. Figure 7 shows the distribution of the measured standard deviation over multiple repetitions of the simulations. Each histogram shows the variance of the partitioning distribution obtained from 100 simulations of the noisy (and non-noisy) fluorescence dynamics. Comparing this with the distribution obtained for the non-noisy dynamics shows that, on average, the added noise leads to a greater estimated variance. The magnitude of this increase depends on the variance of the added noise, but it is always biased toward larger values.

      This represents only one component of the effect. The shown distributions and simulations are intended solely to demonstrate the direction of the bias, and not to account for the exact difference between the flow cytometry and microscopy estimates. In the proposed case, where noise and true variance are equal, the resulting difference in division asymmetry is 1.3.
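
The direction of this bias can be reproduced with a small simulation. The sketch below is a simplification and not the authors' code: it replaces the common-slope linear fit with a plain average over time frames, and the function name `simulate_partition_sd` and all parameter values (`p`, `sigma_true`, `sigma_noise`, the number of divisions and frames) are illustrative assumptions.

```python
import random
import statistics


def simulate_partition_sd(p=0.5, sigma_true=0.05, sigma_noise=0.05,
                          n_divisions=100, n_frames=10, seed=0):
    """Return (SD of true fractions, SD of fractions estimated from noisy data)."""
    rng = random.Random(seed)
    true_fractions, estimated_fractions = [], []
    for _ in range(n_divisions):
        f = rng.gauss(p, sigma_true)  # true inherited fraction, N(p, sigma_true)
        # Independent measurement noise on each daughter in every time frame.
        daughter_a = [f + rng.gauss(0.0, sigma_noise) for _ in range(n_frames)]
        daughter_b = [(1.0 - f) + rng.gauss(0.0, sigma_noise) for _ in range(n_frames)]
        mean_a = statistics.fmean(daughter_a)
        mean_b = statistics.fmean(daughter_b)
        true_fractions.append(f)
        # Fraction re-estimated from the noisy, frame-averaged intensities.
        estimated_fractions.append(mean_a / (mean_a + mean_b))
    return statistics.stdev(true_fractions), statistics.stdev(estimated_fractions)


sd_true, sd_estimated = simulate_partition_sd()
print(f"SD of true fractions:      {sd_true:.4f}")
print(f"SD of estimated fractions: {sd_estimated:.4f}")
```

Because the measurement noise is independent of the true fraction, its variance adds to the partitioning variance, so the estimated spread exceeds the true one on average. A single run with only 100 divisions can fluctuate in either direction, which is why the comparison is made over repeated simulations.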

      A second contribution arises from the segmentation protocol. As we stated, a major limitation of the microscopy-based approach is the need for manual image segmentation. This reduces the amount of available data and introduces potential errors. Even though different checks were applied, some situations are difficult to avoid. For example, when daughter cells are very close to each other, the borders may not be precisely recognized; cells may overlap; or speckles may remain undetected. In all these cases, it is easier to overestimate the fluorescence than to underestimate it, thereby increasing the chance of an extremal event.

      Indeed, segmentation relies on both brightfield and fluorescence images. Errors in defining the cell outline are more likely when fluorescence is low, since borders, overlaps, and speckles are more evident against a darker background. This introduces an additional bias toward higher asymmetry, increasing the number of events in the tail of the partitioning distribution.

      Both aspects described above could be mitigated by increasing the available statistics. In particular, by applying stricter selection criteria, such as imposing limits on fluorescence intensity fluctuations, the distribution should approach the expected one.

      A similar issue does not arise in flow cytometry experiments. From the initial sorting procedure, which ensures a cleaner separation of peaks, to the morphological checks performed at each acquisition point, the availability of a large number of measured events reduces both measurement noise and segmentation errors.

      A discussion on these aspects has been added in the revised version of the Supplementary Materials and in the Main Text.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this study, participants completed two different tasks. A perceptual choice task in which they compared the sizes of pairs of items and a value-difference task in which they identified the higher value option among pairs of items with the two tasks involving the same stimuli. Based on previous fMRI research, the authors sought to determine whether the superior frontal sulcus (SFS) is involved in both perceptual and value-based decisions or just one or the other. Initial fMRI analyses were devised to isolate brain regions that were activated for both types of choices and also regions that were unique to each. Transcranial magnetic stimulation was applied to the SFS in between fMRI sessions and it was found to lead to a significant decrease in accuracy and RT on the perceptual choice task but only a decrease in RT on the value-difference task. Hierarchical drift-diffusion modelling of the data indicated that the TMS had led to a lowering of decision boundaries in the perceptual task and a lowering of non-decision times on the value-based task. Additional analyses show that SFS covaries with model-derived estimates of cumulative evidence and that this relationship is weakened by TMS.

      Strengths:

      The paper has many strengths including the rigorous multi-pronged approach of causal manipulation, fMRI and computational modelling which offers a fresh perspective on the neural drivers of decision making. Some additional strengths include the careful paradigm design which ensured that the two types of tasks were matched for their perceptual content while orthogonalizing trial-to-trial variations in choice difficulty. The paper also lays out a number of specific hypotheses at the outset regarding the behavioural outcomes that are tied to decision model parameters and are well justified.

      Weaknesses:

      (1.1) Unless I have missed it, the SFS does not actually appear in the list of brain areas significantly activated by the perceptual and value tasks in Supplementary Tables 1 and 2. Its presence or absence from the list of significant activations is not mentioned by the authors when outlining these results in the main text. What are we to make of the fact that it is not showing significant activation in these initial analyses?

      You are right that the left SFS does not appear in our initial task-level contrasts. Those first analyses were deliberately agnostic to evidence accumulation (i.e., average BOLD by task, irrespective of trial-by-trial evidence). Consistent with prior work, SFS emerges only when we model the parametric variation in accumulated perceptual evidence.

      Accordingly, we ran a second-level GLM that included trial-wise accumulated evidence (aE) as a parametric modulator. In that analysis, the left SFS shows significant aE-related activity specifically during perceptual decisions, but not during value-based decisions (SVC in a 10-mm sphere around x = −24, y = 24, z = 36).

      To avoid confusion, we now:

      (i) explicitly separate and label the two analysis levels in the Results; (ii) state up front that SFS is not expected to appear in the task-average contrast; and (iii) add a short pointer that SFS appears once aE is included as a parametric modulator. We also edited Methods to spell out precisely how aE is constructed and entered into GLM2. This should make the logic of the two-stage analysis clearer and aligns the manuscript with the literature where SFS typically emerges only in parametric evidence models.

      (1.2) The value difference task also requires identification of the stimuli, and therefore perceptual decision-making. In light of this, the initial fMRI analyses do not seem terribly informative for the present purposes as areas that are activated for both types of tasks could conceivably be specifically supporting perceptual decision-making only. I would have thought brain areas that are playing a particular role in evidence accumulation would be best identified based on whether their BOLD response scaled with evidence strength in each condition which would make it more likely that areas particular to each type of choice can be identified. The rationale for the authors' approach could be better justified.

      We agree that both tasks require early sensory identification of the items, but the decision-relevant evidence differs by design (size difference vs. value difference), and our modelling is targeted at the evidence integration stage rather than initial identification.

      To address your concern empirically, we: (i) added session-wise plots of mean RTs showing a general speed-up across the experiment (now in the Supplement); (ii) fit a hierarchical DDM to jointly explain accuracy and RT. The DDM dissociates decision time (evidence integration) from non-decision time (encoding/response execution).

      After cTBS, perceptual decisions show a selective reduction of the decision boundary (lower accuracy, faster RTs; no drift-rate change), whereas value-based decisions show no change to boundary/drift but a decrease in non-decision time, consistent with faster sensorimotor processing or task familiarity. Thus, the TMS effect in SFS is specific to the criterion for perceptual evidence accumulation, while the RT speed-up in the value task reflects decision-irrelevant processes. We now state this explicitly in the Results and add the RT-by-run figure for transparency.

      (1.2.1) The value difference task also requires identification of the stimuli, and therefore perceptual decision-making. In light of this, the initial fMRI analyses do not seem terribly informative for the present purposes as areas that are activated for both types of tasks could conceivably be specifically supporting perceptual decision-making only.

      Thank you for prompting this clarification.

      The key point is what changes with cTBS. If SFS supported generic identification, we would expect parallel cTBS effects on drift rate (or boundary) in both tasks. Instead, we find: (a) boundary decreases selectively in perceptual decisions (consistent with SFS setting the amount of perceptual evidence required), and (b) non-decision time decreases selectively in the value task (consistent with speed-ups in encoding/response stages). Moreover, trial-by-trial SFS BOLD predicts perceptual accuracy (controlling for evidence), and neural-DDM model comparison shows SFS activity modulates boundary, not drift, during perceptual choices.

      Together, these converging behavioral, computational, and neural results argue that SFS specifically supports the criterion for perceptual evidence accumulation rather than generic visual identification.

      (1.2.2) I would have thought brain areas that are playing a particular role in evidence accumulation would be best identified based on whether their BOLD response scaled with evidence strength in each condition which would make it more likely that areas particular to each type of choice can be identified. The rationale for the authors' approach could be better justified.

      We now more explicitly justify the two-level fMRI approach. The task-average contrast addresses which networks are generally more engaged by each domain (e.g., posterior parietal for PDM; vmPFC/PCC for VDM), given identical stimuli and motor outputs. This complements, but does not substitute for, the parametric evidence analysis, which is where one expects accumulation-related regions such as SFS to emerge. We added text clarifying that the first analysis establishes domain-specific recruitment at the task level, whereas the second isolates evidence-dependent signals (aE) and reveals that left SFS tracks accumulated evidence only for perceptual choices. We also added explicit references to the literature using similar two-step logic and noted that SFS typically appears only in parametric evidence models.

      (1.3) TMS led to reductions in RT in the value-difference as well as the perceptual choice task. DDM modelling indicated that in the case of the value task, the effect was attributable to reduced non-decision time which the authors attribute to task learning. The reasoning here is a little unclear.

      (1.3.1) Comment: If task learning is the cause, then why are similar non-decision time effects not observed in the perceptual choice task?

      Great point. The DDM addresses exactly this: RT comprises decision time (DT) plus non-decision time (nDT). With cTBS, PDM shows reduced DT (via a lower boundary) but stable nDT; VDM shows reduced nDT with no change to boundary/drift. Hence, the superficially similar RT speed-ups in both tasks are explained by different latent processes: decision-relevant in PDM (lower criterion → faster decisions, lower accuracy) and decision-irrelevant in VDM (faster encoding/response). We added explicit language and a supplemental figure showing RT across runs, and we clarified in the text that only the PDM speed-up reflects a change to evidence integration.
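
      The decomposition described above can be illustrated with the standard closed-form predictions of an unbiased drift-diffusion model. This is a minimal sketch, not the hierarchical model fitted in the paper: the function name `ddm_predictions` and all parameter values are hypothetical, and the noise scale is assumed to be 1.

```python
import math

def ddm_predictions(drift, boundary, ndt, noise=1.0):
    """Closed-form accuracy and mean RT for an unbiased drift-diffusion model.

    Assumes the starting point lies midway between the two bounds
    (boundary separation = `boundary`). All values are illustrative.
    """
    accuracy = 1.0 / (1.0 + math.exp(-boundary * drift / noise**2))
    mean_decision_time = (boundary / (2.0 * drift)) * math.tanh(
        boundary * drift / (2.0 * noise**2)
    )
    # Observed RT = decision time (DT) + non-decision time (nDT)
    return accuracy, mean_decision_time + ndt

# Hypothetical parameter sets, chosen only to show the qualitative pattern
baseline = ddm_predictions(drift=1.0, boundary=2.0, ndt=0.40)
lower_boundary = ddm_predictions(drift=1.0, boundary=1.6, ndt=0.40)  # PDM-like change
shorter_ndt = ddm_predictions(drift=1.0, boundary=2.0, ndt=0.30)     # VDM-like change

for label, (acc, rt) in [("baseline", baseline),
                         ("lower boundary", lower_boundary),
                         ("shorter nDT", shorter_ndt)]:
    print(f"{label:15s} accuracy={acc:.3f} mean RT={rt:.3f} s")
```

      Under these assumptions, lowering the boundary shortens RT and reduces accuracy, whereas shortening nDT shortens RT by the same amount on every trial and leaves accuracy untouched.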

      (1.3.2) Given that the value-task actually requires perceptual decision-making, is it not possible that SFS disruption impacted the speed with which the items could be identified, hence delaying the onset of the value-comparison choice?

      We agree there is a brief perceptual encoding phase at the start of both tasks. If cTBS impaired visual identification per se, we would expect longer nDT in both tasks or a decrease in drift rate. Instead, nDT decreases in the value task and is unchanged in the perceptual task; drift is unchanged in both. Thus, cTBS over SFS does not slow identification; rather, it lowers the criterion for perceptual accumulation (PDM) and, separately, we observe faster non-decision components in VDM (likely familiarity or motor preparation). We added a clarifying sentence noting that item identification was easy and highly overlearned (static, large food pictures), and we cite that nDT is the appropriate locus for identification effects in the DDM framework; our data do not show the pattern expected of impaired identification.

      (1.4) The sample size is relatively small. The authors state that 20 subjects is 'in the acceptable range' but it is not clear what is meant by this.

      We have clarified what we mean and provided citations. The sample (n = 20) matches or exceeds many prior causal TMS/fMRI studies targeting perceptual decision circuitry (e.g., Philiastides et al., 2011; Rahnev et al., 2016; Jackson et al., 2021; van der Plas et al., 2021; Murd et al., 2021). Importantly, we (i) use within-subject, pre/post cTBS differences-in-differences with matched tasks; (ii) estimate hierarchical models that borrow strength across participants; and (iii) converge across behavior, latent parameters, regional BOLD, and connectivity. We now replace the vague phrase with a concrete statement and references, and we report precision (HDIs/SEs) for all main effects.

      Reviewer #2 (Public Review):

      Summary:

      The authors set out to test whether a TMS-induced reduction in excitability of the left Superior Frontal Sulcus influenced evidence integration in perceptual and value-based decisions. They directly compared behaviour - including fits to a computational decision process model - and fMRI pre and post-TMS in one of each type of decision-making task. Their goal was to test domain-specific theories of the prefrontal cortex by examining whether the proposed role of the SFS in evidence integration was selective for perceptual but not value-based evidence.

      Strengths:

      The paper presents multiple credible sources of evidence for the role of the left SFS in perceptual decision-making, finding similar mechanisms to prior literature and a nuanced discussion of where they diverge from prior findings. The value-based and perceptual decision-making tasks were carefully matched in terms of stimulus display and motor response, making their comparison credible.

      Weaknesses:

      (2.1) More information on the task and details of the behavioural modelling would be helpful for interpreting the results.

      Thank you for this request for clarity. In the revision we explicitly state, up front, how the two tasks differ and how the modelling maps onto those differences.

      (1) Task separability and “evidence.” We now define task-relevant evidence as size difference (SD) for perceptual decisions (PDM) and value difference (VD) for value-based decisions (VDM). Stimuli and motor mappings are identical across tasks; only the evidence to be integrated changes.

      (2) Behavioural separability that mirrors task design. As reported, mixed-effects regressions show PDM accuracy increases with SD (β=0.560, p<0.001) but not VD (β=0.023, p=0.178), and PDM RTs shorten with SD (β=−0.057, p<0.001) but not VD (β=0.002, p=0.281). Conversely, VDM accuracy increases with VD (β=0.249, p<0.001) but not SD (β=0.005, p=0.826), and VDM RTs shorten with VD (β=−0.016, p=0.011) but not SD (β=−0.003, p=0.419).

      (3) How the HDDM reflects this. The hierarchical DDM fits the joint accuracy–RT distributions with task-specific evidence (SD or VD) as the predictor of drift. The model separates decision time from non-decision time (nDT), which is essential for interpreting the different RT patterns across tasks without assuming differences in the accumulation process when accuracy is unchanged.

      These clarifications are integrated in the Methods (Experimental paradigm; HDDM) and in Results (“Behaviour: validity of task-relevant pre-requisites” and “Modelling: faster RTs during value-based decisions is related to non-decision-related sensorimotor processes”).

      (2.2) The evidence for a choice and 'accuracy' of that choice in both tasks was determined by a rating task that was done in advance of the main testing blocks (twice for each stimulus). For the perceptual decisions, this involved asking participants to quantify a size metric for the stimuli, but the veracity of these ratings was not reported, nor was the consistency of the value-based ones. It is my understanding that the size ratings were used to define the amount of perceptual evidence in a trial, rather than the true size differences, and without seeing more data the reliability of this approach is unclear. More concerning was the effect of 'evidence level' on behaviour in the value-based task (Figure 3a). While the 'proportion correct' increases monotonically with the evidence level for the perceptual decisions, for the value-based task it increases from the lowest evidence level and then appears to plateau at just above 80%. This difference in behaviour between the two tasks brings into question the validity of the DDM which is used to fit the data, which assumes that the drift rate increases linearly in proportion to the level of evidence.

      We thank the reviewer for raising these concerns, and we address each of them point by point:

      2.2.1. Comment: It is my understanding that the size ratings were used to define the amount of perceptual evidence in a trial, rather than the true size differences, and without seeing more data the reliability of this approach is unclear.

      That is correct—we used participants’ area/size ratings to construct perceptual evidence (SD).

      To validate this choice, we compared those ratings against an objective image-based size measure (proportion of non-black pixels within the bounding box). As shown in Author response image 2, perceptual size ratings are highly correlated with objective size across participants (Pearson r values predominantly ≈0.8 or higher; all p<0.001). Importantly, value ratings do not correlate with objective size (Author response image 1), confirming that the two rating scales capture distinct constructs. These checks support using participants’ size ratings as the participant-specific ground truth for defining SD in the PDM trials.
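
      As a rough illustration of the objective size measure, the proportion of non-black pixels within an item box could be computed as follows. This is a sketch under stated assumptions: the function name `objective_size`, the `black_threshold` parameter, and the toy image are hypothetical and are not taken from the authors' analysis code.

```python
import numpy as np

def objective_size(image, black_threshold=0):
    """Proportion of non-black pixels in an item box.

    `image` is an H x W (grayscale) or H x W x C (colour) array that has
    already been cropped to the item's bounding box (an assumption here).
    """
    pixels = np.asarray(image)
    if pixels.ndim == 3:
        # A pixel counts as non-black if any colour channel exceeds the threshold.
        non_black = (pixels > black_threshold).any(axis=-1)
    else:
        non_black = pixels > black_threshold
    return float(non_black.mean())

# Toy 4 x 4 box on a black background with a 2 x 3 lit region
box = np.zeros((4, 4), dtype=np.uint8)
box[1:3, 1:4] = 200
print(objective_size(box))  # 6 lit pixels out of 16 -> 0.375
```

      Per-participant correlations between such a measure and the ratings could then be obtained with `np.corrcoef`.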

      Author response image 1.

      Objective size and value ratings are unrelated. Scatterplots show, for each participant, the correlation between objective image size (x-axis; proportion of non-black pixels within the item box) and value-based ratings (y-axis; 0–100 scale). Each dot is one food item (ratings averaged over the two value-rating repetitions). Across participants, value ratings do not track objective size, confirming that value and size are distinct constructs.

      Author response image 2.

      Perceptual size ratings closely track objective size. Scatterplots show, for each participant, the correlation between objective image size (x-axis) and perceptual area/size ratings (y-axis; 0–100 scale). Each dot is one food item (ratings averaged over the two perceptual ratings). Perceptual ratings are strongly correlated with objective size for nearly all participants (see main text), validating the use of these ratings to construct size-difference evidence (SD).

      (2.2.2) More concerning was the effect of 'evidence level' on behaviour in the value-based task (Figure 3a). While the 'proportion correct' increases monotonically with the evidence level for the perceptual decisions, for the value-based task it increases from the lowest evidence level and then appears to plateau at just above 80%. This difference in behaviour between the two tasks brings into question the validity of the DDM which is used to fit the data, which assumes that the drift rate increases linearly in proportion to the level of evidence.

      We agree that accuracy appears to asymptote in VDM, but the DDM fits indicate that the drift rate still increases monotonically with evidence in both tasks. In Supplementary figure 11, drift (δ) rises across the four evidence levels for PDM and for VDM (panels showing all data and pre/post-TMS). The apparent plateau in proportion correct during VDM reflects higher choice variability at stronger preference differences, not a failure of the drift–evidence mapping. Crucially, the model captures both the accuracy patterns and the RT distributions (see posterior predictive checks in Supplementary figures 11-16), indicating that a monotonic evidence–drift relation is sufficient to account for the data in each task.

      Author response image 3.

      HDDM parameters by evidence level. Group-level posterior means (± posterior SD) for drift (δ), boundary (α), and non-decision time (τ) across the four evidence levels, shown (a) collapsed across TMS sessions, (b) for PDM (blue) pre- vs post-TMS (light vs dark), and (c) for VDM (orange) pre- vs post-TMS. Crucially, drift increases monotonically with evidence in both tasks, while TMS selectively lowers α in PDM and reduces τ in VDM (see Supplementary Tables for numerical estimates).

      (2.3) The paper provides very little information on the model fits (no parameter estimates, goodness of fit values or simulated behavioural predictions). The paper finds that TMS reduced the decision bound for perceptual decisions but only affected non-decision time for value-based decisions. It would aid the interpretation of this finding if the relative reliability of the fits for the two tasks was presented.

      We appreciate the suggestion and have made the quantitative fit information explicit:

      (1) Parameter estimates. Group-level means/SDs for drift (δ), boundary (α), and nDT (τ) are reported for PDM and VDM overall, by evidence level, pre- vs post-TMS, and per subject (see Supplementary Tables 8-11).

      (2) Goodness of fit and predictive adequacy. DIC values accompany each fit in the tables. Posterior predictive checks demonstrate close correspondence between simulated and observed accuracy and RT distributions overall, by evidence level, and across subjects (Supplementary Figures 11-16).

      Together, these materials document that the HDDM provides reliable fits in both tasks and accurately recovers the qualitative and quantitative patterns that underlie our inferences (reduced α for PDM only; selective τ reduction in VDM).

      (2.4) Behaviourally, the perceptual task produced decreased response times and accuracy post-TMS, consistent with a reduced bound and consistent with some prior literature. Based on the results of the computational modelling, the authors conclude that RT differences in the value-based task are due to task-related learning, while those in the perceptual task are 'decision relevant'. It is not fully clear why there would be such significantly greater task-related learning in the value-based task relative to the perceptual one. And if such learning is occurring, could it potentially also tend to increase the consistency of choices, thereby counteracting any possible TMS-induced reduction of consistency?

      Thank you for pointing out the need for a clearer framing. We have removed the speculative label “task-related learning” and now describe the pattern strictly in terms of the HDDM decomposition and neural results already reported:

      (1) VDM: Post-TMS RTs are faster while accuracy is unchanged. The HDDM attributes this to a selective reduction in non-decision time (τ), with no change in decision-relevant parameters (α, δ) for VDM (see Supplementary Figure 11 and Supplementary Tables). Consistent with this, left SFS BOLD is not reduced for VDM, and trialwise SFS activity does not predict VDM accuracy—both observations argue against a change in VDM decision formation within left SFS.

      (2) PDM: Post-TMS accuracy decreases and RTs shorten, which the HDDM captures as a lower decision boundary (α) with no change in drift (δ). Here, left SFS BOLD scales with accumulated evidence and decreases post-TMS, and trialwise SFS activity predicts PDM accuracy, all consistent with a decision-relevant effect in PDM.

      Regarding the possibility that faster VDM RTs should increase choice consistency: empirically, consistency did not change in VDM, and the HDDM finds no decision-parameter shifts there. Thus, there is no hidden counteracting increase in VDM accuracy that could mask a TMS effect—the absence of a VDM accuracy change is itself informative and aligns with the modelling and fMRI.

      Reviewer #3 (Public Review):

      Summary:

      Garcia et al. investigated whether the human left superior frontal sulcus (SFS) is involved in integrating evidence for decisions across either perceptual and/or value-based decision-making. Specifically, they had 20 participants perform two decision-making tasks (with matched stimuli and motor responses) in an fMRI scanner both before and after they received continuous theta burst transcranial magnetic stimulation (TMS) of the left SFS. The stimulation, thought to decrease neural activity in the targeted region, led to reduced accuracy on the perceptual decision task only. The pattern of results across both model-free and model-based (drift diffusion model) behavioural and fMRI analyses suggests that the left SFS plays a critical role in perceptual decisions only, with no equivalent effects found for value-based decisions. The DDM-based analyses revealed that the role of the left SFS in perceptual evidence accumulation is likely to be one of decision boundary setting. Hence the authors conclude that the left SFS plays a domain-specific causal role in the accumulation of evidence for perceptual decisions. These results are likely to add importance to the literature regarding the neural correlates of decision-making.

      Strengths:

      The use of TMS strengthens the evidence for the left SFS playing a causal role in the evidence accumulation process. By combining TMS with fMRI and advanced computational modelling of behaviour, the authors go beyond previous correlational studies in the field and provide converging behavioural, computational, and neural evidence of the specific role that the left SFS may play.

      Sophisticated and rigorous analysis approaches are used throughout.

      Weaknesses:

      (3.1) Though the stimuli and motor responses were equalised between the perception and value-based decision tasks, reaction times (according to Figure 1) and potential difficulty (Figure 2) were not matched. Hence, differences in task difficulty might represent an alternative explanation for the effects being specific to the perception task rather than domain-specificity per se.

      We agree that RTs cannot be matched a priori, and we did not intend them to be. Instead, we equated the inputs to the decision process and verified that each task relied exclusively on its task-relevant evidence. As reported in Results—Behaviour: validity of task-relevant pre-requisites (Fig. 1b–c), accuracy and RTs vary monotonically with the appropriate evidence regressor (SD for PDM; VD for VDM), with no effect of the task-irrelevant regressor. This separability check addresses differences in baseline RTs by showing that, for both tasks, behaviour tracks evidence as designed.

      To rule out a generic difficulty account of the TMS effect, we relied on the within-subject differences-in-differences (DID) framework described in Methods (Differences-in-differences). The key Task × TMS interaction compares the pre→post change in PDM with the pre→post change in VDM while controlling for trialwise evidence and RT covariates. Any time-on-task or unspecific difficulty drift shared by both tasks is subtracted out by this contrast. Using this specification, TMS selectively reduced accuracy for PDM but not VDM (Fig. 3a; Supplementary Fig. 2a,c; Supplementary Tables 5–7).
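
      The logic of the Task × TMS contrast can be sketched with a simple per-participant difference of differences. This is a simplified illustration only: the model described in the manuscript is a regression with trialwise covariates, whereas the function `did_interaction` and the accuracy values below are hypothetical.

```python
import numpy as np

def did_interaction(pre_a, post_a, pre_b, post_b):
    """Per-participant (post - pre) change in task A minus the change in task B."""
    pre_a, post_a, pre_b, post_b = map(np.asarray, (pre_a, post_a, pre_b, post_b))
    return (post_a - pre_a) - (post_b - pre_b)

# Hypothetical accuracies for three participants (not the study's data)
pdm_pre, pdm_post = [0.80, 0.78, 0.82], [0.74, 0.73, 0.75]
vdm_pre, vdm_post = [0.76, 0.75, 0.79], [0.76, 0.74, 0.79]

did = did_interaction(pdm_pre, pdm_post, vdm_pre, vdm_post)
print("mean DID:", did.mean())

# A drift shared by both tasks (e.g. time-on-task) cancels out of the contrast.
shared_drift = 0.02
did_with_drift = did_interaction(pdm_pre, np.array(pdm_post) + shared_drift,
                                 vdm_pre, np.array(vdm_post) + shared_drift)
print("unchanged by shared drift:", np.allclose(did, did_with_drift))
```

      A negative mean in this toy example corresponds to a larger post-stimulation accuracy drop in the perceptual task than in the value-based task.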

      Finally, the hierarchical DDM (already in the paper) dissociates latent mechanisms. The post-TMS boundary reduction appears only in PDM, whereas VDM shows a change in non-decision time without a decision-relevant parameter change (Fig. 3c; Supplementary Figs. 4–5). If unmatched difficulty were the sole driver, we would expect parallel effects across tasks, which we do not observe.

      (3.2) No within- or between-participants sham/control TMS condition was employed. This would have strengthened the inference that the apparent TMS effects on behavioural and neural measures can truly be attributed to the left SFS stimulation and not to non-specific peripheral stimulation and/or time-on-task effects.

      We agree that a sham/control condition would further strengthen causal attribution and note this as a limitation. In mitigation, our design incorporates several safeguards already reported in the manuscript:

      · Within-subject pre/post with alternating task blocks and DID modelling (Methods) to difference out non-specific time-on-task effects.

      · Task specificity across levels of analysis: behaviour (PDM accuracy reduction only), computational (boundary reduction only in PDM; no drift change), BOLD (reduced left-SFS accumulated-evidence signal for PDM but not VDM; Fig. 4a–c), and functional coupling (SFS–occipital PPI increase during PDM only; Fig. 5).

      · Matched stimuli and motor outputs across tasks, so any peripheral sensations or general arousal effects should have influenced both tasks similarly; they did not.

      Together, these converging task-selective effects reduce the likelihood that the results reflect non-specific stimulation or time-on-task. We will add an explicit statement in the Limitations noting the absence of sham/control and outlining it as a priority for future work.

      (3.3) No a priori power analysis is presented.

      We appreciate this point. Our sample size (n = 20) matched prior causal TMS and combined TMS–fMRI studies using similar paradigms and analyses (e.g., Philiastides et al., 2011; Rahnev et al., 2016; Jackson et al., 2021; van der Plas et al., 2021; Murd et al., 2021), and was chosen a priori on that basis and the practical constraints of cTBS + fMRI. The within-subject DID approach and hierarchical modelling further improve efficiency by leveraging all trials.

      To address the reviewer’s request for transparency, we will (i) state this rationale in Methods → Participants, and (ii) ensure that all primary effects are reported with 95% CIs or posterior probabilities (already provided for the HDDM as $p_{\mathrm{mcmc}}$). We also note that the design was sensitive enough to detect RT changes in both tasks and a selective accuracy change in PDM, arguing against a blanket lack of power as an explanation for null VDM accuracy effects. We will nevertheless flag the absence of a formal prospective power analysis in the Limitations.

      Recommendations for the Authors:

      Reviewer #1 (Recommendations For The Authors):

      Some important elements of the methods are missing. How was the site for targeting the SFS with TMS identified? The methods described how M1 was located but not SFS.

      Thank you for catching this omission. In the revised Methods we explicitly describe how the left SFS target was localized. Briefly, we used each participant’s T1-weighted anatomical scan and frameless neuronavigation to place a 10-mm sphere at the a priori MNI coordinates (x = −24, y = 24, z = 36) derived from prior work (Heekeren et al., 2004; Philiastides et al., 2011). This sphere was transformed to native space for each participant. The coil was positioned tangentially with the handle pointing posterior-lateral, and coil placement was continuously monitored with neuronavigation throughout stimulation. (All of these procedures mirror what we already report for M1 and are now stated for SFS as well.)

      Where to revise the manuscript:

      Methods → Stimulation protocol. After the first sentence naming cTBS, insert: “The left SFS target was localized on each participant’s T1-weighted anatomical image using frameless neuronavigation. A 10-mm radius sphere was centered at the a priori MNI coordinates x = −24, y = 24, z = 36 (Heekeren et al., 2004; Philiastides et al., 2011), then transformed to native space. The MR-compatible figure-of-eight coil was positioned tangentially over the target with the handle oriented posterior-laterally, and its position was tracked and maintained with neuronavigation during stimulation.”

      It is not clear how participants were instructed that they should perform the value-difference task. Were they told that they should choose based on their original item value ratings or was it left up to them?

      We agree the instruction should be explicit. Participants were told: “In value-based blocks, choose the item you would prefer to eat at the end of the experiment.” They were informed that one VDM trial would be randomly selected for actual consumption, ensuring incentive-compatibility. We did not ask them to recall or follow their earlier ratings; those ratings were used only to construct evidence (value difference) and to define choice consistency offline.

      Where to revise the manuscript:

      Methods → Experimental paradigm.

      Add a sentence to the VDM instruction paragraph:

      “In value-based (LIKE) blocks, participants were instructed to choose the item they would prefer to consume at the end of the experiment; one VDM trial was randomly selected and implemented, making choices incentive-compatible. Prior ratings were used solely to construct value-difference evidence and to score choice consistency; participants were not asked to recall or match their earlier ratings.”

      Line 86 Introduction, some previous studies were conducted on animals. Why it is problematic that the studies were conducted in animals is not stated. I assume the authors mean that we do not know if their findings will translate to the human brain? I think in fairness to those working with animals it might be worth an extra sentence to briefly expand on this point.

      We appreciate this and will clarify that animal work is invaluable for circuit-level causality, but species differences and putative non-homologous areas (e.g., human SFS vs. rodent FOF) limit direct translation. Our point is not that animal studies are problematic, but that establishing causal roles in humans remains necessary.

      Revision:

      Introduction (paragraph discussing prior animal work). Replace the current sentence beginning “However, prior studies were largely correlational” with:

      “Animal studies provide critical causal insights, yet direct translation to humans can be limited by species-specific anatomy and potential non-homologies (e.g., human SFS vs. frontal orienting fields in rodents). Therefore, establishing causal contributions in the human brain remains essential.”

      Line 100-101: "or whether its involvement is peripheral and merely functionally supporting a larger system" - it is not clear what you mean by 'supporting a larger system'

      We meant that observed SFS activity might reflect upstream/downstream support processes (e.g., attentional control or working-memory maintenance) rather than the computation of evidence accumulation itself. We have rephrased to avoid ambiguity.

      Revision:

      Introduction. Replace the phrase with:

      “or whether its observed activity reflects upstream or downstream support processes (e.g., attention or working-memory maintenance) rather than the accumulation computation per se.”

      The authors do have to make certain assumptions about the BOLD patterns that would be expected of an evidence accumulation region. These assumptions are reasonable and have been adopted in several previous neuroimaging studies. Nevertheless, it should be acknowledged that alternative possibilities exist and this is an inevitable limitation of using fMRI to study decision making. For example, if it turns out that participants collapse their boundaries as time elapses, then the assumption that trials with weaker evidence should have larger BOLD responses may not hold - the effect of more prolonged activity could be cancelled out by the lower boundaries. Again, I think this is just a limitation that could be acknowledged in the Discussion, my opinion is that this is the best effort yet to identify choice-relevant regions with fMRI and the authors deserve much credit for their rigorous approach.

      Agreed. We already ground our BOLD regressors in the DDM literature, but acknowledge that alternative mechanisms (e.g., time-dependent boundaries) can alter expected BOLD–evidence relations. We now add a short limitation paragraph stating this explicitly.

      Revision:

      Discussion (limitations paragraph). Add:

      “Our fMRI inferences rest on model-based assumptions linking accumulated evidence to BOLD amplitude. Alternative mechanisms—such as time-dependent (collapsing) boundaries—could attenuate the prediction that weaker-evidence trials yield longer accumulation and larger BOLD signals. While our behavioural and neural results converge under the DDM framework, we acknowledge this as a general limitation of model-based fMRI.”

      Reviewer #2 (Recommendations For The Authors):

      Minor points

      I suggest the proportion of missed trials should be reported.

      Thank you for the suggestion. In our preprocessing we excluded trials with no response within the task’s response window and any trials failing a priori validity checks. Because non-response trials contain neither a choice nor an RT, they are not entered into the DDM fits or the fMRI GLMs and, by design, carry no weight in the reported results. To keep the focus on the data that informed all analyses, we now (i) state the trial-inclusion criteria explicitly and (ii) report the number of analysed (valid) trials per task and run. This conveys the effective sample size contributing to each condition without altering the analysis set.

      Revision:

      Methods → (at the end of “Experimental paradigm”): “Analyses were conducted on valid trials only, defined as trials with a registered response within the task’s response window and passing pre-specified validity checks; trials without a response were excluded and not analysed.”

      Results → “Behaviour: validity of task-relevant pre-requisites” (add one sentence at the end of the first paragraph): “All behavioural and fMRI analyses were performed on valid trials only (see Methods for inclusion criteria).”

      Figure 4 c is very confusing. Is the legend or caption backwards?

      Thanks for flagging. We corrected the Figure 4c caption to match the colouring and contrasts used in the panel (perceptual = blue/green overlays; value-based = orange/red; ‘post–pre’ contrasts explicitly labeled). No data or analyses were changed, just the wording to remove ambiguity.

      Revision:

      Figure 4 caption (panel c sentence). Replace with:

      “(c) Post–pre contrasts for the trialwise accumulated-evidence regressor show reduced left-SFS BOLD during perceptual decisions (green overlay), with a significantly stronger reduction for perceptual vs value-based decisions (blue overlay). No reduction is observed for value-based decisions.”

      Even if not statistically significant it may be of interest to add the results for Value-based decision making on SFS in Supplementary Table 3.

      Done. We now include the SFS small-volume results for VDM (trialwise accumulated-evidence regressor) alongside the PDM values in the same table, with exact peak, cluster size, and statistics.

      Revision:

      Supplementary Table 3 (title):

      “Regions encoding trialwise accumulated evidence (parametric modulation) during perceptual and value-based decisions, including SFS SVC results for both tasks.”

      Model comparisons: please explain how model complexity is accounted for.

      We clarify that model evidence was compared using the Deviance Information Criterion (DIC), which penalizes model fit by an effective number of parameters (pD). Lower DIC indicates better out-of-sample predictive performance after accounting for model complexity.

      Revision:

      Methods → Hierarchical Bayesian neural-DDM (last paragraph). Add:

      “Model comparison used the Deviance Information Criterion (DIC = D̄ + pD), where pD is the effective number of parameters; thus DIC penalizes model complexity. Lower DIC denotes better predictive accuracy after accounting for complexity.”
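
      For readers unfamiliar with the criterion, the computation can be sketched from posterior deviance samples. The sketch assumes the standard definition of the effective number of parameters, pD = D̄ − D(θ̄), i.e. the mean deviance minus the deviance evaluated at the posterior mean of the parameters; the function name `dic` and the numbers are hypothetical.

```python
import numpy as np

def dic(deviance_samples, deviance_at_posterior_mean):
    """Return (DIC, pD) from posterior deviance samples.

    D_bar is the posterior mean deviance; pD = D_bar - D(theta_bar) is the
    effective number of parameters; DIC = D_bar + pD.
    """
    d_bar = float(np.mean(deviance_samples))
    p_d = d_bar - deviance_at_posterior_mean
    return d_bar + p_d, p_d

# Hypothetical deviance draws for one model (illustration only)
samples = [102.0, 98.0, 100.0, 104.0]   # mean deviance D_bar = 101
dic_value, p_d = dic(samples, deviance_at_posterior_mean=97.0)
print(dic_value, p_d)  # 105.0 4.0
```

      When two models fit the data equally well on average, the one with the larger pD receives the higher (worse) DIC.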

      Reviewer #3 (Recommendations For The Authors):

      The following issues would benefit from clarification in the manuscript:

      - It is stated that "Our sample size is well within acceptable range, similar to that of previous TMS studies." The sample size being similar to previous studies does not mean it is within an acceptable range. Whether the sample size is acceptable or not depends on the expected effect size. It is perfectly possible that the previous studies cited were all underpowered. What implications might the lack of an a priori power analysis have for the interpretation of the results?

      We agree and have revised our wording. We did not conduct an a priori power analysis. Instead, we relied on a within-participant design that typically yields higher sensitivity in TMS–fMRI settings and on convergence across behavioural, computational, and neural measures. We now acknowledge that the absence of formal power calculations limits claims about small effects (particularly for null findings in VDM), and we frame those null results cautiously.

      Revision:

      Discussion (limitations). Add:

      “The within-participant design enhances statistical sensitivity, yet the absence of an a priori power analysis constrains our ability to rule out small effects, particularly for null results in VDM.”

      - I was confused when trying to match the results described in the 'Behaviour: validity of task-relevant pre-requisites' section on page 6 to what is presented in Figure 1. Specifically, Figure 1C is cited 4 times but I believe two of these should be citing Figure 1B?

      Thank you—this was a citation mix-up. The two places that referenced “Fig. 1C” but described accuracy should in fact point to Fig. 1B. We corrected both citations.

      Revision:

      Results → Behaviour: validity… Change the two incorrect “Fig. 1C” references (when describing accuracy) to “Fig. 1B”.

      - Also, where is the 'SD' coefficient of -0.254 (p-value = 0.123) coming from in line 211? I can't match this to the figure.

      This was a typographical error in an earlier draft. The correct coefficients are those shown in the figure and reported elsewhere in the text (evidence-specific effects: for PDM RTs, SD β = −0.057, p < 0.001; for VDM RTs, VD β = −0.016, p = 0.011; non-relevant evidence terms are n.s.). We removed the erroneous value.

      Revision:

      Results → Behaviour: validity… (sentence with −0.254). Delete the incorrect value and retain the evidence-specific coefficients consistent with Fig. 1B–C.

      - It is reported that reaction times were significantly faster for the perceptual relative to the value-based decision task. Was overall accuracy also significantly different between the two tasks? It appears from Figure 3 that it might be, but I couldn't find this reported in the text.

      To avoid conflating task with evidence composition, we did not emphasize between-task accuracy averages. Our primary tests examine evidence-specific effects and TMS-induced changes within task. For completeness, we now report descriptive mean accuracies by task and point readers to the figure panels that display accuracy as a function of evidence (which is the meaningful comparison in our matched-evidence design). We refrain from additional hypothesis testing here to keep the analyses aligned with our preregistered focus.

      Revision:

      Results → Behaviour: validity… Add:

      “For completeness, group-mean accuracies by task are provided descriptively in Fig. 3a; inferential tests in the manuscript focus on evidence-specific effects and TMS-induced changes within task.”

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      The manuscript by Yin and colleagues addresses a long-standing question in the field of cortical morphogenesis, regarding factors that determine differential cortical folding across species and individuals with cortical malformations. The authors present work based on a computational model of cortical folding evaluated alongside a physical model that makes use of gel swelling to investigate the role of a two-layer model for cortical morphogenesis. The study assesses these models against empirically derived cortical surfaces based on MRI data from ferret, macaque monkey, and human brains.

      The manuscript is clearly written and presented, and the experimental work (physical gel modeling as well as numerical simulations) and analyses (subsequent morphometric evaluations) are conducted at the highest methodological standards. It constitutes an exemplary use of interdisciplinary approaches for addressing the question of cortical morphogenesis by bringing together well-tuned computational modeling with physical gel models. In addition, the comparative approaches used in this paper establish a foundation for broad-ranging future lines of work that investigate the impact of perturbations or abnormalities during cortical development.

      The cross-species approach taken in this study is a major strength of the work. However, correspondence across the two methodologies did not appear to be equally consistent in predicting brain folding across all three species. The results presented in Figures 4 (and Figures S3 and S4) show broad correspondence in shape index and major sulci landmarks across all three species. Nevertheless, the results presented for the human brain lack the same degree of clear correspondence for the gel model results as observed in the macaque and ferret. While this study clearly establishes a strong foundation for comparative cortical anatomy across species and the impact of perturbations on individual morphogenesis, further work that fine-tunes physical modeling of complex morphologies, such as that of the human cortex, may help to further understand the factors that determine cortical functionalization and pathologies.

      We thank the reviewer for positive opinions and helpful comments. Yes, the physical gel model of the human brain has a lower similarity index with the real brain. There are several reasons.

      First, the highly convoluted human cortex has a few major folds (primary sulci) and a very large number of minor folds associated with secondary or tertiary sulci (on scales of order comparable to the cortical thickness), relative to the ferret and macaque cerebral cortex. In our gel model, the exact shapes, positions, and orientations of these minor folds are stochastic, which makes it hard to have a very high similarity index of the gel models when compared with the brain of a single individual.

      Second, in real human brains, these minor folds evolve dynamically with age and show differences among individuals. In experiments with the gel brain, multiscale folds form and eventually disappear as the swelling progresses through the thickness. Our physical model results are snapshots during this dynamical process, which makes it hard to have a concrete one-to-one correspondence between the instantaneous shapes of the swelling gel and the growing human brain.

      Third, the growth of the brain cortex is inhomogeneous in space and varying with time, whereas, in the gel model, swelling is relatively homogeneous.

      We agree that further systematic work, based on our proposed methods, with more fine-tuned gel geometries and properties, might provide a deeper understanding of the relations between brain geometry, and growth-induced folds and their functionalization and pathologies. Further analysis of cortical pathologies using computational and physical gel models can be found in our companion paper (Choi et al., 2025), also published in eLife:

      G. P. T. Choi, C. Liu, S. Yin, G. Séjourné, R. S. Smith, C. A. Walsh, L. Mahadevan, Biophysical basis for brain folding and misfolding patterns in ferrets and humans. eLife, 14, RP107141, 2025. doi:10.7554/eLife.107141

      Reviewer #2 (Public review):

      This manuscript explores the mechanisms underlying cerebral cortical folding using a combination of physical modelling, computational simulations, and geometric morphometrics. The authors extend their prior work on human brain development (Tallinen et al., 2014; 2016) to a comparative framework involving three mammalian species: ferrets (Carnivora), macaques (Old World monkeys), and humans (Hominoidea). By integrating swelling gel experiments with mathematical differential growth models, they simulate sulcification instability and recapitulate key features of brain folding across species. The authors make commendable use of publicly available datasets to construct 3D models of fetal and neonatal brain surfaces: fetal macaque (ref. [26]), newborn ferret (ref. [11]), and fetal human (ref. [22]).

      Using a combination of physical models and numerical simulations, the authors compare the resulting folding morphologies to real brain surfaces using morphometric analysis. Their results show qualitative and quantitative concordance with observed cortical folding patterns, supporting the view that differential tangential growth of the cortex relative to the subcortical substrate is sufficient to account for much of the diversity in cortical folding. This is a very important point in our field, and can be used in the teaching of medical students.

      Brain folding remains a topic of ongoing debate. While some regard it as a critical specialization linked to higher cognitive function, others consider it an epiphenomenon of expansion and constrained geometry. This divergence was evident in discussions during the Strüngmann Forum on cortical development (Silver et al., 2019). Though folding abnormalities are reliable indicators of disrupted neurodevelopmental processes (e.g., neurogenesis, migration), their relationship to functional architecture remains unclear. Recent evidence suggests that the absolute number of neurons varies significantly with position (sulcus versus gyrus), with potential implications for local processing capacity (e.g., https://doi.org/10.1002/cne.25626). The field is thus in need of comparative, mechanistic studies like the present one.

      This paper offers an elegant and timely contribution by combining gel-based morphogenesis, numerical modelling, and morphometric analysis to examine cortical folding across species. The experimental design - constructing two-layer PDMS models from 3D MRI data and immersing them in organic solvents to induce differential swelling - is well-established in prior literature. The authors further complement this with a continuum mechanics model simulating folding as a result of differential growth, as well as a comparative analysis of surface morphologies derived from in vivo, in vitro, and in silico brains.

      We thank the reviewer for the very positive comments.

      I offer a few suggestions here for clarification and further exploration:

      Major Comments

      (1) Choice of Developmental Stages and Initial Conditions

      The authors should provide a clearer justification for the specific developmental stages chosen (e.g., G85 for macaque, GW23 for human). How sensitive are the resulting folding patterns to the initial surface geometry of the gel models? Given that folding is a nonlinear process, early geometric perturbations may propagate into divergent morphologies. Exploring this sensitivity, either through simulations or reference to prior work, would enhance the robustness of the findings.

      The initial geometry is one of the important factors that decides the final folding pattern. The smooth brain in the early developmental stage shows a broad consistency across individuals, and we expect the main folds to form similarly across species and individuals.

      Generally, we choose the initial geometry when the brain cortex is still relatively smooth. For the human, this corresponds approximately to GW23, as the major folds, such as the Rolandic fissure (central sulcus), arise during this developmental stage. For the macaque brain, we chose developmental stage G85, primarily because of the availability of the dataset corresponding to this time, which is also the least folded stage.

      We expect that large-scale folding patterns are strongly sensitive to the initial geometry but fine-scale features are not. Since our goal is to explain the large-scale features, we expect sensitivity to the initial shape.

      Below are some references of other researchers that are consistent with this idea. Figure 4 from Wang et al. shows some images of simulations obtained by perturbing the geometry of a sphere to an ellipsoid. We see that the growth-induced folds mostly maintain their width (wavelength), but change their orientations.

      Reference:

      Wang, X., Lefévre, J., Bohi, A., Harrach, M.A., Dinomais, M. and Rousseau, F., 2021. The influence of biophysical parameters in a biomechanical model of cortical folding patterns. Scientific Reports, 11(1), p.7686.

      Related results from the same group show that, under slight perturbations of brain geometry, these folds also tend to change their orientations but not their width/wavelength (Bohi et al., 2019).

      Reference:

      Bohi, A., Wang, X., Harrach, M., Dinomais, M., Rousseau, F. and Lefévre, J., 2019, July. Global perturbation of initial geometry in a biomechanical model of cortical morphogenesis. In 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (pp. 442-445). IEEE.

      Finally, a systematic discussion of the role of perturbations on the initial geometries and physical properties can be seen in our work on understanding a different system, gut morphogenesis (Gill et al., 2024).

      We have added the discussion about geometric sensitivity in the section Methods-Numerical Simulations:

      “Small perturbations on initial geometry would affect minor folds, but the main features of major folds, such as orientations, width, and depth, are expected to be conserved across individuals [49, 50]. For simplicity, we do not perturb the fetal brain geometry obtained from datasets.”

      (2) Parameter Space and Breakdown Points

      The numerical model assumes homogeneous growth profiles and simplifies several aspects of cortical mechanics. Parameters such as cortical thickness, modulus ratios, and growth ratios are described in Table II. It would be informative to discuss the range of parameter values for which the model remains valid, and under what conditions the physical and computational models diverge. This would help delineate the boundaries of the current modelling framework and indicate directions for refinement.

      Exploring the valid parameter space is a key problem. We have tested a series of growth parameters and will state them explicitly in our revision. In the current version, we chose the ones that yield a relatively high similarity index to the animal brains. More generally, folding patterns are largely regulated by geometry as well as physical parameters, such as cortical thickness, modulus ratios, growth ratios, and inhomogeneity. In our previous work on a different system, gut morphogenesis, where similar folding patterns are seen, we have explored these features (Gill et al., 2024).

      Reference:

      Gill, H.K., Yin, S., Nerurkar, N.L., Lawlor, J.C., Lee, C., Huycke, T.R., Mahadevan, L. and Tabin, C.J., 2024. Hox gene activity directs physical forces to differentially shape chick small and large intestinal epithelia. Developmental Cell, 59(21), pp.2834-2849.

      (3) Neglected Regional Features: The Occipital Pole of the Macaque

      One conspicuous omission is the lack of attention to the occipital pole of the macaque, which is known to remain smooth even at later gestational stages and has an unusually high neuronal density (2.5× higher than adjacent cortex). This feature is not reproduced in the gel or numerical models, nor is it discussed. Acknowledging this discrepancy, and speculating on possible developmental or mechanical explanations, would add depth to the comparative analysis. The authors may wish to include this as a limitation or a target for future work.

      Yes, we have added that the omission of the Occipital Pole of the macaque is one of our paper’s limitations. Our main aim in this paper is to explore the formation of large-scale folds, so the smooth region is not discussed. But future work could include this to make the model more complete.

      The main text has been modified in Methods, Numerical simulations:

      “To focus on fold formation, we did not discuss the relatively smooth region, such as the Occipital Pole of the macaque.”

      and also in the caption of Figure 4: “... The occipital pole region of macaque brains remains smooth in real and simulated brains.”

      (4) Spatio-Temporal Growth Rates and Available Human Data

      The authors note that accurate, species-specific spatio-temporal growth data are lacking, limiting the ability to model inhomogeneous cortical expansion. While this may be true for ferret and macaque, there are high-quality datasets available for human fetal development, now extended through ultrasound imaging (e.g., https://doi.org/10.1038/s41586-023-06630-3). Incorporating or at least referencing such data could improve the fidelity of the human model and expand the applicability of the approach to clinical or pathological scenarios.

      We thank the reviewer for pointing out the very useful datasets that exist for the exploration of inhomogeneous growth driven folding patterns. We have referred to this paper to provide suggestions for further work in exploring the role of growth inhomogeneities.

      We have referred to this high-quality dataset in our main text, Discussion:

      “...the effect of inhomogeneous growth needs to be further investigated by incorporating regional growth of the gray and white matter not only in human brains [29, 31] based on public datasets [45], but also in other species.”

      A few works have tried to incorporate inhomogeneous growth in simulating human brain folding by separating the central sulcus area into several lobes (e.g., lobe parcellation method, Wang, PhD Thesis, 2021). Since our goal in this paper is to explain the large-scale features of folding in a minimal setting, we have kept our model simple and show that it is still capable of capturing the main features of folding in a range of mammalian brains.

      Reference:

      Xiaoyu Wang. Modélisation et caractérisation du plissement cortical. Signal and Image Processing. Ecole nationale superieure Mines-Télécom Atlantique, 2021. English. 〈NNT : 2021IMTA0248〉.

      (5) Future Applications: The Inverse Problem and Fossil Brains

      The authors suggest that their morphometric framework could be extended to solve the inverse growth problem, that is, reconstructing fetal geometries from adult brains. This speculative but intriguing direction has implications for evolutionary neuroscience, particularly the interpretation of fossil endocasts. Although beyond the scope of this paper, I encourage the authors to elaborate briefly on how such a framework might be practically implemented and validated.

      For the inverse problem, we could use the following strategies:

      a. Perform systematic simulations using different geometries and physical parameters to obtain the variation in morphologies as a function of parameters.

      b. Use either supervised training or unsupervised training (physics-informed neural networks, PINNs) to learn these characteristic morphologies and classify their dependence on the parameters using neural networks. These can then be trained to determine the possible range of geometrical and physical parameters that yield buckled patterns seen in the systematic simulations.

      c. Reconstruct the 3D surface from fossil endocasts. Using the well-trained neural network, it should be possible to predict the initial shape of the smooth brain cortex, growth profile, and stiffness ratio of the gray and white matter.

      As an example in this direction, supervised neural networks have been used recently to solve the forward problem to predict the buckling pattern of a growing two-layer system (Chavoshnejad et al., 2023). The inverse problem can then be solved using machine-learning methods when the training datasets are the folded shape, which are then used to predict the initial geometry and physical properties.

      Reference:

      Chavoshnejad, P., Chen, L., Yu, X., Hou, J., Filla, N., Zhu, D., Liu, T., Li, G., Razavi, M.J. and Wang, X., 2023. An integrated finite element method and machine learning algorithm for brain morphology prediction. Cerebral Cortex, 33(15), pp.9354-9366.
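The three-step strategy (a) to (c) above can be sketched in miniature. This is only a toy under loud assumptions: `toy_forward` is a hypothetical stand-in for a folding simulation and is not a physical law, the two descriptors it returns are invented for illustration, and a nearest-neighbour lookup replaces the neural network purely to keep the example short and dependency-free.

```python
import itertools

import numpy as np


def toy_forward(thickness, growth_ratio):
    """Hypothetical stand-in for a folding simulation (NOT a physical law).

    Returns two made-up morphology descriptors: a fold wavelength and a
    fold depth.
    """
    wavelength = 4.0 * thickness
    depth = thickness * max(growth_ratio - 1.0, 0.0)
    return np.array([wavelength, depth])


def build_table(thicknesses, growth_ratios):
    """Step (a): sweep the parameter grid and record the descriptors."""
    params = np.array(list(itertools.product(thicknesses, growth_ratios)))
    feats = np.array([toy_forward(h, g) for h, g in params])
    return params, feats


def invert(observed, params, feats):
    """Steps (b) and (c): return the swept parameters whose simulated
    descriptors lie closest to the observed ones (nearest-neighbour lookup
    used here in place of a trained network)."""
    scale = feats.std(axis=0)
    scale[scale == 0] = 1.0  # guard against constant descriptors
    dist = np.linalg.norm((feats - np.asarray(observed)) / scale, axis=1)
    return params[np.argmin(dist)]
```

In a real application the lookup table would be replaced by simulations of the full model and a trained regressor, and the observed descriptors would come from the reconstructed endocast surface.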

      Conclusion

      This is a well-executed and creative study that integrates diverse methodologies to address a longstanding question in developmental neurobiology. While a few aspects (such as regional folding peculiarities, sensitivity to initial conditions, and available human data) could be further elaborated, they do not detract from the overall quality and novelty of the work. I enthusiastically support this paper and believe that it will be of broad interest to the neuroscience, biomechanics, and developmental biology communities.

      Note: The paper mentions a companion paper [reference 11] that explores the cellular and anatomical changes in the ferret cortex. I did not have access to this manuscript, but judging from the title, this paper might further strengthen the conclusions.

      The companion paper (Choi et al., 2025) has also been submitted to eLife and can be found here:

      G. P. T. Choi, C. Liu, S. Yin, G. Séjourné, R. S. Smith, C. A. Walsh, L. Mahadevan, Biophysical basis for brain folding and misfolding patterns in ferrets and humans. eLife, 14, RP107141, 2025. doi:10.7554/eLife.107141

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      This study was conducted and presented to the highest methodological standards. It is clearly written, and the results are thoroughly presented in the main manuscript and supplementary materials. Nevertheless, I would present the following minor points and comments for consideration by the authors prior to finalizing their work:

      We thank the reviewer for positive opinions and helpful comments.

      (1) Where did the MRI-based cortical surface data come from? Specifically, it would be helpful to include more information regarding whether the surfaces were reconstructed based on individual- or group-level data. It appears the surfaces were group-level, and, if so, accounting for individual-level cortical folding may be a fruitful direction for future work.

      The surface data come from public databases, which are stated in the Methods section: “We used a publicly available database for all our 3d reconstructions: fetal macaque brain surfaces are obtained from Liu et al. (2020); newborn ferret brain surfaces are obtained from Choi et al. (2025); and fetal human brain surfaces are obtained from Tallinen et al. (2016).”

      These surfaces are reconstructed based on group-level data. Specifically, the macaque atlas images are constructed for brains at gestational ages of 85 days (G85, N = 18, 9 females), 110 days (G110, N = 10, 7 females), and 135 days (G135, N = 16, 7 females). And yes, future work may focus on individual-level cortical folding, and we expect that more specific results could be found.

      (2) One methodological approach for assessing consistency of cortical folding within species might be an evaluation of cross-hemispheric symmetry. I would find this particularly interesting with respect to the gel models, as it could complement the quantification of variation with respect to the computationally derived and real surfaces.

      Yes, the cross-hemispheric symmetry comparison can be done by our morphometric analysis method. We have added the results of ferret brain’s left-right symmetry for gel models, simulations, and real surfaces in the supplementary material. A typical conformal mapping figure and the similarity index table are shown here.
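To make the left-right comparison concrete, here is a minimal sketch of a shape-index-based similarity score. The shape index follows the standard Koenderink and van Doorn definition from principal curvatures. The similarity score, however, is only one plausible p-norm-based choice (with p = 2) and is an assumption, not the definition used in the manuscript. It also assumes the two hemispheres have already been brought into vertex-wise correspondence, for example through the conformal mapping mentioned above.

```python
import numpy as np


def shape_index(k1, k2):
    """Shape index in [-1, 1] from principal curvatures (any order).

    S = (2 / pi) * arctan((k_max + k_min) / (k_max - k_min)).
    The sign convention depends on the surface normal orientation.
    Planar points (k1 = k2 = 0) are mapped to 0 by arctan2(0, 0).
    """
    k1 = np.asarray(k1, dtype=float)
    k2 = np.asarray(k2, dtype=float)
    k_max = np.maximum(k1, k2)
    k_min = np.minimum(k1, k2)
    return (2.0 / np.pi) * np.arctan2(k_max + k_min, k_max - k_min)


def similarity_index(si_a, si_b, p=2):
    """Hypothetical similarity in [0, 1] between two shape-index maps.

    Uses the p-norm of the vertex-wise difference, normalized by the
    largest possible value (a difference of 2 at every vertex).
    """
    diff = np.abs(np.asarray(si_a, dtype=float) - np.asarray(si_b, dtype=float))
    return 1.0 - np.mean(diff**p) ** (1.0 / p) / 2.0
```

Identical maps give a score of 1 and maximally opposite maps give 0, so the score can be read on the same scale for gel models, simulations, and real surfaces.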

      (3) Was there a specific reason to reorder the histogram plots in Figure 4c to macaque, ferret, human rather than to maintain the order presented in Figure 4a/b of ferret, macaque, human? I appreciate that this is a minor concern, and all subplots are indeed properly titled, but consistent order may improve clarity.

      We have reordered the histogram plots to make all the figure orders consistent.

      Reviewer #2 (Recommendations for the authors):

      (1) Please consider revising the caption of Figure 1 (or equivalent figures) to explicitly state whether features such as the macaque occipital flatness were reproduced or not.

      We thank the reviewer for pointing out the macaque occipital flatness.

      Author response table 1.

      Left-right similarity index evaluated by comparing the shape index of ferret brains, calculated with the vector p-norm, p = 2.

      Author response image 1.

      Left-right similarity index of ferret brains

      The occipital pole of the macaque remains relatively smooth in both real brains and computational models. However, our main aim in this paper is to explore the formation of large-scale folds, so the smooth region is not discussed in depth. Future work could include this to make the model more complete.

      (2) Some figures could benefit from clearer labelling to distinguish between in vivo, in vitro, and in silico results.

      We have supplemented some texts in panels to make the labelling clearer.

      (3) The manuscript would benefit from a short paragraph in the Discussion reflecting on how future incorporation of regional heterogeneities might improve model fidelity.

      We have added a sentence in the Discussion Section about improving the model fidelity by considering regional heterogeneities.

      “Future more accurate models incorporating spatio-temporal inhomogeneous growth profiles and mechanical properties, such as varying stiffness, would make the folding pattern closer to the real cortical folding. This relies on more in vivo measurements of the brain’s physical properties and cortical expansion.”

      (4) Suggestions for improved or additional experiments, data, or analyses.

      (5) Clarify and justify the selection of developmental stages: The authors should explain why particular gestational stages (e.g., G85 for macaque, GW23 for human) were chosen as starting points for the physical and computational models. A discussion of how sensitive the folding patterns are to the initial geometry would help assess the robustness of the model. If feasible, a brief sensitivity analysis, varying initial age or surface geometry, would strengthen the conclusions.

      The initial geometry is one of the important factors that decides the final folding pattern. The smooth brain in the early developmental stage shows a broad consistency across individuals, and we expect the main folds to form similarly across species and individuals.

      Generally, we choose the initial geometry when the brain cortex is still relatively smooth. For the human, this corresponds approximately to GW23, as the major folds, such as the Rolandic fissure (central sulcus), arise during this developmental stage. For the macaque brain, we chose developmental stage G85, primarily because of the availability of the dataset corresponding to this time, which is also the least folded stage.

      We expect that large-scale folding patterns are strongly sensitive to the initial geometry but fine-scale features are not. Since our goal is to explain the large-scale features, we expect sensitivity to the initial shape.

      We have added the discussion about geometric sensitivity in the section Methods-Numerical Simulations: “Small perturbations on initial geometry would affect minor folds, but the main features of major folds, such as orientations, width, and depth, are expected to be conserved across individuals [49, 50]. For simplicity, we do not perturb the fetal brain geometry obtained from datasets.”

      (6) Explore parameter boundaries more explicitly: The paper would benefit from a clearer account of the ranges of mechanical and geometric parameters (e.g., growth ratios, cortical thickness) for which the model holds. Are there specific conditions under which the physical and numerical models diverge? Identifying breakdown points would help readers understand the model’s limitations and applicability.

      Exploring the valid parameter space is a key problem. We have tested a series of growth parameters and will state them explicitly in our revision. In the current version, we chose the ones that yield a relatively high similarity index to the animal brains. More generally, folding patterns are largely regulated by geometry as well as physical parameters, such as cortical thickness, modulus ratios, and growth ratios and inhomogeneity. In our previous work on a different system, gut morphogenesis, where similar folding patterns are seen, we have explored these features (Gill et al., 2024).

      (7) Address species-specific cortical peculiarities: A striking omission is the flat occipital pole of the macaque, which is not reproduced in the physical or computational models. Given its known anatomical and cellular distinctiveness, this discrepancy warrants discussion. Even if not explored experimentally, the authors could speculate on what developmental or mechanical conditions would be needed to reproduce such regional smoothness.

      Please refer to our answer to public reviewer 2, question (3). From our results, the formation of a smooth occipital pole might indicate that the spatio-temporal growth rates of gray and white matter are consistent in this region, such that there is not much differential growth.

      (8) Consider integration of available human growth data: While the authors note the lack of spatiotemporal growth data across species, such datasets exist for human fetal brain development, including those from MRI and ultrasound studies (e.g., Nature 2023). Incorporating these into the human model, or at least discussing their implications, would enhance biological relevance.

      Yes, some datasets for fetal human brains have provided very comprehensive measurements on brain shapes at many developmental stages. This can surely be implemented in our current model by calculating the spatio-temporal growth rate from regional cortical shapes at different stages.

      (9) Recommendations for improving the writing and presentation:

      a) The manuscript is generally well-written, but certain sections would benefit from more explicit links between the biological phenomena and the modeling framework. For instance, the Introduction and Discussion could more clearly articulate how mechanical principles interface with genetic or cellular processes, especially in the context of evolution and developmental variation.

      We have briefly discussed the gene-regulated cellular process and the induced changes of mechanical properties and growth rules in SI, table S1. In the main text, to be clearer, we have added a sentence:

      “Many malformations are related to gene-regulated abnormal cellular processes and mechanical properties, which are discussed in SI”

      b) The Discussion could better acknowledge limitations and future directions, including regional differences in folding, inter-individual variability, and the model’s assumptions of homogeneous material properties and growth.

      In the discussion section, we have pointed out four main limitations and open directions based on our current model, including the discussion on spatiotemporal growth and property. To be more complete, we have supplemented other limitations on the regional differences in folding and the interindividual variability. In the main text, we added the following sentence:

      “In addition to the homogeneity assumption, we have not investigated the inter-individual variability and regional differences in folding. More accurate and specific work is expected to focus on these directions.”

      c) The authors briefly mention the potential for addressing the inverse growth problem. Expanding this idea in a short paragraph, perhaps with hypothetical applications to fossil brain reconstructions, would broaden the paper’s appeal to evolutionary neuroscientists.

      We have stated general steps in the response to public reviewer 2, question (5).

      (10) Minor corrections to the text and figures:

      a) Figures:

      Label figures more clearly to distinguish between in vivo, in vitro, and in silico brain representations.

      Ensure that the occipital pole of the macaque is visible or annotated, especially if it lacks the expected smoothness.

      Add scale bars where missing for clarity in morphometric comparisons.

      We thank the reviewer for suggestions to improve the readability of our manuscript.

      The in vivo (real), in vitro (gel), and in silico (simulated) results are distinguished both by their labels and by different color schemes: gray-white for real brain, pink-white for gel model, and blue-white for simulations, respectively.

      The occipital pole of the macaque brain remains relatively smooth in our computational model but not in our physical gel model. We have clarified this in the main text: “To focus on fold formation, we did not discuss the relatively smooth region, such as the Occipital Pole of the macaque.”

      All the brain models are rescaled to the same size, where the distance between the anterior-most pointof the frontal lobe and the posterior-most point of the occipital lobe is two units.

      b) Text:

      Consider revising figure captions to explicitly mention whether specific regional features (e.g., flat occipital pole) were observed or absent in models.

      In Table II (and relevant text), ensure parameter definitions are consistent and explained clearly for a cross-disciplinary audience.

      Add citations to recent human fetal growth imaging work (e.g., ultrasound-based studies) to support claims about available data.

      We have added some descriptions of the characteristics of the folding pattern to the caption of Figure 4, including major folds and smooth regions.

      “Three or four major folds of each brain model are highlighted and served as landmarks. The occipital pole region of macaque brains remains smooth in real and simulated brains.”

      We have clarified the definition of the growth ratio g<sub>g</sub>/g<sub>w</sub> and the stiffness ratio µ<sub>g</sub>/µ<sub>w</sub> between gray matter and white matter, and the normalized cortical thickness h/L in Table 2.

      We have referred to a high-quality fetal brain imaging dataset, based on ultrasound imaging (Namburete et al. 2023), in the Discussion of our main text:

      “...the effect of inhomogeneous growth needs to be further investigated by incorporating regional growth of the gray and white matter not only in human brains [29, 31] based on public datasets [45], but also in other species.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Lack of Sensitivity Analyses for some Key Methodological Decisions: Certain methodological choices in this manuscript diverge from approaches used in previous works. In these cases, I recommend the following: (i) The authors could provide a clear and detailed justification for these deviations from established methods, and (ii) supplementary sensitivity analyses could be included to ensure the robustness of the findings, demonstrating that the results are not driven primarily by these methodological changes. Below, I outline the main areas where such evaluations are needed:

      This detailed guidance is incredibly valuable, and we are grateful. Work of this kind is in its relative infancy, and there are so many design choices depending on the data available, questions being addressed, and so on. Help in navigating them has been extremely useful. In our revised manuscript we are very happy to add additional justification for design choices made, and wherever possible test the impact of those choices. It is certainly the case that different approaches have been used across the handful of papers published in this space, and, unlike in other areas of systems neuroscience, we have yet to reach the point where any of these approaches are established. We agree with the reviewer that wherever possible these design choices should be tested.

      Use of Communicability Matrices for Structural Connectivity Gradients: The authors chose to construct structural connectivity gradients using communicability matrices, arguing that diffusion map embedding "requires a smooth, fully connected matrix." However, by definition, the creation of the affinity matrix already involves smoothing and ensures full connectedness. I recommend that the authors include an analysis of what happens when the communicability matrix step is omitted. This sensitivity test is crucial, as it would help determine whether the main findings hold under a simpler construction of the affinity matrix. If the results significantly change, it could indicate that the observations are sensitive to this design choice, thereby raising concerns about the robustness of the conclusions. Additionally, if the concern is related to the large range of weights in the raw structural connectivity (SC) matrix, a more conventional approach is to apply a log-transformation to the SC weights (e.g., log(1+𝑆𝐶<sub>𝑖𝑗</sub>)), which may yield a more reliable affinity matrix without the need for communicability measures.

      The reason we used communicability is indeed partly because we wanted to guarantee a smooth fully connected matrix, but also because our end goal for this project was to explore structure-function coupling in these low-dimensional manifolds.  Structural communicability – like standard metrics of functional connectivity – includes both direct and indirect pathways, whereas streamline counts only capture direct communication. In essence we wanted to capture not only how information might be routed from one location to another, but also the more likely situation in which information propagates through the system. 

      In the revised manuscript we have given a clearer justification for why we wanted to use communicability as our structural measure (Page 4, Line 179):

      “To capture both direct and indirect paths of connectivity and communication, we generated weighted communicability matrices using SIFT2-weighted fibre bundle capacity (FBC). These communicability matrices reflect a graph theory measure of information transfer previously shown to maximally predict functional connectivity (Esfahlani et al., 2022; Seguin et al., 2022). This also foreshadowed our structure-function coupling analyses, whereby network communication models have been shown to increase coupling strength relative to streamline counts (Seguin et al., 2020)”.

      We have also referred the reader to a new section of the Results that includes the structural gradients based on the streamline counts (Page 7, line 316):

      “Finally, as a sensitivity analysis, to determine the effect of communicability on the gradients, we derived affinity matrices for both datasets using a simpler measure: the log of raw streamline counts. The first 3 components derived from streamline counts compared to communicability were highly consistent across both NKI  (r<sub>s</sub> = 0.791, r<sub>s</sub> = 0.866, r<sub>s</sub> = 0.761) and the referred subset of CALM (r<sub>s</sub> = 0.951, r<sub>s</sub> = 0.809, r<sub>s</sub> = 0.861), suggesting that in practice the organisational gradients are highly similar regardless of the SC metric used to construct the affinity matrices”. 
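
      The two structural inputs contrasted in this sensitivity analysis can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the commonly used strength-normalised definition of weighted communicability (the matrix exponential of D<sup>-1/2</sup>WD<sup>-1/2</sup>), and the function names and the toy 4-node matrix are hypothetical.

```python
import numpy as np

def weighted_communicability(sc):
    """Weighted communicability, assumed here to be expm(D^-1/2 @ W @ D^-1/2)."""
    sc = np.asarray(sc, dtype=float)
    strength = sc.sum(axis=1)
    inv_sqrt = np.zeros_like(strength)
    nonzero = strength > 0
    inv_sqrt[nonzero] = 1.0 / np.sqrt(strength[nonzero])
    normalised = sc * np.outer(inv_sqrt, inv_sqrt)
    # Matrix exponential via eigendecomposition (valid for a symmetric matrix).
    evals, evecs = np.linalg.eigh(normalised)
    return (evecs * np.exp(evals)) @ evecs.T

def log_streamlines(sc):
    """Simpler alternative: log(1 + SC_ij), which keeps direct connections only."""
    return np.log1p(np.asarray(sc, dtype=float))

# Hypothetical toy connectome: a 4-node chain, so nodes 0 and 3 share no direct edge.
sc = np.array([[0, 5, 0, 0],
               [5, 0, 3, 0],
               [0, 3, 0, 8],
               [0, 0, 8, 0]], dtype=float)

comm = weighted_communicability(sc)
logsc = log_streamlines(sc)
print(comm[0, 3] > 0, logsc[0, 3] == 0)
```

      In this toy example the communicability between nodes 0 and 3 is non-zero because it sums over indirect walks, whereas the log-transformed streamline count between them stays at zero.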

      Methodological ambiguity/lack of clarity in the description of certain evaluation steps: Some aspects of the manuscript’s methodological description are ambiguous, making it challenging for future readers to fully reproduce the analyses based on the information provided. I believe the following sections would benefit from additional detail and clarification:

      Computation of Manifold Eccentricity: The description of how eccentricity was computed (both in the results and methods sections) is unclear and may be problematic. The main ambiguity lies in how the group manifold origin was defined or computed. (1) In the results section, it appears that separate manifold origins were calculated for the NKI and CALM groups, suggesting a dataset-specific approach. (2) Conversely, the methods section implies that a single manifold origin was obtained by somehow combining the group origins across the three datasets, which seems contradictory. Moreover, including neurodivergent individuals in defining the central group manifold origin is conceptually problematic. Given that neurodivergent participants might exhibit atypical brain organization, as suggested by Figure 1, this inclusion could skew the definition of what should represent a typical or normative brain manifold. A more appropriate approach might involve constructing the group manifold origin using only the neurotypical participants from both the NKI and CALM datasets. Given the reported similarity between group-level manifolds of neurotypical individuals in CALM and NKI, it would be reasonable to expect that this combined origin should be close to the origin computed within neurotypical samples of either NKI or CALM. As a sanity check, I recommend reporting the distance of the combined neurotypical manifold origin to the centres of the neurotypical manifolds in each dataset. Moreover, if the manifold origin was constructed while utilizing all samples (including neurodivergent samples) I think this needs to be reconsidered.

      This is a great point, and we are very happy to clarify. Separate manifolds were calculated for the NKI and CALM participants, hence a dataset-specific approach. Indeed, in the long-run our goal was to explore individual differences in these manifolds, relative to the respective group-level origins, and their intersection across modalities, so manifold eccentricity was calculated at an individual level for subsequent analyses. At the group level, for each modality, we computed 3 manifold origins: one for NKI, one for the referred subset of CALM, and another for the neurotypical portion of CALM. Crucially, because the manifolds are always normal, in each case the manifold origin point is near-zero (extremely near-zero, to the 6<sup>th</sup> or 7<sup>th</sup> decimal place). In other words, we do indeed calculate the origin separately each time we calculate the gradients, but the origin is zero in every case. As a result, differences in the origin point cannot be the source of any differences we observe in manifold eccentricity between groups or individuals. We have updated the Methods section with the manifold origin points for each dataset and clarified our rationale (Page 16, Line 1296):

      “Note that we used a dataset-specific approach when we computed manifold eccentricity for each of the three groups relative to their group-level origin: neurotypical CALM (SC origin = -7.698 x 10<sup>-7</sup>, FC origin = 6.724 x 10<sup>-7</sup>), neurodivergent CALM (SC origin = -6.422 x 10 , FC origin = 1.363 x 10 ), and NKI (SC origin = -7.434 x 10 , FC origin = 4.308 x 10<sup>-6</sup>). Eccentricity is a relative measure and thus normalised relative to the origin. Because of this normalisation, each time gradients are constructed the manifold origin is necessarily near-zero, meaning that differences in manifold eccentricity of individual nodes, either between groups or individuals, stem from the eccentricity of that node rather than a difference in origin point”.

      We clarified the computation of the respective manifold origins within the Results section, and referred the reader to the relevant Methods section (Page 9, line 446):

      “For each modality (2 levels: SC and FC) and dataset (3 levels: neurotypical CALM, neurodivergent CALM, and NKI), we computed the group manifold origin as the mean of their respective first three gradients. Because of the normal nature of the manifolds this necessarily means that these origin points will be very near-zero, but we include the exact values in the ‘Manifold Eccentricity’ methodology sub-section”. 
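
      The origin and eccentricity computation described above can be sketched as follows. This is an illustrative sketch only: it assumes eccentricity is the Euclidean distance of each node from the group manifold origin in the space of the first three gradients, and the function name and toy coordinates are hypothetical.

```python
import numpy as np

def manifold_eccentricity(gradients, origin):
    """Euclidean distance of each node (row) from the manifold origin."""
    gradients = np.asarray(gradients, dtype=float)
    return np.linalg.norm(gradients - np.asarray(origin, dtype=float), axis=1)

# Hypothetical group-level gradients: 4 nodes x 3 gradients.
group = np.array([[ 1.0,  0.0, 0.0],
                  [-1.0,  0.0, 0.0],
                  [ 0.0,  2.0, 0.0],
                  [ 0.0, -2.0, 0.0]])
origin = group.mean(axis=0)  # group manifold origin: mean of the first three gradients

# Hypothetical individual-level gradients, already aligned to the group template.
individual = np.array([[ 3.0, 4.0, 0.0],
                       [ 0.0, 0.0, 0.0],
                       [ 0.0, 0.0, 1.0],
                       [-1.0, 0.0, 0.0]])

ecc = manifold_eccentricity(individual, origin)
print(origin, ecc)
```

      Because the toy group gradients are centred, the origin is exactly zero here, which mirrors the near-zero origins reported above.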

      Individual-Level Gradients vs. Group-Level Gradients: Unlike previous studies that examined alterations in principal gradients (e.g., Xia et al., 2022; Dong et al., 2021), this manuscript focuses on gradients derived directly from individual-level data. In contrast, earlier works have typically computed gradients based on grouped data, such as using a moving window of individuals based on age (Xia et al.) or evaluating two distinct age groups (Dong et al.). I believe it is essential to assess the sensitivity of the findings to this methodological choice. Such an evaluation could clarify whether the observed discrepancies with previous reports are due to true biological differences or simply a result of different analytical strategies.

      This is a brilliant point. The central purpose of our project was to test how individual differences in these gradients, and their intersection across modalities, related to differences in phenotype (e.g. cognitive difficulties). This necessitated calculating gradients at the level of individuals and building a pipeline to do so, given that we could find no other examples. Nonetheless, despite this different goal and thus approach, we had expected to replicate a couple of other key findings, most prominently the ‘swapping’ of gradients shown by Dong et al. (2021). We were also surprised that we did not find this change in order. The reviewer is right that there could be several design features that produce the difference, and in the revised manuscript we test several of them. We have added the following text to the manuscript as a sensitivity analysis for the Results sub-section titled “Stability of individual-level gradients across developmental time” (Page 7, Line 344 onwards):

      “One possibility is that our observation of gradient stability – rather than a swapping of the order for the first two gradients (Dong et al., 2021) – is because we calculated them at an individual level. To test this, we created subgroups and contrasted the first two group-level structural and functional gradients derived from children (younger than 12 years old) versus those from adolescents (12 years old and above), using the same age groupings as prior work (Dong et al., 2021). If our use of individually calculated gradients produces the stability, then we should observe the swapping of gradients in this sensitivity analysis. Using baseline scans from NKI, the primary structural gradient in childhood (N = 99), shown in Figure 1f, was highly correlated (r<sub>s</sub> = 0.995) with that derived from adolescents (N = 123). Likewise, the secondary structural gradient in childhood was highly consistent in adolescence (r<sub>s</sub> = 0.988). In terms of functional connectivity, the principal gradient in childhood (N = 88) was highly consistent in adolescence (r<sub>s</sub> = 0.990, N = 125). The secondary gradient in childhood was again highly similar in adolescence (r<sub>s</sub> = 0.984). The same result occurred in the CALM dataset: In the baseline referred subset of CALM, the primary and secondary communicability gradients derived from children (N = 258) and adolescents (N = 53) were near-identical (r<sub>s</sub> = 0.991 and r<sub>s</sub> = 0.967, respectively). The primary and secondary functional gradients derived from children (N = 130) and adolescents (N = 43) were also near-identical (r<sub>s</sub> = 0.972 and r<sub>s</sub> = 0.983, respectively). These consistencies across development suggest that gradients of communicability and functional connectivity established in childhood are the same as those in adolescence, irrespective of group-level or individual-level analysis. Put simply, our failure to replicate the swapping of gradient order in Dong et al. (2021) is not the result of calculating gradients at the level of individual participants.”

      Procrustes Transformation: It is unclear why the authors opted to include a Procrustes transformation in this analysis, especially given that previous related studies (e.g., Dong et al.) did not apply this step. I believe it is crucial to evaluate whether this methodological choice influences the results, particularly in the context of developmental changes in organizational gradients. Specifically, the Procrustes transformation may maximize alignment to the group-level gradients, potentially masking individual-level differences. This could result in a reordering of the gradients (e.g., swapping the first and second gradients), which might obscure true developmental alterations. It would be informative to include an analysis showing the impact of performing vs. omitting the Procrustes transformation, as this could help clarify whether the observed effects are robust or an artifact of the alignment procedure. (Please also refer to my comment on adding a subplot to Figure 1). Additionally, clarifying how exactly the transformation was applied to align gradients across hemispheres, individuals, and/or datasets would help resolve ambiguity. 

      The current study investigated individual differences in connectome organisation, rather than group-level trends (Dong et al., 2021). This necessitates aligning individual gradients to the corresponding group-level template using a Procrustes rotation. Without a rotation, there is no way of knowing if you are comparing  ‘like with like’: the manifold eccentricity of a given node may appear to change across individuals simply due to subtle differences in the arbitrary orientation of the underlying manifolds. We also note that prior work examining individual differences in principal alignment have used Procrustes (Xia et al., 2022), who demonstrated emergence of the principal gradient across development, albeit with much smaller effects than Dong and colleagues (2021). Nonetheless, we agree, the Procrustes rotation could be another source of the differences we observed with the previous paper (Dong et al. 2021). We explored the impact of the Procrustes rotation on individual gradients as our next sensitivity analysis. We recalculated everyone’s gradients without Procrustes rotation. We then tested the alignment of each participant with the group-level gradients using Spearman’s correlations, followed by a series of generalised linear models to predict principal gradient alignment using head motion, age, and sex. The expected swapping of the first and second functional gradient (Dong et al., 2021) would be represented by a decrease in the spatial similarity of each child’s principal functional gradient to the principal childhood group-level gradient, at the onset of adolescence (~age 12). However, there is no age effect on this unrotated alignment, suggesting that the lack of gradient swapping in our data does not appear to be the result of the Procrustes rotation. When you use unrotated individual gradients the alignment is remarkably consistent across childhood and adolescence. Alignment is, however, related to head motion, which is often related to age. 
To emphasise the importance of motion, particularly in relation to development, we conducted a mediation analysis of the relationship between age and principal alignment (without correcting for motion), with motion as a mediator, within the NKI dataset. Before accounting for motion, the relationship between age and principal alignment is significant, but this can be entirely accounted for by motion. In our revised manuscript we have included this additional analysis in the Results sub-section titled “Stability of individual-level gradients across developmental time”, following on from the above point about the effect of group-level versus individual-level analysis (Page 8, Line 400):

      “A second possible source of discrepancy between our results and those of prior work examining developmental change in group-level functional gradients (Dong et al., 2021) was the use of Procrustes alignment. Such alignment of individual-level gradients to group-level templates is a necessary step to ensure valid comparisons between corresponding gradients across individuals, and has been implemented in sliding-window developmental work tracking functional gradient development (Xia et al., 2022). Nonetheless, we tested whether our observation of stable principal functional and communicability gradients may be an artefact of the Procrustes rotation. We did this by modelling how individual-level alignment without Procrustes rotation to the group-level templates varies with age, head motion, and sex, as a series of generalised linear models. We included head motion as the magnitude of the Procrustes rotation has been shown to be positively correlated with mean framewise displacement (Sasse et al., 2024), and prior group-level work (Dong et al., 2021) included an absolute motion threshold rather than continuous motion estimates. Using the baseline referred CALM sample, there was no significant relationship between alignment and age (β = -0.044, 95% CI = [-0.154, 0.066], p = 0.432) after accounting for head motion and sex. Interestingly, however, head motion was significantly associated with alignment (β = -0.318, 95% CI = [-0.428, -0.207], p = 1.731 x 10<sup>-8</sup>), such that greater head motion was linked to weaker alignment. Note that older children tended to exhibit less motion for their structural scans (r<sub>s</sub> = 0.335, p < 0.001). We observed similar trends in functional alignment, whereby tighter alignment was significantly predicted by lower head motion (β = -0.370, 95% CI = [-0.509, -0.231], p = 1.857 x 10<sup>-7</sup>), but not by age (β = 0.049, 95% CI = [-0.090, 0.187], p = 0.490). 
Note that age and head motion for functional scans were not significantly related (r<sub>s</sub> = -0.112, p = 0.137). When repeated for the baseline scans of NKI, alignment with the principal structural gradient was not significantly predicted by either scan age (β = 0.019, 95% CI = [-0.124, 0.163], p = 0.792) or head motion (β = -0.133, 95% CI = [-0.175, 0.009], p = 0.067) together in a single model, where age and motion were negatively correlated (r<sub>s</sub> = -0.355, p < 0.001). Alignment with the principal functional gradient was significantly predicted by head motion (β = -0.183, 95% CI = [-0.329, -0.036], p = 0.014) but not by age (β= 0.066, 95% CI = [-0.081, 0.213], p = 0.377), where age and motion were also negatively correlated (r<sub>s</sub> = -0.412, p < 0.001). Across modalities and datasets, alignment with the principal functional gradient in NKI was the only example in which there was a significant correlation between alignment and age (r<sub>s</sub> = 0.164, p = 0.017) before accounting for head motion and sex. This suggests that apparent developmental effects on alignment are minimal, and where they do exist they are removed after accounting for head motion. Put together this suggests that the lack of order swapping for the first two gradients is not the result of the Procrustes rotation – even without the rotation there is no evidence for swapping”.

      “To emphasise the importance of head motion in the appearance of developmental change in alignment, we examined whether accounting for head motion removes any apparent developmental change within NKI. Specifically, we tested whether head motion mediates the relationship between age and alignment (Figure 1X), controlling for sex, given that higher motion is associated with younger children (β = -0.429, 95% CI = [-0.552, -0.305], p = 7.957 x 10<sup>-11</sup>), and stronger alignment is associated with reduced motion (β = -0.211, 95% CI = [-0.344, -0.078], p = 2.017 x 10<sup>-3</sup>). Motion mediated the relationship between age and alignment (β = 0.078, 95% CI = [0.006, 0.146], p = 1.200 x 10<sup>-2</sup>), accounting for 38.5% of the variance in the age-alignment relationship, such that the link between age and alignment became non-significant after accounting for motion (β = 0.066, 95% CI = [-0.081, 0.214], p = 0.378). This firstly confirms our GLM analyses, where we control for motion and find no age associations. Moreover, this suggests that caution is required when associations between age and gradients are observed. In our analyses, because we calculate individual gradients, we can correct for individual differences in head motion in all our analyses. However, other than using an absolute motion threshold and motion-matched child and adolescent groups, individual differences in motion were not accounted for by prior work which demonstrated a flipping of the principal functional gradients with age (Dong et al., 2021)”.

      We further clarify the use of Procrustes rotation as a separate sub-section within the Methods (Page 25, Line 1273):

      “Procrustes Rotation

      For group-level analysis, for each hemisphere we constructed an affinity matrix using a normalized angle kernel and applied diffusion-map embedding. The left hemisphere was then aligned to the right using a Procrustes rotation. For individual-level analysis, eigenvectors for the left hemisphere were aligned with the corresponding group-level rotated eigenvectors. No alignment was applied across datasets. The only exception to this was for structural gradients derived from the referred CALM cohort. Specifically, we aligned the principal gradient of the left hemisphere to the secondary gradient of the right hemisphere: this was due to the first and second gradients explaining a very similar amount of variance, and hence their order was switched”. 
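
      The alignment step described above can be illustrated with a single orthogonal Procrustes rotation. This is a minimal sketch under stated assumptions: it uses the closed-form SVD solution, the function name and the synthetic 200-node gradients are hypothetical, and the authors' toolbox may apply the rotation differently (for example, iteratively).

```python
import numpy as np

def procrustes_rotation(source, target):
    """Orthogonal matrix R minimising the Frobenius norm of (source @ R - target)."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

rng = np.random.default_rng(0)

# Hypothetical group-level template: 200 nodes x 3 gradients.
target = rng.standard_normal((200, 3))

# Hypothetical individual embedding: the same manifold in an arbitrary orientation.
theta = np.pi / 6
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0,            0.0,           1.0]])
source = target @ rot

r = procrustes_rotation(source, target)
aligned = source @ r
print(np.allclose(aligned, target))
```

      Because an orthogonal matrix can also encode reflections and axis swaps, the same step can absorb sign flips or a switched order of gradients that explain similar amounts of variance.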

      SC-FC Coupling Metric: The approach used to quantify nodal SC-FC coupling in this study appears to deviate from previously established methods in the field. The manuscript describes coupling as the "Spearman-rank correlation between Euclidean distances between each node and all others within structural and functional manifolds," but this description is unclear and lacks sufficient detail. Furthermore, this differs from what is typically referred to as SC-FC coupling in the literature. For instance, the cited study by Park et al. (2022) utilizes a multiple linear regression framework, where communicability, Euclidean distance, and shortest path length are independent variables predicting functional connectivity (FC), with the adjusted R-squared score serving as the coupling index for each node. On the other hand, the Baum et al. (2020) study, also cited, uses Spearman correlation, but between raw structural connectivity (SC) and FC values. If the authors opt to introduce a novel coupling metric, it is essential to demonstrate its similarity to these previous indices. I recommend providing an analysis (supplementary) showing the correlation between their chosen metric and those used in previous studies (e.g., the adjusted R-squared scores from Park et al. or the SC-FC correlation from Baum et al.). Furthermore, if the metrics are not similar and results are sensitive to this alternative metric, it raises concerns about the robustness of the findings. A sensitivity analysis would therefore be helpful (in case the novel coupling metric is not like previous ones) to determine whether the reported effects hold true across different coupling indices.

      This is a great point, and we are happy to take the reviewer’s recommendation. There are multiple different ways of calculating structure-function coupling. For our set of questions, it was important that our metric incorporated information about the structural and functional manifolds, rather than being a separate approach that is unrelated to these low-dimensional embeddings. Put simply, we wanted our coupling measure to be about the manifolds and gradients outlined in the early sections of the results. We note that the multiple linear regression framework was developed by Vázquez-Rodríguez and colleagues (2019), whilst the structure-function coupling computed in manifold space by Park and colleagues (2022) was operationalised as a linear correlation between z-transformed functional connectomes and structural differentiation eigenvectors. To clarify how this coupling was calculated, and to justify why we developed a new coupling method based on manifolds rather than borrow an existing approach from the literature, we have revised the manuscript to make this far clearer for readers (Page 13, line 604):

      “To examine the relationship between each node’s relative position in structural and functional manifold space, we turned our attention to structure-function coupling. Whilst prior work typically computed coupling using raw streamline counts and functional connectivity matrices, either as a correlation (Baum et al., 2020) or through a multiple linear regression framework (Vázquez-Rodríguez et al., 2019), we opted to directly incorporate low-dimensional embeddings within our coupling framework. Specifically, as opposed to correlating row-wise raw functional connectivity with structural connectivity eigenvectors (Park et al., 2022), our metric directly incorporates the relative position of each node in low-dimensional structural and functional manifold spaces. Each node was situated in a low-dimensional 3D space, the axes of which were each participant’s gradients, specific to each modality. For each participant and each node, we computed the Euclidean distance with all other nodes within structural and functional manifolds separately, producing a vector of size 200 x 1 per modality. The nodal coupling coefficient was the Spearman correlation between each node’s Euclidean distance to all other nodes in structural manifold space, and that in functional manifold space. Put simply, a strong nodal coupling coefficient suggests that that node occupies a similar location in structural space, relative to all other nodes, as it does in functional space”. 
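
      The manifold-based coupling coefficient described above can be sketched as follows. This is an illustration rather than the authors' code: the function names and the synthetic gradients are hypothetical, and each node's distance vector is assumed to include its zero distance to itself so that it has length 200.

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1..n, assigning tied values their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank-transformed vectors."""
    return np.corrcoef(average_ranks(a), average_ranks(b))[0, 1]

def pairwise_distances(gradients):
    """Euclidean distance between every pair of nodes in gradient space."""
    diff = gradients[:, None, :] - gradients[None, :, :]
    return np.linalg.norm(diff, axis=2)

def nodal_coupling(sc_gradients, fc_gradients):
    """Per-node Spearman correlation between structural and functional distance vectors."""
    sc_dist = pairwise_distances(np.asarray(sc_gradients, dtype=float))
    fc_dist = pairwise_distances(np.asarray(fc_gradients, dtype=float))
    return np.array([spearman(sc_dist[i], fc_dist[i]) for i in range(len(sc_dist))])

rng = np.random.default_rng(1)
sc_grad = rng.standard_normal((200, 3))                   # hypothetical structural manifold
fc_grad = sc_grad + 0.1 * rng.standard_normal((200, 3))   # functional manifold, slightly perturbed

identical = nodal_coupling(sc_grad, sc_grad)
noisy = nodal_coupling(sc_grad, fc_grad)
print(identical.min(), noisy.mean())
```

      A node whose position relative to all other nodes is the same in both manifolds receives a coefficient of 1, and the coefficient falls as the two embeddings diverge.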

      We also agree with the reviewer’s recommendation to compare this to some of the more standard ways of calculating coupling. We compare our metric with 3 others (Baum et al., 2020; Park et al., 2022; Vázquez-Rodríguez et al., 2019), and find that all metrics capture the core developmental sensorimotor-to-association axis (Sydnor et al., 2021). Interestingly, manifold-based coupling measures captured this axis more strongly than non-manifold measures. We have updated the Results accordingly (Page 14, Line 638):

      “To evaluate our novel coupling metric, we compared its cortical spatial distribution to three others (Baum et al., 2020; Park et al., 2022; Vázquez-Rodríguez et al., 2019), using the group-level thresholded structural and functional connectomes from the referred CALM cohort. As shown in Figure 4c, our novel metric was moderately positively correlated to that of a multi-linear regression framework (r<sub>s</sub> = 0.494, p<sub>spin</sub> = 0.004; Vázquez-Rodríguez et al., 2019) and nodal correlations of streamline counts and functional connectivity (r<sub>s</sub> = 0.470, p<sub>spin</sub> = 0.005; Baum et al., 2020). As expected, our novel metric was strongly positively correlated to the manifold-derived coupling measure (r<sub>s</sub> = 0.661, p<sub>spin</sub> < 0.001; Park et al., 2022), more so than the first (Z(198) = 3.669, p < 0.001) and second measure (Z(198) = 4.012, p < 0.001). Structure-function coupling is thought to be patterned along a sensorimotor-association axis (Sydnor et al., 2021): all four metrics displayed weak-to-moderate alignment (Figure 4c). Interestingly, the manifold-based measures appeared most strongly aligned with the sensorimotor-association axis: the novel metric was more strongly aligned than the multi-linear regression framework (Z(198) = -11.564, p < 0.001) and the raw connectomic nodal correlation approach (Z(198) = -10.724, p < 0.001), but the previously-implemented structural manifold approach was more strongly aligned than the novel metric (Z(198) = -12.242, p < 0.001). This suggests that our novel metric exhibits the expected spatial distribution of structure-function coupling, and the manifold approach more accurately recapitulates the sensorimotor-association axis than approaches based on raw connectomic measures”.

      We also added the following to the legend of Figure 4 on page 15:

      “d. The inset Spearman correlation plot of the 4 coupling measures shows moderate-to-strong correlations (p<sub>spin</sub> < 0.005 for all spatial correlations). The accompanying lollypop plot shows the alignment between the sensorimotor-to-association axis and each of the 4 coupling measures, with the novel measure coloured in light purple (p<sub>spin</sub> < 0.007 for all spatial correlations)”. 

      Prediction vs. Association Analysis: The term “prediction” is used throughout the manuscript to describe what appear to be in-sample association tests. This terminology may be misleading, as prediction generally implies an out-of-sample evaluation where models trained on a subset of data are tested on a separate, unseen dataset. If the goal of the analyses is to assess associations rather than make true predictions, I recommend refraining from the term “prediction” and instead clarifying the nature of the analysis. Alternatively, if prediction is indeed the intended aim (which would be more compelling), I suggest conducting the evaluations using a k-fold cross-validation framework. This would involve training the Generalized Additive Mixed Models (GAMMs) on a portion of the data and testing their predictive accuracy on a held-out sample (i.e. different individuals). Additionally, the current design appears to focus on predicting SC-FC coupling using cognitive or pathological dimensions. This is contrary to the more conventional approach of predicting behavioural or pathological outcomes from brain markers like coupling. Could the authors clarify why this reverse direction of analysis was chosen? Understanding this choice is crucial, as it impacts the interpretation and potential implications of the findings.

      We have replaced “prediction” with “association” across the manuscript. However, for analyses corresponding to Figure 5, which we believe to be the most compelling, we conducted a stratified 5-fold cross-validation procedure, outlined below, repeated 100 times to account for random variation in the train-test splits. To assess whether prediction accuracy in the test splits was significantly greater than chance, we compared our results to those derived from a null dataset in which cognitive factor 2 scores had been permuted across participants. To account for the time-series element and block design of our data, in that some participants had 2 or more observations, we permuted entire participant blocks of cognitive factor 2 scores, keeping all other variables, including covariates, the same. Included in our manuscript are methodological details and results pertaining to this procedure. Specifically, the following has been added to the Results (Page 16, Line 758):

      “To examine the predictive value of the second cognitive factor for global and network-level structure-function coupling, operationalised as a Spearman rank correlation coefficient, we implemented a stratified 5-fold cross-validation framework, and compared predictive accuracy with that of a null data frame with cognitive factor 2 scores permuted across participant blocks (see ‘GAMM cross-validation’ in the Methods). This procedure was repeated 100 times to account for randomness in the train-test splits, using the same model specification as above. Therefore, for each of the 5 network partitions in which an interaction between the second cognitive factor and age was a significant predictor of structure-function coupling (global, visual, somato-motor, dorsal attention, and default-mode), we conducted a Welch’s independent-sample t-test to compare 500 empirical prediction accuracies with 500 null prediction accuracies. Across all 5 network partitions, predictive accuracy of coupling was significantly higher than that of models trained on permuted cognitive factor 2 scores (all p < 0.001). We observed the largest difference between empirical (M = 0.029, SD = 0.076) and null (M = -0.052, SD = 0.087) prediction accuracy in the somato-motor network [t (980.791) = 15.748, p < 0.001, Cohen’s d = 0.996], and the smallest difference between empirical (M = 0.080, SD = 0.082) and null (M = 0.047, SD = 0.081) prediction accuracy in the dorsal attention network [t (997.720) = 6.378, p < 0.001, Cohen’s d = 0.403]. To compare relative prediction accuracies, we ordered networks by descending mean accuracy and conducted a series of Welch’s independent sample t-tests, followed by FDR correction (Figure 5X). Prediction accuracy was highest in the default-mode network (M = 0.265, SD = 0.085), two-fold that of global coupling (t(992.824) = 25.777, p<sub>FDR</sub> = 5.457 x 10<sup>-112</sup>, Cohen’s d = 1.630, M = 0.131, SD = 0.079).
Global prediction accuracy was significantly higher than the visual network (t (992.644) = 9.273, p<sub>FDR</sub> = 1.462 x 10<sup>-19</sup>, Cohen’s d = 0.586, M = 0.083, SD = 0.085), but visual prediction accuracy was not significantly higher than within the dorsal attention network (t (997.064) = 0.554, p<sub>FDR</sub> = 0.580, Cohen’s d = 0.035, M = 0.080, SD = 0.082). Finally, prediction accuracy within the dorsal attention network was significantly stronger than that of the somato-motor network [t (991.566) = 10.158, p<sub>FDR</sub> = 7.879 x 10<sup>-23</sup>, Cohen’s d = 0.642, M = 0.029, SD = 0.076]. Together, this suggests that out-of-sample developmental predictive accuracy for structure-function coupling, using the second cognitive factor, is strongest in the higher-order default-mode network, and lowest in the lower-order somato-motor network”.
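
      The Welch statistics quoted above can be approximately recovered from the reported means and standard deviations. The following is a minimal Python sketch, not the analysis code used for the manuscript: the function name `welch_from_summary` is hypothetical, the group size of 500 per distribution follows the text, and Cohen's d is assumed here to use the pooled standard deviation. Because the inputs are rounded summary statistics, the outputs will only approximate the reported values.

```python
import math


def welch_from_summary(m1, sd1, n1, m2, sd2, n2):
    """Welch's t, Welch-Satterthwaite degrees of freedom, and Cohen's d
    computed from group means, standard deviations, and sample sizes."""
    se1_sq = sd1 ** 2 / n1          # squared standard error, group 1
    se2_sq = sd2 ** 2 / n2          # squared standard error, group 2
    t = (m1 - m2) / math.sqrt(se1_sq + se2_sq)
    df = (se1_sq + se2_sq) ** 2 / (
        se1_sq ** 2 / (n1 - 1) + se2_sq ** 2 / (n2 - 1)
    )
    # Assumption: Cohen's d standardised by the pooled standard deviation.
    pooled_sd = math.sqrt(
        ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    )
    d = (m1 - m2) / pooled_sd
    return t, df, d


# Somato-motor network: empirical versus null prediction accuracy.
t, df, d = welch_from_summary(0.029, 0.076, 500, -0.052, 0.087, 500)
print(f"t = {t:.3f}, df = {df:.3f}, d = {d:.3f}")
```

      Working from summary statistics is useful as a consistency check on reported values; the full analysis would instead operate on the 500 empirical and 500 null accuracies directly.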

      We have added a separate section for GAMM cross-validation in the Methods (Page 27, Line 1361):

      GAMM cross-validation

      “We implemented a 5-fold cross validation procedure, stratified by dataset (2 levels: CALM or NKI). All observations from any given participant were assigned to either the testing or training fold, to prevent data leakage, and the cross-validation procedure was repeated 100 times, to account for randomness in data splits. The outcome was predicted global or network-level structure-function coupling across all test splits, operationalised as the Spearman rank correlation coefficient. To assess whether prediction accuracy exceeded chance, we compared empirical prediction accuracy with that of GAMMs trained and tested on null data in which cognitive factor 2 scores were permuted across subjects. The number of observations formed 3 exchangeability blocks (N = 320 with one observation, N = 105 with two observations, and N = 33 with three observations), whereby scores from a participant with two observations were replaced by scores from another participant with two observations, with participant-level scores kept together, and so on for all numbers of observations. We compared empirical and null prediction accuracies using independent sample t-tests as, although the same participants were examined, the shuffling meant that the relative ordering of participants within both distributions was not preserved. For parallelisation and better stability when estimating models fit on permuted data, we used the bam function from the mgcv R package (Wood, 2017)”. 
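
      The two data-handling steps described above (keeping every observation from a participant in the same fold, and permuting scores only among participants with the same number of observations) can be sketched as follows. This is an illustrative Python sketch using NumPy only, not the authors' R implementation; the function names, the toy data, and the round-robin fold assignment are assumptions, and the GAMM fitting itself (done with `bam` from mgcv according to the text) is not reproduced.

```python
import numpy as np


def grouped_stratified_folds(participant, dataset, n_splits=5, seed=0):
    """Assign every observation of a participant to one fold,
    distributing participants evenly across folds within each dataset."""
    rng = np.random.default_rng(seed)
    participant = np.asarray(participant)
    dataset = np.asarray(dataset)
    fold = np.empty(len(participant), dtype=int)
    for ds in np.unique(dataset):
        ids = np.unique(participant[dataset == ds])
        rng.shuffle(ids)
        for k, pid in enumerate(ids):
            fold[participant == pid] = k % n_splits
    return fold


def permute_within_blocks(participant, scores, seed=0):
    """Swap score vectors between participants that have the same number
    of observations, keeping each participant's scores together."""
    rng = np.random.default_rng(seed)
    participant = np.asarray(participant)
    scores = np.asarray(scores, dtype=float)
    permuted = scores.copy()
    ids, counts = np.unique(participant, return_counts=True)
    for c in np.unique(counts):
        block = ids[counts == c]           # one exchangeability block
        donors = rng.permutation(block)
        for receiver, donor in zip(block, donors):
            permuted[participant == receiver] = scores[participant == donor]
    return permuted


# Toy data: two participants each with one, two, and three observations.
participant = np.array(list("abccddeeefff"))
dataset = np.array(["CALM", "NKI", "CALM", "CALM", "NKI", "NKI",
                    "CALM", "CALM", "CALM", "NKI", "NKI", "NKI"])
scores = np.arange(12, dtype=float)

fold = grouped_stratified_folds(participant, dataset, n_splits=2, seed=1)
null_scores = permute_within_blocks(participant, scores, seed=1)
```

      Permuting whole participant blocks preserves the within-participant dependence of repeated observations in the null data, which a simple row-wise shuffle would destroy.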

      We also added a justification for why we predicted coupling using behaviour or psychopathology, rather than vice versa (Page 27, Line 1349):

      “When using our GAMMs to test for the relationship between cognition and psychopathology and our coupling metrics, we opted to predict structure-function coupling using cognitive or psychopathological dimensions, rather than vice versa, to minimise multiple comparisons. In the current framework, we corrected for 8 multiple comparisons within each domain. This would have increased to 16 multiple comparison corrections for predicting two cognitive dimensions using network-level coupling, and 24 multiple comparison corrections for predicting three psychopathology dimensions. Incorporating multiple networks as predictors within the same regression framework introduces collinearity, whilst the behavioural dimensions were orthogonal: for example, coupling is strongly correlated between the somato-motor and ventral attention networks (r<sub>s</sub> = 0.721), between the default-mode and frontoparietal networks (r<sub>s</sub> = 0.670), and between the dorsal attention and fronto-parietal networks (r<sub>s</sub> = 0.650)”. 

      Finally, we noticed a rounding error in the ages of the data frame containing the structure-function coupling values and the cognitive/psychopathology dimensions. We rectified this and replaced the GAMM results, which largely remained the same. 

      In typical applications of diffusion map embedding, sparsification (e.g., retaining only the top 10% of the strongest connections) is often employed at the vertex-level resolution to ensure computational feasibility. However, since the present study performs the embedding at the level of 200 brain regions (a considerably coarser resolution), this step may not be necessary or justifiable. Specifically, for FC, it might be more appropriate to retain all positive connections rather than applying sparsification, which could inadvertently eliminate valuable information about lower-strength connections. Whereas for SC, as the values are strictly non-negative, retaining all connections should be feasible and would provide a more complete representation of the structural connectivity patterns. Given this, it would be helpful if the authors could clarify why they chose to include sparsification despite the coarser regional resolution, and whether they considered this alternative approach (using all available positive connections for FC and all non-zero values for SC). It would be interesting if the authors could provide their thoughts on whether the decision to run evaluations at the resolution of brain regions could itself impact the functional and structural manifolds, their alteration with age, and/or their stability (in contrast to Dong et al. which tested alterations in high-resolution gradients).

      This is another great point. We could retain all connections, but we usually implement some form of sparsification to reduce noise, particularly in the case of functional connectivity. But we nonetheless agree with the reviewer’s point. We should check what impact this is having on the analysis. In brief, we found minimal effects of thresholding, suggesting that the strongest connections are driving the gradient (Page 7, Line 304):

      “To assess the effect of sparsity on the derived gradients, we examined group-level structural (N = 222) and functional (N = 213) connectomes from the baseline session of NKI. The first three functional connectivity gradients derived using the full connectivity matrix (density = 92%) were highly consistent with those obtained from retaining the strongest 10% of connections in each row (r<sub>1</sub> = 0.999, r<sub>2</sub> = 0.998, r<sub>3</sub> < 0.999, all p < 0.001). Likewise, the first three communicability gradients derived from retaining all streamline counts (density = 83%) were almost identical to those obtained from 10% row-wise thresholding (r<sub>1</sub> = 0.994, r<sub>2</sub> = 0.963, r<sub>3</sub> = 0.955, all p < 0.001). This suggests that the reported gradients are driven by the strongest or most consistent connections within the connectomes, with minimal additional information provided by weaker connections. In terms of functional connectivity, such consistency reinforces past work demonstrating that the sensorimotor-to-association axis, the major axis within the principal functional connectivity gradient, emerges across both the top- and bottom-ranked functional connections (Nenning et al., 2023)”.
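
      For readers unfamiliar with row-wise thresholding, the operation compared above can be sketched in a few lines. This is a minimal Python illustration on random data, assuming a 200-region matrix and a hypothetical helper name `rowwise_threshold`; it is not the pipeline used in the manuscript and omits the subsequent affinity and embedding steps.

```python
import numpy as np


def rowwise_threshold(conn, keep=0.10):
    """Keep the strongest `keep` fraction of entries in each row and set
    the remaining entries to zero."""
    conn = np.asarray(conn, dtype=float)
    n_keep = max(1, int(round(keep * conn.shape[1])))
    sparse = np.zeros_like(conn)
    for i, row in enumerate(conn):
        strongest = np.argsort(row)[-n_keep:]   # indices of the largest values
        sparse[i, strongest] = row[strongest]
    return sparse


def density(conn):
    """Proportion of non-zero entries in the matrix."""
    return np.count_nonzero(conn) / conn.size


rng = np.random.default_rng(0)
full = rng.random((200, 200))            # stand-in for a 200-region connectome
sparse = rowwise_threshold(full, keep=0.10)
print(density(full), density(sparse))
```

      Note that thresholding each row independently generally yields an asymmetric matrix, so any downstream step that requires symmetry has to restore it separately.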

      Furthermore, we appreciate the nudge to share our thoughts on whether the difference between vertex versus nodal metrics could be important here, particularly regarding thresholds. To combine this point with R2’s recommendation to expand the Discussion, we have added the following paragraph (Page 19, Line 861): 

      “We consider the role of thresholding, cortical resolution, and head motion as avenues to reconcile the present results with select reports in the literature (Dong et al., 2021; Xia et al., 2022). We would suggest that thresholding has a greater effect on vertex-level data, rather than parcel-level. For example, a recent study revealed that the emergence of principal vertex-level functional connectivity gradients in childhood and adolescence is indeed threshold-dependent (Dong et al., 2024). Specifically, the characteristic unimodal organisation for children and transmodal organisation for adolescents only emerged at the 90% threshold: a 95% threshold produced a unimodal organisation in both groups, whilst an 85% threshold produced a transmodal organisation in both groups. Put simply, the ‘swapping’ of gradient orders only occurs at certain thresholds. Furthermore, our results are not necessarily contradictory to this prior report (Dong et al., 2021): developmental changes in high-resolution gradients may be supported by a stable low-dimensional coarse manifold. Indeed, our decision to use parcellated connectomes was partly driven by recent work which demonstrated that vertex-level functional gradients may be derived using biologically-plausible but random data with sufficient spatial smoothing, whilst this effect is minimal at coarser resolutions (Watson & Andrews, 2023). We observed a gradual increase in the variance of individual connectomes accounted for by the principal functional connectivity gradient in the referred subset of CALM, in line with prior vertex-level work demonstrating a gradual emergence of the sensorimotor-association axis as the principal axis of connectivity (Xia et al., 2022), as opposed to a sudden shift. It is also possible that vertex-level data is more prone to motion artefacts in the context of developmental work.
Transitioning from vertex-level to parcel-level data involves smoothing over short-range connectivity, thus greater variability in short-range connectivity can be observed in vertex-level data. However, motion artefacts are known to increase short-range connectivity and decrease long-range connectivity, mimicking developmental changes (Satterthwaite et al., 2013). Thus, whilst vertex-level data offers greater spatial resolution in representation of short-range connectivity relative to parcel-level data, it is possible that this may come at the cost of making our estimates of the gradients more prone to motion”.

      Evaluating the consistency of gradients across development: the results shown in Figure 1e are used as evidence suggesting that gradients are consistent across ages. However, I believe additional analyses are required to identify potential sources of the observed inconsistency compared to previous works. The claim that the principal gradient explains a similar degree of variance across ages does not necessarily imply that the spatial structure remains the same. The observed variance explanation is hence not enough to ascertain inconsistency with findings from Dong et al., as the spatial configuration of gradients may still change over time. I suggest the following additional analyses to strengthen this claim. Alignment to group-level gradients: Assess how much of the variance in individual FC matrices is explained by each of the group-level gradients (G1, G2, and G3, for both FC and SC). This analysis could be visualized similarly to Figure 1e, with age on the x-axis and variance explained on the y-axis. If the explained variance varies as a function of age, it may indicate that the gradients are not as consistent as currently suggested. 

      This is another great suggestion. In the additional analyses above (new group-level analyses and unrotated gradient analyses) we rule out a couple of the potential causes of the different developmental trends we observe in our data, namely the stability of the gradients over time. The suggested additional analysis is a great idea, and we have implemented it as follows (Page 8, Line 363):

      “To evaluate the consistency of gradients across development, across baseline participants with functional connectomes from the referred CALM cohort (N = 177), we calculated the proportion of variance in individual-level connectomes accounted for by group-level functional gradients. Specifically, we calculated the proportion of variance in an adjacency matrix A accounted for by the vector v<sub>i</sub> as the fraction of the square of the scalar projection of v<sub>i</sub> onto A, over the Frobenius norm of A. Using a generalised linear model, we then tested whether the proportion of variance explained varies systematically with age, controlling for sex and head motion. The variance in individual-level functional connectomes accounted for by the group-level principal functional gradient gradually increased with development (β = 0.111, 95% CI = [0.022, 0.199], p = 1.452 x 10<sup>-2</sup>, Cohen’s d = 0.367), as shown in Figure 1g, and decreased with higher head motion (β = -10.041, 95% CI = [-12.379, -7.702], p = 3.900 x 10<sup>-17</sup>), with no effect of sex (β = 0.071, 95% CI = [-0.380, 0.523], p = 0.757). We observed no developmental effects on the variance explained by the second (r<sub>s</sub> = 0.112, p = 0.139) or third (r<sub>s</sub> = 0.053, p = 0.482) group-level functional gradient. When repeated with the baseline functional connectivity for NKI (N = 213), we observed no developmental effects (β = 0.097, 95% CI = [-0.035, 0.228], p = 0.150) on the variance explained by the principal functional gradient after accounting for motion (β = -3.376, 95% CI = [-8.281, 1.528], p = 0.177) and sex (β = -0.368, 95% CI = [-1.078, 0.342], p = 0.309). However, we observed significant developmental correlations between age and variance (r<sub>s</sub> = 0.137, p = 0.046) explained before accounting for head motion and sex.
We observed no developmental effects on the variance explained by the second functional gradient (r<sub>s</sub> = -0.066, p = 0.338), but a weak negative developmental effect on the variance explained by the third functional gradient (r<sub>s</sub> = -0.189, p = 0.006). Note, however, the magnitude of the variance accounted for by the third functional gradient was very small (all < 1%). When applied to communicability matrices in CALM, the proportion of variance accounted for by the group-level communicability gradient was negligible (all < 1%), precluding analysis of developmental change”. 
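
      One possible reading of the variance-explained definition above is sketched below in Python. It rests on assumptions that the quoted text does not state explicitly: the gradient vector is normalised to unit length, the "scalar projection onto A" is taken as the quadratic form vᵀAv, and the denominator is the squared Frobenius norm. The function name `variance_explained` is hypothetical, and this is not the authors' code.

```python
import numpy as np


def variance_explained(A, v):
    """Fraction of the squared Frobenius norm of A captured by the
    rank-one pattern defined by the vector v (assumed reading)."""
    A = np.asarray(A, dtype=float)
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)            # assumption: unit-length gradient
    projection = v @ A @ v               # scalar projection of v onto A
    return projection ** 2 / np.linalg.norm(A, "fro") ** 2


# Sanity check on a random symmetric matrix: when the vectors are the
# eigenvectors of A itself, the fractions partition the total variance.
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = (M + M.T) / 2
eigenvalues, eigenvectors = np.linalg.eigh(A)
fractions = np.array(
    [variance_explained(A, eigenvectors[:, i]) for i in range(6)]
)
print(fractions.sum())
```

      Under this reading, a group-level gradient that matches an individual's own leading eigenvector would score close to that eigenvector's share of the total, while a poorly aligned gradient would score near zero.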

      “To further probe the consistency of gradients across development, we examined developmental changes in the standard deviation of gradient values, corresponding to heterogeneity, following prior work examining morphological (He et al., 2025) and functional connectivity gradients (Xia et al., 2022). Using a series of generalised linear models within the baseline referred subset of CALM, correcting for head motion and sex, we found that gradient variation for the principal functional gradient increased across development (β = 0.219, 95% CI = [0.091, 0.347], p = 0.001, Cohen’s d = 0.504), indicating greater heterogeneity (Figure 1h), whilst gradient variation for the principal communicability gradient decreased across development (β = -0.154, 95% CI = [-0.267, -0.040], p = 0.008, Cohen’s d = -0.301), indicating greater homogeneity (Figure 1h). Note, a paired t-test on the 173 common participants demonstrated a significant effect of modality on gradient variability (t(172) = -56.639, p = 3.663 x 10<sup>-113</sup>), such that the mean variability of communicability gradients (M = 0.033, SD = 0.001) was less than half that of functional connectivity (M = 0.076, SD = 0.010). Together, this suggests that principal functional connectivity and communicability gradients are established early in childhood and display age-related refinement, but not replacement”.

      The Issue of Abstraction and Benefits of the Gradient-Based View: The manuscript interprets the eccentricity findings as reflecting changes along the segregation-integration spectrum. Given this, it is unclear why a more straightforward analysis using established graph-theory metrics of segregation-integration was not pursued instead. Mapping gradients and computing eccentricity adds layers of abstraction and complexity. If similar interpretations can be derived directly from simpler graph metrics, what additional insights does the gradient-based framework offer? While the manuscript argues that this approach provides “a more unifying account of cortical reorganization”, it is not evident why this abstraction is necessary or advantageous over traditional graph metrics. Clarifying these benefits would strengthen the rationale for using this method.

      This is a great point, and something we spent quite a bit of time considering when designing the analysis. The central goal of our project was to identify gradients of brain organisation across different datasets and modalities and then test how the organisational principles of those modalities align. In other words, how do structural and functional ‘spaces’ intersect, and does this vary across the cortex? That for us was the primary motivation for operationalising organisation as nodal location within a low-dimensional manifold space (Bethlehem et al., 2020; Gale et al., 2022; Park et al., 2021), using a simple composite measure to achieve compression, rather than as a series of graph metrics. The reason we subsequently calculated those graph metrics and tested for their association was simply to help us interpret what eccentricity within that low-dimensional space means. Manifold eccentricity was moderately positively correlated to graph-theory metrics of integration, leaving a substantial portion of variance unaccounted for, but that association we think is nonetheless helpful for readers trying to interpret eccentricity. However, since ME tells us about the relative position of a node in that low-dimensional space, it is also likely capturing elements of multiple graph theory measures. Following the Reviewer’s question, this is something we decided to test. Specifically, using 4 measures of segregation, including two new metrics requested by the Reviewer in a minor point (weighted clustering coefficient and normalized degree centrality), we conducted a dominance analysis (Budescu, 1993) with normalized manifold eccentricity of the group-level referred CALM structural connectome. We also detail the use of gradient measures in developmental contexts, and how they can be complementary to traditional graph theory metrics.

      We have added the following to the Results section (Page 10, Lines 472 onwards): 

      “To further contextualise manifold eccentricity in terms of integration and segregation beyond simple correlations, we conducted a multivariate dominance analysis (Budescu, 1993) of four graph theory metrics of segregation as predictors of nodal normalized manifold eccentricity within the group-level referred CALM structural and functional connectomes (Figure 2c). A dominance analysis assesses the relative importance of each predictor in a multilinear regression framework by fitting 2<sup>n</sup> – 1 models (where n is the number of predictors) and calculating the relative increase in adjusted R<sup>2</sup> caused by adding each predictor to the model across both main effects and interactions. A multilinear regression model including weighted clustering coefficient, within-module degree Z-score, participation coefficient and normalized degree centrality accounted for 59% of variance in nodal manifold eccentricity in the group-level CALM structural connectome. Within-module degree Z-score was the most important predictor (40.31% dominance), almost twice that of the participation coefficient (24.03% dominance) and normalized degree centrality (24.05% dominance) which made roughly equal contributions. The least important predictor was the weighted clustering coefficient (11.62% dominance). When the same approach was applied for the group-level referred CALM functional connectome, the 4 predictors accounted for 52% variability. However, in contrast to the structural connectome, functional manifold eccentricity seemed to incorporate the same graph theory metrics in different proportions. Normalized degree centrality was the most important predictor (47.41% dominance), followed by within-module degree Z-score (24.27%), and then the participation coefficient (15.57%) and weighted clustering coefficient (12.76%) which made approximately equal contributions.
Thus, whilst structural manifold eccentricity was dominated most by within-module degree Z-score and least by the weighted clustering coefficient, functional manifold eccentricity was dominated most by normalized degree centrality and least by the weighted clustering coefficient. This suggests that manifold mapping techniques incorporate different aspects of integration dependent on modality. Together, manifold eccentricity acts as a composite measure of segregation, being differentially sensitive to different aspects of segregation, without necessitating a priori specification of graph theory metrics. Further discussion of the value of gradient-based metrics in developmental contexts and as a supplement to traditional graph theory analyses is provided in the ‘Manifold Eccentricity’ methodology sub-section”. 
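
      The logic of a dominance analysis can be illustrated with a short Python sketch of general dominance over all 2<sup>n</sup> – 1 non-empty predictor subsets. This is a simplified illustration on simulated data, not the authors' implementation: it considers main effects only, uses ordinary least squares with adjusted R<sup>2</sup>, treats the empty model as having a fit of zero, and the function names are hypothetical.

```python
import itertools

import numpy as np


def adjusted_r2(X, y):
    """Adjusted R-squared of an OLS fit with intercept (0 for no predictors)."""
    n, p = X.shape
    if p == 0:
        return 0.0
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def general_dominance(X, y):
    """Average gain in adjusted R-squared from adding each predictor,
    averaged within each subset size and then across subset sizes."""
    n_pred = X.shape[1]
    fit = {}
    for size in range(n_pred + 1):
        for subset in itertools.combinations(range(n_pred), size):
            fit[subset] = adjusted_r2(X[:, list(subset)], y)
    dominance = np.zeros(n_pred)
    for j in range(n_pred):
        others = [k for k in range(n_pred) if k != j]
        for size in range(n_pred):
            gains = [
                fit[tuple(sorted(s + (j,)))] - fit[s]
                for s in itertools.combinations(others, size)
            ]
            dominance[j] += np.mean(gains) / n_pred
    return dominance, fit[tuple(range(n_pred))]


# Simulated example with four predictors of decreasing importance.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.standard_normal(200)

dominance, full_fit = general_dominance(X, y)
percent = 100 * dominance / dominance.sum()
print(np.round(percent, 2))
```

      Because the general dominance values sum to the fit of the full model, expressing them as percentages (as in the text above) gives each predictor's share of the explained variance.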

      We added further justification to the manifold eccentricity Methods subsection (Page 26, line 1283):

      “Gradient-based measures hold value in developmental contexts, above and beyond traditional graph theory metrics: within a sample of over 600 cognitively-healthy adults aged between 18 and 88 years old, sensitivity of gradient-based within-network functional dispersion to age was stronger and more consistent across networks compared to segregation (Bethlehem et al., 2020). In the context of microstructural profile covariance, modules resolved by Louvain community detection occupied distinct positions across the principal two gradients, suggesting that gradients offer a way to meaningfully order discrete graph theory analyses (Paquola et al., 2019)”.

      We added the following to the Introduction section outlining the application of gradients as cortex-wide coordinate systems (Page 3, Line 121):

      “Using the gradient-based approach as a compression tool, thus forgoing the need to specify singular graph theory metrics a priori, we operationalised individual variability in low-dimensional manifolds as eccentricity (Gale et al., 2022; Park et al., 2021). Crucially, such gradients appear to be useful predictors of phenotypic variation, exceeding edge-level connectomics. For example, in the case of functional connectivity gradients, their predictive ability for externalizing symptoms and general cognition in neurotypical adults surpassed that of edge-level connectome-based predictive modelling (Hong et al., 2020), suggesting that capturing low-dimensional manifolds may be particularly powerful biomarkers of psychopathology and cognition”.

      We also added the following to the Discussion section (Page 18, Line 839):

      “By capitalising on manifold eccentricity as a composite measure of segregation across development, we build upon an emerging literature pioneering gradients as a method to establish underlying principles of structural (Paquola et al., 2020; Park et al., 2021) and functional (Dong et al., 2021; Margulies et al., 2016; Xia et al., 2022) brain development without a priori specification of specific graph theory metrics of interest”. 

      It is unclear whether the statistical tests finding significant dataset effects are capturing effects of neurotypical vs. neurodivergent, or simply different scanners/sites. Could the neurotypical portion of CALM also be added to distinguish between these two sources of variability affecting dataset effects (i.e. ideally separating this to the effect of site vs. neurotypicality would better distinguish the effect of neurodivergence).

      At a group-level, differences in the gradients between the two cohorts are very minor. Indeed, in the manuscript we describe these gradients as being seemingly ‘universal’. But we agree that we should test whether any simple main effects of ‘dataset’ result from the different site or from the phenotype of the participants. The neurotypical portion of CALM (collected at the same site on the same scanner) helped us show that any minor differences in the gradient alignments are likely due to the site/scanner differences rather than the phenotype of the participants. We took the same approach for testing the simple main effects of dataset on manifold eccentricity. To better parse neurotypicality and site effects at an individual-level, we conducted a series of sensitivity analyses. First, in response to the reviewer’s earlier comment, we conducted a series of nodal generalized linear models for communicability and FC gradients derived from neurotypical and neurodivergent portions of CALM, alongside NKI, and tested for an effect of neurotypicality above and beyond scanner. As at the group level, having those additional scans on a ‘comparison’ sample for CALM is very helpful in teasing apart these effects. We find that neurotypicality affects communicability gradient expression to a greater degree than functional connectivity. We visualised these results and added them to Figure 1. Second, we used the same approach but for manifold eccentricity. Again, we demonstrate greater sensitivity of neurotypicality to communicability at a global-level, but we cannot pin these effects down to specific networks because the effects do not survive the necessary multiple comparison correction. We have added these analyses to the manuscript (Page 13, Line 583):

      “Much as with the gradients themselves, we suspected that much of the simple main effect of dataset could reflect the scanner / site, rather than the difference in phenotype. Again, we drew upon the CALM comparison children to help us disentangle these two explanations. As a sensitivity analysis to parse effects of neurotypicality and dataset on manifold eccentricity, we conducted a series of generalized linear models predicting mean global and network-level manifold eccentricity, for each modality. We did this across all the baseline data (i.e. including the neurotypical comparison sample for CALM) using neurotypicality (2 levels: neurodivergent or neurotypical), site (2 levels: CALM or NKI), sex, head motion, and age at scan (Figure 3X). We restricted our analysis to baseline scans to create more equally-balanced groups. In terms of structural manifold eccentricity (N = 313 neurotypical, N = 311 neurodivergent), we observed higher manifold eccentricity in the neurodivergent participants at a global level (β = 0.090, p = 0.019, Cohen’s d = 0.188) but the individual network level effects did not survive the multiple comparison correction necessary for looking across all seven networks, with the default-mode network being the strongest (β = 0.135, p = 0.027, p<sub>FDR</sub> = 0.109, Cohen’s d = 0.177). There was no significant effect of neurodiversity on functional manifold eccentricity (N = 292 neurotypical and N = 177 neurodivergent). This suggests that neurodiversity is significantly associated with structural manifold eccentricity, over and above differences in site, but we cannot distinguish these effects reliably in the functional manifold data”. 

      Third, we removed the Scheirer-Ray-Hare test from the results for two reasons. First, its initial implementation did not account for repeated measures, and therefore non-independence between observations, as the same participants may have contributed both structural and functional data. Second, if we wanted to repeat this analysis in CALM using the referred and control portions, a significant difference in group size existed, which may affect the measures of variability. Specifically, for baseline CALM, 311 referred and 91 control participants contributed SC data, whilst 177 referred and 79 control participants contributed FC data. We believe that the ‘cleanest’ parsing of dataset and site for effects of eccentricity is achieved using the GLMs in Figure 3. 

      We observed no significant effect of neurodivergence on the magnitude of structure-function coupling across development, and have added the following text (Page 14, Line 632):

      “To parse effects of neurotypicality and dataset on structure-function coupling, we conducted a series of generalized linear models predicting mean global and network-level coupling using neurotypicality, site, sex, head motion, and age at scan, at baseline (N = 77 CALM neurotypical, N = 173 CALM neurodivergent, and N = 170 NKI). However, we found no significant effects of neurotypicality on structure-function coupling across development”. 

      Since we demonstrated no significant effects of neurotypicality on structure-function coupling magnitude across development, but found differential dataset-specific effects of age on coupling development, we added the following sentence at the end of the coupling trajectory results sub-section (Page 14, line 664):

      “Together, these effects demonstrate that whilst the magnitude of structure-function coupling appears not to be sensitive to neurodevelopmental phenotype, its development with age is, particularly in higher-order association networks, with developmental change being reduced in the neurodivergent sample”.  

      Figure 1.c: A non-parametric permutation test (e.g. Mann-Whitney U test) could quantitatively identify regions with significant group differences in nodal gradient values, providing additional support for the qualitative findings.

      This is a great idea. To examine the effect of referral status on nodal gradient values, whilst controlling for covariates (head motion and sex), we conducted a series of generalised linear models. We opted for this instead of a Mann-Whitney U test, as the former tests for differences in distributions, whilst the direction of the t-statistic for referral status from the GLM would allow us to specify the magnitude and direction of differences in nodal gradient values between the two groups. Again, we conducted this in CALM (referred vs control), at an individual-level, as downstream analyses suggested a main effect of dataset (which is reflected in the highly-similar group-level referred and control CALM gradients). We have updated the Results section with the following text (Page 6, Line 283):

      “To examine the effect of referral status on participant-level nodal gradient values in CALM, we conducted a series of generalized linear models controlling for head motion, sex and age at scan (Figure 1d). We restricted our analyses to baseline scans to reduce the difference in sample size for the referred (311 communicability and 177 functional gradients, respectively) and control participants (91 communicability and 79 functional gradients, respectively), and to the principal gradients. For communicability, 42 regions showed a significant effect (p < 0.05) of neurodivergence before FDR correction, with 9 post FDR correction. 8 of these 9 regions had negative t-statistics, suggesting a reduced nodal gradient value and representation in the neurodivergent children, encompassing both lower-order somatosensory cortices alongside higher-order fronto-parietal and default-mode networks. The largest reductions were observed within the prefrontal cortices of the default-mode network (t = -3.992, p = 6.600 x 10<sup>-5</sup>, p<sub>FDR</sub> = 0.013, Cohen’s d = -0.476), the left orbitofrontal cortex of the limbic network (t = -3.710, p = 2.070 x 10<sup>-4</sup>, p<sub>FDR</sub> = 0.020, Cohen’s d = -0.442) and right somato-motor cortex (t = -3.612, p = 3.040 x 10<sup>-4</sup>, p<sub>FDR</sub> = 0.020, Cohen’s d = -0.431). The right visual cortex was the only exception, with stronger gradient representation within the neurotypical cohort (t = 3.071, p = 0.002, p<sub>FDR</sub> = 0.048, Cohen’s d = 0.366). For functional connectivity, comparatively fewer regions exhibited a significant effect (p < 0.05) of neurotypicality, with 34 regions prior to FDR correction and 1 post. Significantly stronger gradient representation was observed in neurotypical children within the right precentral ventral division of the default-mode network (t = 3.930, p = 8.500 x 10<sup>-5</sup>, p<sub>FDR</sub> = 0.017, Cohen’s d = 0.532).
Together, this suggests that the strongest and most robust effects of neurodivergence are observed within gradients of communicability, rather than functional connectivity, where alterations in both affect higher-order associative regions”. 
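
The region-wise FDR step described above can be sketched as follows. This is a minimal illustration only: the response does not name the FDR procedure, so the Benjamini-Hochberg step-up method is an assumption here, and the p-values are hypothetical rather than taken from the manuscript.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: the FDR correction in the text is Benjamini-Hochberg;
    the source does not state which procedure was used.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical uncorrected p-values for five regions.
p_raw = [0.001, 0.008, 0.039, 0.041, 0.60]
p_fdr = benjamini_hochberg(p_raw)
n_significant = sum(p < 0.05 for p in p_fdr)
```

In this toy case four regions pass the uncorrected threshold but only two survive correction, the same kind of shrinkage as the 42-to-9 and 34-to-1 reductions reported above.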

      In the harmonization methodology, it is mentioned that “if harmonisation was successful, we’d expect any significant effects of scanner type before harmonisation to be non-significant after harmonisation”. However, given that there were no significant effects before harmonization, the results reported do not help in evaluating the quality of harmonization.

      We agree with the Reviewer. We have removed the post-harmonisation GLMs and instead state that there were no significant effects of scanner type before harmonisation.

      Figure 3: It would be helpful to include a plot showing the GAMM predictions versus real observations of eccentricity (x-axis: predictions, y-axis: actual values). 

      To plot the GAMM-predicted smooth effects of age, which we used for visualisation purposes only, we used the get_predictions function from the itsadug R package. This creates model predictions using the median value of nuisance covariates. Thus, whilst we specified the entire age range, the function automatically uses the median head motion and the default level of sex (male) for each dataset-specific trajectory. Since the gamm4 package separates the fitted model into a gam and a linear mixed effects model (which accounts for participant ID as a random effect), and the get_predictions function only uses the gam, random effects are not modelled in the predicted smooths. Therefore, any discrepancy between the observed and predicted manifold eccentricity values is likely due to sensitivity to the default choices of covariates other than age, or to random effects. To prevent Figure 3 from becoming overcrowded, we opted not to include the predictions: these were strongly correlated with the real structural manifold data, but less so with the functional manifold data, especially where significant developmental change was absent.

      The 30mm threshold for filtering short streamlines in tractography is uncommon. What is the rationale for using such a large threshold, given the potential exclusion of many short-range association fibres?

      A minimum length of 30mm was the default for the MRtrix3 reconstruction workflow, and something we have previously used. In a previous project, we systematically varied the minimum fibre length and found that this had minimal impact on network organisation (e.g. Mousley et al. 2025). However, we accept that short-range association fibres may have been excluded and have included this in the Discussion as a methodological limitation, alongside our predictions for how the gradients and structure-function coupling may’ve been altered had we included such fibres (Page 20, Line 955):

      “A potential methodological limitation in the construction of structural connectomes was the 30mm tract length threshold which, despite being the QSIprep reconstruction default (Cieslak et al., 2021), may have potentially excluded short-range association fibres. This is pertinent as tracts of different lengths exhibit unique distributions across the cortex and functional roles (Bajada et al., 2019) : short-range connections occur throughout the cortex but peak within primary areas, including the primary visual, somato-motor, auditory, and para-hippocampal cortices, and are thought to dominate lower-order sensorimotor functional resting-state networks, whilst long-range connections are most abundant in tertiary association areas and are recruited alongside tracts of varying lengths within higher-order functional resting-state networks. Therefore, inclusion of short-range association fibres may have resulted in a relative increase in representation of lower-order primary areas and functional networks. On the other hand, we also note the potential misinterpretation of short-range fibres: they may be unreliably distinguished from null models in which tractography is restricted by cortical gyri only (Bajada et al., 2019). Further, prior (neonatal) work has demonstrated that the order of connectivity of regions and topological fingerprints are consistent across varying streamline thresholds (Mousley et al., 2025), suggesting minimal impact”. 

      Given the spatial smoothing of fMRI data (6mm FWHM), it would be beneficial to apply connectome spatial smoothing to structural connectivity measures for consistent spatial smoothness.

      This is an interesting suggestion, but given that we are looking at structural communicability within a parcellated network, we are not sure that it would make any difference. The structural data are already very smooth. Nonetheless, we have added the following text to the Discussion (Page 20, Line 968):

      “Given the spatial smoothing applied to the functional connectivity data, and examining its correspondence to streamline-count connectomes through structure-function coupling, applying the equivalent smoothing to structural connectomes may improve the reliability of inference, and subsequent sensitivity to cognition and psychopathology. Connectome spatial smoothing involves applying a smoothing kernel to the two streamline endpoints, whereby variations in smoothing kernels are selected to optimise the trade-off between subject-level reliability and identifiability, thus increasing the signal-to-noise ratio and the reliability of statistical inferences of brain-behaviour relationships (Mansour et al., 2022). However, we note that such smoothing is more effective for high-resolution connectomes than for parcel-level connectomes, and so would likely have made only a modest improvement here (Mansour et al., 2022)”.

      Why was harmonization performed only within the CALM dataset and not across both CALM and NKI datasets? What was the rationale for this decision?

      We thought about this very carefully. Harmonization aims to remove scanner or site effects, whilst retaining the crucial characteristics of interest. Our capacity to retain those characteristics is entirely dependent on them being *fully* captured by covariates, which are then incorporated into the harmonization process. Even with the best set of measures, the idea that we can fully capture ‘neurodivergence’ and thus preserve it in the harmonisation process is dubious. Indeed, across CALM and NKI there are a limited number of common measures (i.e. not the best set of common measures), and thus we are limited in our ability to fully capture neurodivergence with covariates. So, we worried that if we put these two very different datasets into the harmonisation process, we would essentially eliminate the interesting differences between the datasets. We have added this text to the harmonization section of the Methods (Page 24, Line 1225):

      “Harmonization aims to retain key characteristics of interest whilst removing scanner or site effects. However, the site effects in the current study are confounded with neurodivergence, and it is unlikely that neurodivergence can be captured fully using common covariates across CALM and NKI. Therefore, to preserve variation in neurodivergence, whilst reducing scanner effects, we harmonized within the CALM dataset only”.

      The exclusion of subcortical areas from connectivity analyses is not justified. 

      This is a good point. We used the Schaefer atlas because we had previously used this to derive both functional and structural connectomes, but we agree that it would have been good to include subcortical areas. We now note this as a limitation in the Discussion (Page 20, Line 977):

      “A potential limitation of our study was the exclusion of subcortical regions. However, prior work has shed light on the role of subcortical connectivity in structural and functional gradients, respectively, of neurotypical populations of children and adolescents (Park et al., 2021; Xia et al., 2022). For example, in the context of the primary-to-transmodal and sensorimotor-to-visual functional connectivity gradients, the mean gradient scores within subcortical networks were demonstrated to be relatively stable across childhood and adolescence (Xia et al., 2022). In the context of structural connectivity gradients derived from streamline counts, which we demonstrated were highly consistent with those derived from communicability, subcortical structural manifolds weighted by their cortical connectivity were anchored by the caudate and thalamus at one pole, and by the hippocampus and nucleus accumbens at the opposite pole, with significant age-related manifold expansion within the caudate and thalamus (Park et al., 2021)”. 

      In the KNN imputation method, were uniform weights used, or was an inverse distance weighting applied?

      Uniform weights were used, and we have updated the manuscript appropriately.

      The manuscript should clarify from the outset that the reported sample size (N) includes multiple longitudinal observations from the same individuals and does not reflect the number of unique participants.

      We have rectified the Abstract (Page 2, Line 64) and Introduction (Page 3, Line 138):

      “We charted the organisational variability of structural (610 participants, N = 390 with one observation, N = 163 with two observations, and N = 57 with three) and functional (512 participants, N = 340 with one observation, N = 128 with two observations, and N = 44 with three)”.

      The term “structural gradients” is ambiguous in the introduction. Clarify that these gradients were computed from structural and functional connectivity matrices, not from other structural features (e.g. cortical thickness).

      We have clarified this in the Introduction (Page 3, Line 134):

      “Applying diffusion-map embedding as an unsupervised machine-learning technique onto matrices of communicability (from streamline SIFT2-weighted fibre bundle capacity) and functional connectivity, we derived gradients of structural and functional brain organisation in children and adolescents…”

      Page 5: The sentence, “we calculated the normalized angle of each structural and functional connectome to derive symmetric affinity matrices” is unclear and needs clarification.

      We have clarified this within the second paragraph of the Results section (Page 4, Line 185):

      “To capture inter-nodal similarity in connectivity, using a normalised angle kernel, we derived individual symmetric affinity matrices from the left and right hemispheres of each communicability and functional connectivity matrix. Varying kernels capture different but highly-related aspects of inter-nodal similarity, such as correlation coefficients, Gaussian kernels, and cosine similarity. Diffusion-map embedding is then applied on the affinity matrices to derive gradients of cortical organisation”. 
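
The affinity step described above can be sketched in a few lines. This is an illustrative sketch only, assuming the normalised angle kernel takes the commonly used form 1 - arccos(cosine similarity) / π; the function name and the toy connectivity profiles are hypothetical, and the code is not the authors' pipeline.

```python
import math


def normalized_angle_affinity(rows):
    """Pairwise affinity between row profiles: 1 - arccos(cosine similarity) / pi.

    Assumes no row is all zeros (otherwise the cosine is undefined).
    """
    norms = [math.sqrt(sum(v * v for v in r)) for r in rows]
    n = len(rows)
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            dot = sum(a * b for a, b in zip(rows[i], rows[j]))
            cos = dot / (norms[i] * norms[j])
            cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
            value = 1.0 - math.acos(cos) / math.pi
            affinity[i][j] = affinity[j][i] = value  # symmetric by construction
    return affinity


# Hypothetical connectivity profiles for three regions.
toy = [[1.0, 0.0, 0.5], [0.9, 0.1, 0.4], [0.0, 1.0, 0.2]]
A = normalized_angle_affinity(toy)
```

Regions with similar profiles (the first two rows) receive an affinity close to 1, and the resulting matrix is symmetric, which is what diffusion-map embedding expects as input.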

      Figure 1.a: “Affine A” likely refers to the affinity matrix. The term “affine” may be confusing; consider using a clearer label. It would also help to add descriptive labels for rows and columns (e.g. region x region).

      Thank you for this suggestion! We have replaced each of the labels with “pairwise similarity”. We also labelled the rows and columns as regions.

      Figure 1.d: Are the cross-group differences statistically significant? If so, please indicate this in the figure.

      We have added the results of a series of linear mixed effects models to the legend of Figure 1 (Page 6, line 252):

      “indicates a significant effect of dataset (p < 0.05) on variance explained within a linear mixed effects model controlling for head motion, sex, and age at scan”.

      The sentence “whose connectomes were successfully thresholded” in the methods is unclear. What does “successfully thresholded” mean? Additionally, this seems to be the first mention of the Schaefer 100 and Brainnetome atlas; clarify where these parcellations are used. 

      We have amended the Methodology section (Page 23, Line 1138):

      “For each participant, we retained the strongest 10% of connections per row, thus creating fully connected networks required for building affinity matrices. We excluded any connectomes in which such thresholding was not possible due to insufficient non-zero row values. To further ensure accuracy in connectome reconstruction, we excluded any participants whose connectomes failed thresholding in two alternative parcellations: the 100-node Schaefer 7-network (Schaefer et al., 2018) and Brainnetome 246-node (Fan et al., 2016) parcellations, respectively”.
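
The row-wise thresholding rule and its exclusion criterion can be illustrated as below. This is a minimal sketch under stated assumptions: the function name is hypothetical, the per-row quota is obtained by rounding, ties are broken by sort order, and the toy matrices are not real connectomes.

```python
def threshold_rows(matrix, keep_fraction=0.10):
    """Keep the strongest `keep_fraction` of entries in each row and zero the rest.

    Returns None when any row has too few non-zero entries to fill its quota,
    mirroring the exclusion rule described in the text. Rounding the quota is
    an assumption of this sketch.
    """
    n_keep = max(1, round(keep_fraction * len(matrix[0])))
    result = []
    for row in matrix:
        nonzero = [j for j, v in enumerate(row) if v != 0]
        if len(nonzero) < n_keep:
            return None  # thresholding not possible for this connectome
        ranked = sorted(nonzero, key=lambda j: row[j], reverse=True)
        strongest = set(ranked[:n_keep])
        result.append([v if j in strongest else 0.0 for j, v in enumerate(row)])
    return result


# Hypothetical 4-node connectome; keep the top 50% per row for readability.
toy = [
    [0.0, 3.0, 1.0, 2.0],
    [3.0, 0.0, 5.0, 0.5],
    [1.0, 5.0, 0.0, 4.0],
    [2.0, 0.5, 4.0, 0.0],
]
kept = threshold_rows(toy, keep_fraction=0.5)

# A connectome that is too sparse to threshold is rejected.
sparse = [
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
rejected = threshold_rows(sparse, keep_fraction=0.5)
```

Note that row-wise thresholding generally yields an asymmetric matrix; symmetry is restored only at the affinity stage.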

      We have also specified the use of the Schaefer 200-node parcellation in the first sentence on the second Results paragraph.

      The use of “streamline counts” is misleading, as the method uses SIFT2-weighted fibre bundle capacity rather than raw streamline counts. It would be better to refer to this measure as “SIFT2-weighted fibre bundle capacity” or “FBC”.

      We replaced all instances of “streamline counts” with “SIFT2-weighted fibre bundle capacity” as appropriate.

      Figure 2.c: Consider adding plots showing changes in eccentricity against (1) degree centrality, and (2) weighted local clustering coefficient. Additionally, a plot showing the relationship between age and mean eccentricity (averaged across nodes) at the individual level would be informative.

      We added the correlation between eccentricity and both degree centrality and the weighted local clustering coefficient and included them in our dominance analysis in Figure 2. In terms of the relationship between age and mean (global) eccentricity, these are plotted in Figure 3. 

      Figure 2.b: Considering the results of the following sections, it would be interesting to include additional KDE/violin plots to show group differences in the distribution of eccentricity within 7 different functional networks.

      As part of our analysis to parse neurotypicality and dataset effects, we tested for group differences in the distribution of structural and functional manifold eccentricity within each of the 7 functional networks in the referred and control portions of CALM and have included instances of significant differences with a coloured arrow to represent the direction of the difference within Figure 3. 

      Figure 3: Several panels lack axis labels for x and y axes. Adding these would improve clarity.

      To minimise the amount of text in Figure 3, we opted to include labels only for the global-level structural and functional results. However, to aid interpretation, we added a small schematic at the bottom of Figure 3 to represent all axis labels. 

      The statement that “differences between datasets only emerged when taking development into account” seems inaccurate. Differences in eccentricity are evident across datasets even before accounting for development (see Fig 2.b and the significance in the Scheirer-Ray-Hare test).

      We agree – differences in eccentricity across development and datasets are evident in structural and functional manifold eccentricity, as well as within structure-function coupling. However, effects of neurotypicality were particularly strong for the maturation of structure-function coupling, rather than magnitude. Therefore, we have rephrased this sentence in the Discussion (page 18, line 832):

      “Furthermore, group-level structural and functional gradients were highly consistent across datasets, whilst differences between datasets were emphasised when taking development into account, through differing rates of structural and functional manifold expansion, respectively, alongside maturation of structure-function coupling”.

      The handling of longitudinal data by adding a random effect for individuals is not clear in the main text. Mentioning this earlier could be helpful. 

      We have included this detail in the second sentence of the “developmental trajectories of structural manifold contraction and functional manifold expansion” results sub-section (page 11, line 503):

      “We included a random effect for each participant to account for longitudinal data”. 

      Figure 4.b: Why were ranks shown instead of actual coefficient of variation values? Consider including a cortical map visualization of the coefficients in the supplementary material.

      We visualised the ranks, instead of the actual coefficient of variation (CV) values, due to considerable variability and skew in the magnitude of the CV, ranging from 28.54 (in the right visual network) to 12865.68 (in the parietal portion of the left default-mode network), with a mean of 306.15. If we had visualised the raw CV values, these larger values would’ve been over-represented. We’ve also noticed and rectified an error in the labelling of the colour bar for Figure 4b: the minimum should be most variable (i.e. a rank of 1). To aid contextualisation of the ranks, we have added the following to the Results (page 14, line 626):

      “The distribution of cortical coefficients of variation (CV) varied considerably, with the largest CV (in the parietal division of the left default-mode network) being over 400 times that of the smallest (in the right visual network). The distribution of absolute CVs was positively skewed, with a Fisher skewness coefficient g<sub>1</sub> of 7.172, meaning relatively few regions had particularly high inter-individual variability, and highly peaked, with a kurtosis of 54.883, where a normal distribution has a skewness coefficient of 0 and a kurtosis of 3”. 
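
To make the rank-based visualisation concrete, the sketch below computes a coefficient of variation and assigns ranks so that rank 1 is the most variable region. It assumes the CV is expressed as a percentage (100 × SD / |mean|), which the response does not state explicitly; only the smallest and largest CVs are taken from the text, and the middle value is hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """CV as a percentage: 100 * sample SD / |mean| (percentage form is an assumption)."""
    return 100.0 * statistics.stdev(values) / abs(statistics.mean(values))


def rank_most_variable_first(cvs):
    """Assign rank 1 to the largest CV, matching the corrected colour bar."""
    order = sorted(range(len(cvs)), key=lambda i: cvs[i], reverse=True)
    ranks = [0] * len(cvs)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


# Smallest and largest CVs as reported; the middle value is hypothetical.
cvs = [28.54, 150.0, 12865.68]
ranks = rank_most_variable_first(cvs)
ratio = max(cvs) / min(cvs)  # how many times larger the largest CV is
```

Ranking removes the influence of the extreme upper tail, which is why a colour map of ranks is easier to read than one of raw CVs spanning several orders of magnitude.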

      Reviewer #2 (Public review):

      Some differences in developmental trajectories between CALM and NKI (e.g. Figure 4d) are not explained. Are these differences expected, or do they suggest underlying factors that require further investigation?

      This is a great point, and we appreciate the push to give a fuller explanation. It is very hard to know whether these effects are expected or not. We certainly don’t know of any other papers that have taken this approach. In response to the reviewer’s point, we decided to run some more analyses to better understand the differences. Having observed stronger age effects on structure-function coupling within the neurotypical NKI dataset, compared to the absent effects in the neurodivergent portion of CALM, we wanted to follow up and test whether coupling really is more sensitive to the neurodivergent versus neurotypical difference between CALM and NKI (rather than, say, scanner or site effects). In short, we find stronger developmental effects of coupling within the neurotypical portion of CALM than within the neurodivergent portion, and have added this to the Results (page 15, line 701):

      “To further examine whether a closer correspondence of structure-function coupling with age is associated with neurotypicality, we conducted a follow-up analysis using the additional age-matched neurotypical portion of CALM (N = 77). Given the widespread developmental effects on coupling within the neurotypical NKI sample, compared to the absent effects in the neurodivergent portion of CALM, we would expect strong relationships between age and structure-function coupling within the neurotypical portion of CALM. This is indeed what we found: structure-function coupling showed a linear negative relationship with age globally (F = 16.76, p<sub>FDR</sub> < 0.001, adjusted R<sup>2</sup> = 26.44%), alongside fronto-parietal (F = 9.24, p<sub>FDR</sub> = 0.004, adjusted R<sup>2</sup> = 19.24%), dorsal-attention (F = 13.162, p<sub>FDR</sub> = 0.001, adjusted R<sup>2</sup> = 18.14%), ventral attention (F = 11.47, p<sub>FDR</sub> = 0.002, adjusted R<sup>2</sup> = 22.78%), somato-motor (F = 17.37, p<sub>FDR</sub> < 0.001, adjusted R<sup>2</sup> = 21.92%) and visual (F = 11.79, p<sub>FDR</sub> = 0.002, adjusted R<sup>2</sup> = 20.81%) networks. Together, this supports our hypothesis that within neurotypical children and adolescents, structure-function coupling decreases with age, showing a stronger effect compared to their neurodivergent counterparts, in tandem with the emergence of higher-order cognition. Thus, whilst the magnitude of structure-function coupling across development appeared insensitive to neurotypicality, its maturation is sensitive. Tentatively, this suggests that neurotypicality is linked to stronger and more consistent maturational development of structure-function coupling, whereby the tethering of functional connectivity to structure across development is adaptive”.

      In conjunction with the Reviewer’s later request to deepen the Discussion, we have included an additional paragraph attempting to explain the differences in neurodevelopmental trajectories of structure-function coupling (Page 19, Line 924):

      “Whilst the spatial patterning of structure-function coupling across the cortex has been extensively documented, as explained above, less is known about developmental trajectories of structure-function coupling, or how such trajectories may be altered in those with neurodevelopmental conditions. To our knowledge, only one prior study has examined differences in developmental trajectories of (non-manifold) structure-function coupling in typically-developing children and those with attention-deficit hyperactivity disorder (Soman et al., 2023), one of the most common conditions in the neurodivergent portion of CALM. Namely, using cross-sectional and longitudinal data from children aged between 9 and 14 years old, they demonstrated increased coupling across development in higher-order regions overlapping with the default-mode, salience, and dorsal attention networks, in children with ADHD, with no significant developmental change in controls, thus encompassing an ectopic developmental trajectory (Di Martino et al., 2014; Soman et al., 2023). Whilst the current work does not focus on any condition, rather the broad mixed population of young people with neurodevelopmental symptoms (including those with and without diagnoses), there are meaningful individual and developmental differences in structure-function coupling. Crucially, it is not the case that simply having stronger coupling is desirable. The current work reveals that there are important developmental trajectories in structure-function coupling, suggesting that it undergoes considerable refinement with age. Note that whilst the magnitude of structure-function coupling across development did not differ significantly as a function of neurodivergence, its relationship to age did. Our working hypothesis is that structural connections allow for the ordered integration of functional areas, and the gradual functional modularisation of the developing brain. For instance, those with higher cognitive ability show a stronger refinement of structure-function coupling across development. Future work in this space needs to better understand not just how structural or functional organisation change with time, but rather how one supports the other”.

      The use of COMBAT may have excluded extreme participants from both datasets, which could explain the lack of correlations found with psychopathology.

      COMBAT does not exclude participants from datasets but simply adjusts connectivity estimates. So, the use of COMBAT will not be impacting the links with psychopathology by removing participants. But this did get us thinking. Excluding participants based on high motion may have systematically removed those with high psychopathology scores, meaning incomplete coverage. In other words, we may be under-representing those at the more extreme end of the range, simply because their head-motion levels are higher and thus are more likely to be excluded. We found that despite certain high-motion participants being removed, we still had good coverage of those with high scores and were therefore sensitive within this range. We have added the following to the revised Methods section (Page 26, Line 1338):

      “As we removed participants with high motion, this may have overlapped with those with higher psychopathology scores, resulting in incomplete coverage. To examine coverage and sensitivity to broad-range psychopathology following quality control, we calculated the Fisher-Pearson skewness statistic g<sub>1</sub> for each of the 6 Conners t-statistic measures and the proportion of youth with a t-statistic equal to or greater than 65, indicating an elevated or very elevated score. Measures of inattention (g<sub>1</sub> = 0.11, 44.20% elevated), hyperactivity/impulsivity (g<sub>1</sub> = 0.48, 36.41% elevated), learning problems (g<sub>1</sub> = 0.45, 37.36% elevated), executive functioning (g<sub>1</sub> = 0.27, 38.16% elevated), aggression (g<sub>1</sub> = 1.65, 15.58% elevated), and peer relations (g<sub>1</sub> = 0.49, 38% elevated) were positively skewed, and each included at least 15% of children with elevated or very elevated scores, suggesting sufficient coverage of those with extreme scores”.
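
The two coverage statistics used above can be sketched as follows: the Fisher-Pearson skewness coefficient g1 (third central moment divided by the second central moment raised to the power 1.5) and the percentage of scores at or above the cut-off of 65. The function names and the scores are hypothetical; whether the authors applied a small-sample bias correction to g1 is not stated, so none is applied here.

```python
def fisher_pearson_g1(values):
    """Fisher-Pearson skewness g1 = m3 / m2**1.5, using population (biased) moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def percent_elevated(t_scores, cutoff=65):
    """Percentage of scores at or above the cut-off (elevated or very elevated)."""
    return 100.0 * sum(t >= cutoff for t in t_scores) / len(t_scores)


# Hypothetical scores for ten children on one scale.
scores = [45, 50, 52, 55, 58, 60, 66, 70, 78, 90]
g1 = fisher_pearson_g1(scores)
elevated = percent_elevated(scores)
```

A positive g1 together with a sizeable elevated percentage is the pattern the authors use to argue that high scorers remain represented after quality control.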

      There is no discussion of whether the stable patterns of brain organization could result from preprocessing choices or summarizing data to the mean. This should be addressed to rule out methodological artifacts. 

      This is a brilliant point. We are necessarily using a very lengthy pipeline, with many design choices to explore structural and functional gradients and their intersection. In conjunction with the Reviewer’s later suggestion to deepen the Discussion, we have added the following paragraph which details the sensitivity analyses we carried out to confirm the observed stable patterns of brain organization (Page 18, Line 863):

      “That is, whilst we observed developmental refinement of gradients, in terms of manifold eccentricity, standard deviation, and variance explained, we did not observe replacement. Note, as opposed to calculating gradients based on group data, such as a sliding window approach, which may artificially smooth developmental trends and summarise them to the mean, we used participant-level data throughout. Given the growing application of gradient-based analyses in modelling structural (He et al., 2025; Li et al., 2024) and functional (Dong et al., 2021; Xia et al., 2022) brain development, we hope to provide a blueprint of factors which may affect developmental conclusions drawn from gradient-based frameworks”.

      Although imputing missing data was necessary, it would be useful to compare results without imputed data to assess the impact of imputation on findings. 

      It is very hard to know the impact of imputation without simply removing those participants with some imputed data. Using a simulation experiment, we expressed the imputation accuracy as the root mean squared error normalized by the range of observable data in each scale. This produced a percentage error margin. We demonstrate that imputation accuracy across all measures is at worst within approximately 11% of the observed data, and at best within approximately 4% of the observed data, and have included the following in the revised Methods section (Page 27, Line 1348):

      “Missing data

      To avoid a loss of statistical power, we imputed missing data. 27.50% of the sample had one or more missing psychopathology or cognitive measures (equal to 7% of all values), and the data was not missing at random: using a Welch’s t-test, we observed a significant effect of missingness on age [t (264.479) = 3.029, p = 0.003, Cohen’s d = 0.296], whereby children with missing data (M = 12.055 years, SD = 3.272) were younger than those with complete data (M = 12.902 years, SD = 2.685). Using a subset with complete data (N = 456), we randomly sampled 10% of the values in each column with replacement and assigned those as missing, thereby mimicking the proportion of missingness in the entire dataset. We conducted KNN imputation (uniform weights) on the subset with complete data and calculated the imputation accuracy as the root mean squared error normalized by the observed range of each measure. Thus, each measure was assigned a percentage which described the imputation margin of error. Across cognitive measures, imputation was within a 5.40% mean margin of error, with the lowest imputation error in the Trail motor speed task (4.43%) and highest in the Trails number-letter switching task (7.19%). Across psychopathology measures, imputation exhibited a mean 7.81% error margin, with the lowest imputation error in the Conners executive function scale (5.75%) and the highest in the Conners peer relations scale (11.04%). Together, this suggests that imputation was accurate”.
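
The masking experiment described above can be sketched as below. This is a simplified illustration under several assumptions: the imputer is a hand-written uniform-weight nearest-neighbour mean rather than the library the authors used, cells are masked without replacement in a single column, the number of neighbours is arbitrary, and the data are simulated.

```python
import math
import random


def knn_impute(rows, k=5):
    """Fill None cells with the unweighted mean of the k nearest donor rows.

    Distance is the root mean squared difference over columns observed in
    both rows; donors must have the target column observed.
    """
    def distance(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            filled[i][j] = sum(nearest) / len(nearest)  # uniform weights
    return filled


def normalised_rmse(truth, imputed, mask, column):
    """RMSE over masked cells of one column, as a percentage of its observed range."""
    cells = [i for i, j in mask if j == column]
    squared = sum((truth[i][column] - imputed[i][column]) ** 2 for i in cells)
    observed = [r[column] for r in truth]
    return 100.0 * math.sqrt(squared / len(cells)) / (max(observed) - min(observed))


rng = random.Random(0)
# Simulated complete data: two correlated measures for 50 participants.
complete = []
for _ in range(50):
    x = rng.uniform(0, 10)
    complete.append([x, 2 * x + rng.gauss(0, 1)])

# Mask 10% of the values in the second column, then impute them back.
mask = [(i, 1) for i in rng.sample(range(50), 5)]
observed = [list(r) for r in complete]
for i, j in mask:
    observed[i][j] = None
imputed = knn_impute(observed, k=5)
error_pct = normalised_rmse(complete, imputed, mask, column=1)
```

Dividing the error by the observed range puts every measure on a common percentage scale, which is what allows the error margins of differently scaled cognitive and psychopathology measures to be compared.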

      The results section is extensive, with many reports, while the discussion is relatively short and lacks in-depth analysis of the findings. Moving some results into the discussion could help balance the sections and provide a deeper interpretation.

      We agree with the Reviewer and appreciate the nudge to expand the Discussion section. We have added 4 sections to the Discussion. The first explores the importance of the default-mode network as a region whose coupling is most consistently predicted by working memory across development and phenotypes, in terms of its underlying anatomy (Paquola et al., 2025) (Page 20, Line 977):

      “An emerging theme from our work is the importance of the default-mode network as a region in which structure-function coupling is reliably predicted by working memory across neurodevelopmental phenotypes and datasets during childhood and adolescence. Recent neurotypical adult investigations combining high-resolution post-mortem histology, in vivo neuroimaging, and graph-theory analyses have revealed how the underlying neuroanatomy of the default-mode network may support diverse functions (Paquola et al., 2025), and thus exhibit lower structure-function coupling compared to unimodal regions. The default-mode network has distinct neuroanatomy compared to the remaining 6 intrinsic resting-state functional networks (Yeo et al., 2011), containing a distinctive combination of 5 of the 6 von Economo and Koskinas cell types (von Economo & Koskinas, 1925), with an over-representation of heteromodal cortex, and uniquely balancing output across all cortical types. A primary cytoarchitectural axis emerges, beyond which are mosaic-like spatial topographies. The duality of the default-mode network, in terms of its ability to both integrate and be insulated from sensory information, is facilitated by two microarchitecturally distinct subunits anchored at either end of the cytoarchitectural axis (Paquola et al., 2025). Whilst beyond the scope of the current work, structure-function coupling and their predictive value for cognition may also differ across divisions within the default-mode network, particularly given variability in the smoothness and compressibility of cytoarchitectural landscapes across subregions (Paquola et al., 2025)”.

      The second provides a deeper interpretation and contextualisation of the greater sensitivity of communicability, rather than functional connectivity, to neurodivergence (Page 19, Line 907):

      “We consider two possible factors to explain the greater sensitivity of neurodivergence to gradients of communicability, rather than functional connectivity. First, functional connectivity is likely more sensitive to head motion than structural-based communicability and suffers from reduced statistical power due to stricter head motion thresholds, alongside greater inter-individual variability. Second, whilst prior work contrasting functional connectivity gradients from neurotypical adults with those with confirmed ASD diagnoses demonstrated vertex-level reductions in the default-mode network in ASD and marginal increases in sensory-motor communities (Hong et al., 2019), indicating a sensitivity of functional connectivity to neurodivergence, important differences remain. Specifically, whilst the vertex-level group-level differences were modest, in line with our work, greater differences emerged when considering step-wise functional connectivity (SFC); in other words, when considering the dynamic transitions of or information flow through the functional hierarchy underlying the static functional connectomes, such that ASD was characterised by initial faster SFC within the unimodal cortices followed by a lack of convergence within the default-mode network (Hong et al., 2019). This emphasis on information flow and dynamic underlying states may point towards greater sensitivity of neurodivergence to structural communicability – a measure directly capturing information flow – than static functional connectivity”.

      The third paragraph situates our work within a broader landscape of reliable brain-behaviour relationships, focusing on the strengths of combining clinical and normative samples to refine our interpretation of the relationship between gradients and cognition, as well as the importance of equifinality in developmental predictive work (Page 20, line 994):

      “In an effort to establish more reliable brain-behaviour relationships despite not having the statistical power afforded by large-scale, typically normative, consortia (Rosenberg & Finn, 2022), we demonstrated the development-dependent link between default-mode structure-function coupling and working memory generalised across clinical (CALM) and normative (NKI) samples, across varying MRI acquisition parameters, and harnessing within- and across-participant variation. Such multivariate associations are likely more reliable than their univariate counterparts (Marek et al., 2022), but can be further optimised using task-related fMRI (Rosenberg & Finn, 2022). The consistency, or lack thereof, of developmental effects across datasets emphasises the importance of validating brain-behaviour relationships in highly diverse samples. Particularly evident in the case of structure-function coupling development, through our use of contrasting samples, is equifinality (Cicchetti & Rogosch, 1996), a key concept in developmental neuroscience: namely, similar ‘endpoints’ of structure-function coupling may be achieved through different initialisations dependent on working memory”.

      The fourth paragraph details methodological limitations in response to Reviewer 1’s suggestions to justify the exclusion of subcortical regions and consider the role of spatial smoothing in structural connectome construction as well as the threshold for filtering short streamlines.

      While the methods are thorough, it is not always clear whether the optimal approaches were chosen for each step, considering the available data. 

      In response to Reviewer 1’s concerns, we conducted several sensitivity analyses to evaluate the robustness of our results in terms of procedure. Specifically, we evaluated the impact of thresholding (full or sparse), level of analysis (individual or group gradients), construction of the structural connectome (communicability or fibre bundle capacity), Procrustes rotation (alignment to group-level gradients before Procrustes), tracking the variance explained in individual connectomes by group-level gradients, impact of head motion, and distinguishing between site and neurotypicality effects. All these analyses converged on the same conclusion: whilst we observe some developmental refinement in gradients, we do not observe replacement. We refer the reviewer to their third point, about whether stable patterns of brain organization were artefactual. 

      The introduction is overly long and includes numerous examples that can distract readers unfamiliar with the topic from the main research questions. 

      We have removed the following from the Introduction, reducing it to just under 900 words:

      “At a molecular level, early developmental patterning of the cortex arises through interacting gradients of morphogens and transcription factors (see Cadwell et al., 2019). The resultant areal and progenitor specialisation produces a diverse pool of neurones, glia, and astrocytes (Hawrylycz et al., 2015). Across childhood, an initial burst in neuronal proliferation is met with later protracted synaptic pruning (Bethlehem et al., 2022), the dynamics of which are governed by an interplay between experience-dependent synaptic plasticity and genomic control (Gottlieb, 2007)”.

      “The trends described above reflect group-level developmental trends, but how do we capture these broad anatomical and functional organisational principles at the level of an individual?”

      We’ve also trimmed the second Introduction paragraph so that it includes fewer examples, such as removal of the wiring-cost optimisation that underlies structural brain development, as well as removing specific instances of network segregation and integration that occur throughout childhood.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      Strengths: 

      The work uses a simple and straightforward approach to address the question at hand: is dynein a processive motor in cells? Using a combination of TIRF and spinning disc confocal microscopy, the authors provide a clear and unambiguous answer to this question. 

      Thank you for recognizing the strength of our work.

      Weaknesses: 

      My only significant concern (which is quite minor) is that the authors focus their analysis on dynein movement in cells treated with docetaxel, which could potentially affect the observed behavior. However, this is likely necessary, as without it, motility would not have been observed due to the 'messiness' of dynein localization in a typical cell (e.g., plus end-tracking in addition to cargo transport).

      You are exactly correct that this treatment was required to provide us with a clear view of motile dynein and p50 puncta. One concern about the treatment that we had noted in our original submission was that the docetaxel derivative SiR-Tubulin could increase microtubule detyrosination, which has been implicated in affecting the initiation of dynein-dynactin motility but not motility rates (doi: 10.15252/embj.201593071). In response to a comment from reviewer 2, we investigated whether there was a significant increase in alpha-tubulin detyrosination in our treatment conditions and found that there was not. We have removed the discussion of this possibility from the revised version. Please also see our response to the comments raised by reviewer 2.

      Reviewer 1 (Recommendations for the authors):

      Major points: 

      (1) The authors measured kinesin-1-GFP intensities in a different cell line (Drosophila S2 cells) than what was used for the DHC and p50 measurements (HeLa cells). It is unclear if this provides a fair comparison given the cells provide different environments for the GFP. Although the differences may in fact be trivial, without somehow showing this is indeed a fair comparison, it should at least be noted as a caveat when interpreting relative intensity differences. Alternatively, the authors could compare DHC and p50 intensities to those measured from HeLa cells treated with taxol.

      Thank you for this suggestion. We conducted new rounds of imaging with the DHC-EGFP and p50-EGFP clones in conjunction with HeLa cells transiently expressing human kinesin-1-EGFP and now present the datasets from the new experiments. Importantly, our new data were entirely consistent with the prior analyses: there was no significant difference between the kinesin-1-EGFP dimer intensities and the DHC-EGFP puncta intensities, whereas the p50 puncta were significantly dimmer, at approximately half the intensity of the kinesin-1 and DHC puncta. We have moved the old data comparing the intensities in S2 cells expressing kinesin-1-EGFP to Figure 3 - figure supplement 2 A-D, and the new HeLa cell data are now shown in Figure 3 D-G.

      (2) Given the low number of observations (41-100 puncta), I think a scatter plot showing all data points would offer readers a more transparent means of viewing the single-molecule data presented in Figures 3A, B, C, and G. I also didn't see 'n' values for plots shown in Figure 3. 

      The box and whisker plots have now been replaced with scatter plots showing all data points. The accompanying ‘n’ values have been included in the figure 3 legend as well as the histograms in figures 1 and 2 that are represented in the comparative scatter plots.  

      (3) Given the authors have produced a body of work that challenges conclusions from another pre-print (Tirumala et al., 2022 bioRxiv) - specifically, that dynein is not processive in cells - I think it would be useful to include a short discussion about how their work challenges theirs. For example, one significant difference between the two experimental systems that may account for the different observations could simply be that the authors of the Tirumala study used a mouse DHC (in HeLa cells), which may not have the ability to assemble into active and processive dynein-dynactin-adaptor complexes. 

      Thank you for pointing this out! At the time we submitted our manuscript we were conflicted about citing a pre-print that had not been peer reviewed simply to point out the discrepancy. If we had done so at that time we would have proposed the exact potential technical issue that you have proposed here. However, at the time we felt it would be better for these issues to be addressed through the review process. Needless to say, we agree with your interpretation and now that the work is published (Tirumala et al. JCB, 2024) it is entirely appropriate to add a discussion on Tirumala et al. where contradictory observations were reported. 

      The following statement has been added to the manuscript: 

      “In contrast, a separate study (Tirumala et al., 2024) reported that dynein is not highly processive, typically exhibiting runs of very short duration (~0.6 s) in HeLa cells. A notable technical difference that may account for this discrepancy is that our study visualizes endogenously tagged human DHC, whereas Tirumala et al. characterized over-expressed mouse DHC in HeLa cells. Over-expression of the DHC may result in an imbalance of the subunits that comprise the active motor complex, leading to inactive, or less active complexes. Similarly, mouse DHC may not have the ability to efficiently assemble into active and processive dynein-dynactin-adaptor complexes to the same extent as human DHC.”

      Minor points: 

      (1) "Specifically, the adaptor BICD2 recruited a single dynein to dynactin while BICDR1 and HOOK3 supported assembly of a "double dynein" complex." It would be more accurate to say that dynein-dynactin complexes assembled with Bicd2 "tend to favor single dynein, and the Bicdr1 and Hook3 tend to favor two dyneins" since even Bicd2 can support assembly of 2 dynein-1 dynactin complexes (see Urnavicius et al, Nature 2018). 

      Thank you, the manuscript has been edited to reflect this point. 

      (2) "Human HeLa cells were engineered using CRISPR/Cas9 to insert a cassette encoding FKBP and EGFP tags in the frame at the 3' end of the dynein heavy chain (DYNC1H1) gene (SF1)." It is unclear to what "SF1" is referring. 

      SF1 is supplementary figure 1, which we have now clarified as being Figure 1 – figure supplement 1A.

      (3) "The SiR-Tubulin-treated cells were subjected to two-color TIRFM to determine if the DHC puncta exhibited motility and; indeed, puncta were observed streaming along MTs..." This sentence is strangely punctuated (the ";" is likely a typo?). 

      Thank you for pointing this out, the typo has been corrected and the sentence now reads:

      “The SiR-Tubulin-treated cells were subjected to two-color TIRFM and DHC-EGFP puncta were clearly observed streaming on SiR-Tubulin-labeled MTs, which was especially evident on MTs that were pinned between the nucleus and the plasma membrane (Video 3).”

      (4) I am unfamiliar with the "MK" acronym shown above the molecular weight ladders in Figure 3H and I. Did the authors mean to use "MW" for molecular weight? 

      We intended this to mean MW and the typo has been corrected.

      (5) "This suggests that the cargos, which we presume motile dynein-dynactin puncta are bound to, any kinesins..." This sentence is confusing as written. Did the authors mean "and kinesins"? 

      Agreed. We have changed this sentence to now read: 

      “The velocity and low switching frequency of motile puncta suggest that any kinesin motors associated with cargos being transported by the dynein-dynactin visualized here are inactive and/or cannot effectively bind the MT lattice during dynein-dynactin-mediated transport in interphase HeLa cells.”

      Reviewer 2 (Recommendations for the authors):

      (1) I am confused as to why the authors introduced an FKBP tag to the DHC and no explanation is given. Is it possible this tag induces artificial dimerization of the DHC? 

      FKBP was tagged to DHC for potential knock sideways experiments. Since the current cell line does not express the FKBP counterpart FRB, having FKBP alone in the cell line would not lead to artificial dimerization of DHC.

      (2) The authors use a high concentration of SiR-tubulin (1uM) before washing it out. However, they observe strong effects on MT dynamics. The manufacturer states that concentrations below 100nM don't affect MT dynamics, so I am wondering why the authors are using such a high amount that leads to cellular phenotypes. 

      We would like to note that in our hands even 100 nM SiR-Tubulin impacted MT dynamics if it was incubated for enough time to get a bright signal for imaging, which makes sense since drugs like docetaxel and taxol become enriched in cells over time. Thus, it was a trade-off between the extent/brightness of labeling and the effects on MT dynamics. We opted for shorter incubation with a higher concentration of SiR-Tubulin to achieve rapid MT labeling and efficient suppression of plus-end MT polymerization. This approach proved useful for our needs since the loss of the tip-tracking pool of DHC provided a clearer view of the motile population of MT-associated DHC.

      (3) The individual channels should be labeled in the supplemental movies. 

      They have now been labelled.

      (4) I would like to see example images and kymographs of the GFP-Kinesin-1 control used for fluorescent intensity analysis. Further, the authors use the mean of the intensity distribution, but I wonder why they don't fit the distribution to a Gaussian instead, as that seems more common in the field to me. Do the data fit well to a Gaussian distribution? 

      Example images and kymographs of the kinesin-1-EGFP control HeLa cells used for the updated fluorescent intensity analysis have now been added to the manuscript in Figure 3 - figure supplement 1. The kinesin-1-EGFP transiently expressed in HeLa cells exhibited a slower mean velocity and a shorter run length than the endogenously tagged HeLa dynein-dynactin. Regarding the distribution, we applied 6 normality tests to the new datasets acquired with DHC and p50 in comparison to human kinesin-EGFP in HeLa cells. While we are confident concluding that the data for p50 were normally distributed (p > 0.05 in 6/6), it was more difficult to reach conclusions about the normality of the datasets for kinesin-1 (p > 0.05 in 4/6) and DHC (p > 0.05 in 1/6). We have decided to report the data as scatter plots (per the suggestion in major point 2 by reviewer 1) in the new Figure 3G since it could be misleading to fit a non-normal distribution with a single Gaussian. We note that the likely non-normal distribution of the DHC data (since it “passed” only 1/6 normality tests) could reflect the presence of other populations (e.g. 1 DHC-EGFP in a motile punctum), but we could not confidently conclude this either, since attempting to fit the data with a double Gaussian did not pass statistical muster. Indeed, as stated in the text on lines 197-198, we do not exclude that the range of DHC intensities measured here may include sub-populations of complexes containing a single dynein dimer with one DHC-EGFP molecule.

      Ultimately, we feel the safest conclusion is that there was not a statistically significant difference between the DHC and kinesin-1 dimers (p = 0.32) but there was a statistically significant difference between both the DHC and kinesin-1 dimers compared to the p50 (p values < 0.001), which was ~50% the intensity of DHC and kinesin-1. Altogether this leads us to the fairly conservative conclusion that DHC puncta contain at least one dimer while the p50 puncta likely contain a single p50-EGFP molecule.

      (5) The authors suggest the microtubules in the cells treated with SiR-tubulin may be more detyrosinated due to the treatment. Why don't they measure this using well-characterized antibodies that distinguish tyrosinated/detyrosinated microtubules in cells treated or not with SiR-tubulin? 

      At your suggestion, we carried out the experiment and found that under our labeling conditions there was not a notable difference in microtubule detyrosination between DMSO- and SiR-Tubulin-treated cells. Thus, we have removed this caveat from the revised manuscript.

      (6) "While we were unable to assess the relative expression levels of tagged versus untagged DHC for technical reasons." Please describe the technical reasons for the inability to measure DHC expression levels for the reader.

      We made several attempts to quantify the relative amounts of untagged and tagged protein by Western blotting. The high molecular weight of DHC (~500kDa) makes it difficult to resolve it on a conventional mini gel. We attempted running a gradient mini gel (4%-15%), and doing a western blot; however, we were still unable to detect DHC. To troubleshoot, the experiments were repeated with different dilutions of a commercially available antibody and varying concentrations of cell lysate; however, we were unable to obtain a satisfactory result. 

      We hold the view that even if it had worked, it would have been difficult to detect a relatively small difference between the untagged (MW = 500 kDa) and tagged DHC (MW = 527 kDa) by western blot. We have added language to this effect in the revised manuscript.

      Reviewer #3 (Public Review):

      (1). CRISPR-edited HeLa clones: 

      (i) The authors indicate that both the DHC-EGFP and p50-EGFP lines are heterozygous and that the level of DHC-EGFP was not measured due to technical difficulties. However, quantification of the relative amounts of untagged and tagged DHC needs to be performed - either using Western blot, immunofluorescence or qPCR comparing the parent cell line and the cell lines used in this work. 

      See response to reviewer 2 above. 

      (ii) The localization of DHC predominantly at the plus tips (Fig. 1A) is at odds with other work where endogenous or close-to-endogenous levels of DHC were visualized in HeLa cells and other non-polarized cells like HEK293, A-431 and U-251MG (e.g., OpenCell (https://opencell.czbiohub.org/target/CID001880), Human Protein Atlas, https://www.biorxiv.org/content/10.1101/2021.04.05.438428v3). The authors should perform immunofluorescence of DHC in the parental cells and DHC-EGFP cells to confirm there are no expression artifacts in the latter. Additionally, a comparison of the colocalization of DHC with EB1 in the parental and DHC-EGFP and p50-EGFP lines would be good to confirm MT plus-tip localisation of DHC in both lines.

      The microtubule (MT) plus-tip localization of DHC was already observed in the 1990s, as evidenced by publications such as (PMID:10212138) and (PMID:12119357), and was further confirmed by Kobayashi and Murayama in 2009 (PMID:19915671). We hold the view that further investigation into this localization is not worthwhile since the tip-tracking behavior of DHC-dynactin has been long-established in the field.

      (iii) It would also be useful to see entire fields of view of cells expressing DHC-EGFP and p50-EGFP (e.g. in Spinning Disk microscopy) to understand if there is heterogeneity in expression. Similarly, it would be useful to report the relative levels of expression of EGFP (by measuring the total intensity of EGFP fluorescence per cell) in those cells employed for the analysis in the manuscript.

      Representative images of fields have been added as Figure 1 - figure supplement 1B and Figure 2 – figure supplement 1 in the revised manuscript. We did not see drastic cell-to-cell variation of expression within the clonal cell lines.

      (iv) Given that the authors suspect there is differential gene regulation in their CRISPR-edited lines, it cannot be concluded that the DHC-EGFP and p50-EGFP punctae tracked are functional and not piggybacking on untagged proteins. The authors could use the FKBP part of the FKBP-EGFP tag to perform knock-sideways of the DHC and p50 to the plasma membrane and confirm abrogation of dynein activity by visualizing known dynein targets such as the Golgi (Golgi should disperse following recruitment of EGFP-tagged DHC-EGFP or p50-EGFP to the PM), or EGF (movement towards the cell center should cease).

      Despite trying different concentrations and extensive troubleshooting, we were not able to replicate the reported observations of Ciliobrevin D or Dynarrestin during mitosis. We would like to emphasize that the velocity (1.2 μm/s) of dynein-dynactin complexes that we measured in HeLa cells was comparable to those measured in iNeurons by Fellows et al. (PMID: 38407313) and for unopposed dynein under in vitro conditions. 

      (2) TIRFM and analysis:

      (i) What was the rationale for using TIRFM given its limitation of visualization at/near the plasma membrane? Are the authors confident they are in TIRF mode and not HILO, which would fit with the representative images shown in the manuscript? 

      To avoid overcrowding, it was important to image the MT tracks that were pinned between the nucleus and the plasma membrane. It is unclear to us why the reviewer feels that true TIRFM could not be used to visualize the movement of dynein-dynactin on this population of MTs, since the plasma membrane is ~3-5 nm thick and a MT is ~25-27 nm in diameter, all of which would fall well within the 100-200 nm excitable range of the evanescent wave produced by TIRF. While we feel TIRF can effectively visualize dynein-dynactin motility in cells, we have mentioned in the materials and methods the possibility that some imaging may be HILO microscopy.

      (ii) At what depth are the authors imaging DHC-EGFP and p50-EGFP? 

      The imaging depth of traditional TIRFM is limited to around 100-200 nm. In adherent interphase HeLa cells the nucleus is in very close proximity (nanometer not micron scale) to the plasma membrane with some cytoskeletal filaments (actin) and microtubules positioned between the plasma membrane and the nuclear membrane. The fact that we were often visualizing MTs positioned between the nucleus and the membrane makes us confident that we were imaging at a depth (100 - 200nm) consistent with TIRFM. 
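For orientation, the 100-200 nm figure quoted above can be related to the standard expression for the evanescent-field penetration depth, d = λ / (4π √(n₁² sin²θ − n₂²)), where the excitation intensity decays as exp(−z/d). The sketch below is illustrative only: the wavelength (488 nm), the refractive indices (1.518 for the coverslip, 1.37 for the cell interior), and the incidence angles are assumed values, not parameters reported in this response.

```python
import math


def critical_angle_deg(n_glass, n_sample):
    """Smallest incidence angle (degrees) at which total internal reflection occurs."""
    return math.degrees(math.asin(n_sample / n_glass))


def penetration_depth_nm(wavelength_nm, n_glass, n_sample, angle_deg):
    """1/e intensity decay length of the evanescent field, in nm."""
    theta = math.radians(angle_deg)
    term = (n_glass * math.sin(theta)) ** 2 - n_sample ** 2
    if term <= 0:
        raise ValueError("angle is at or below the critical angle: no evanescent field")
    return wavelength_nm / (4 * math.pi * math.sqrt(term))


# Assumed, illustrative optics: 488 nm laser, glass coverslip, cell interior.
WAVELENGTH_NM, N_GLASS, N_CELL = 488.0, 1.518, 1.37

print(f"critical angle: {critical_angle_deg(N_GLASS, N_CELL):.1f} deg")
for angle in (66.0, 70.0):
    depth = penetration_depth_nm(WAVELENGTH_NM, N_GLASS, N_CELL, angle)
    print(f"incidence {angle:.0f} deg -> penetration depth {depth:.0f} nm")
```

Under these assumptions, angles a few degrees above the critical angle give depths within the 100-200 nm range, and the depth grows sharply as the angle approaches the critical angle. Below the critical angle the illumination is no longer evanescent, which corresponds to the HILO regime.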

      (iii) The authors rely on manual inspection of tracks before analyzing them in kymographs - this is not rigorous and is prone to bias. They should instead track the molecules using single particle tracking tools (eg. TrackMate/uTrack), and use these traces to then quantify the displacement, velocity, and run-time. 

      Although automated single particle tracking tools offer several benefits, including reduced human effort, and scalability for large datasets, they often rely on specialized training datasets and do not generalize well to every dataset. The authors contend that under complex cellular environments human intervention is often necessary to achieve a reliable dataset. Considering the nature of our data we felt it was necessary to manually process the time-lapses. 

      (iv) It is unclear how the tracks that were eventually used in the quantification were chosen. Are they representative of the kind of movements seen? Kymographs of dynein movement along an entire MT/cell needs to be shown and all punctae that appear on MTs need to be tracked, and their movement quantified. 

      Considering the densely populated environment of a cell, it would be nearly impossible to quantify all of the puncta. We selected tracks for quantification, focusing on areas where MTs were pinned between the nucleus and plasma membrane where we could track the movement of a single dynein molecule and where the surroundings were relatively less crowded.

      (v) What is the directionality of the moving punctae? 

      In our experience, cells rarely organized their MTs in the textbook radial MT array meaning that one could not confidently conclude that “inward” movements were minus-end directed. Microtubule polarity was also not able to be determined for the MTs positioned between the plasma membrane and the nucleus on which many of the puncta we quantified were moving. It was clear that motile puncta moving on the same MT moved in the same direction with the exception of rare and brief directional switching events. What was more common than directional switching on the same MT were motile puncta exhibiting changes in direction at sharp (sometimes perpendicular) angles indicative of MT track switching, which is a well-characterized behavior of dynein-dynactin (See DOI: 10.1529/biophysj.107.120014).

      (vi) Since all the quantification was performed on SiR tubulin-treated cells, it is unclear if the behavior of dynein observed here reflects the behavior of dynein in untreated cells. Analysis of untreated cells is required. 

      It was important to quantify SiR tubulin-treated cells because SiR-Tubulin is a docetaxel derivative, and its addition suppressed plus-end MT polymerization resulting in a significant reduction in the DHC tip-tracking population and a clearer view of the motile population of MT-associated DHC puncta. Otherwise, it was challenging to reliably identify motile puncta given the abundance of DHC tip-tracking populations in untreated cells.  

      (3) Estimation of stoichiometry of DHC and p50 

      Given that the punctae of DHC-EGFP and p50 seemingly bleach on MT before the end of the movie, the authors should use photobleaching to estimate the number of molecules in their punctae, either by simply counting the number of bleaching steps or by measuring single-step sizes and estimating the number of molecules from the intensity of punctae in the first frame.

      Comparing the fluorescence intensity of a known molecule (in our case a kinesin-1-EGFP dimer) to calculate the numbers of an unknown protein molecule (in our case dynein or p50) is a widely accepted technique in the field. For example, refer to PMID: 29899040. To accurately estimate the stoichiometry of DHC and p50 and address the concerns raised by other reviewers, we expressed the human kinesin-EGFP in HeLa cells and analyzed the datasets from new experiments. We did not observe any significant differences between our old and new datasets.
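The arithmetic behind this kind of intensity calibration can be sketched as follows. This is a minimal illustration, not the analysis pipeline used here: the function name `estimate_egfp_copies` and every intensity value are hypothetical, and the intensities are assumed to be background-subtracted and acquired under identical imaging conditions.

```python
from statistics import mean


def estimate_egfp_copies(puncta_intensities, standard_intensities, standard_copies=2):
    """Estimate mean EGFP copies per punctum from a calibration standard.

    `standard_copies` is the known number of EGFP molecules in the standard
    (2 for a kinesin-1-EGFP dimer).
    """
    per_fluorophore = mean(standard_intensities) / standard_copies
    return mean(puncta_intensities) / per_fluorophore


# Hypothetical, background-subtracted intensities in arbitrary units.
kinesin_dimer = [980.0, 1040.0, 1010.0, 970.0]
puncta_a = [1000.0, 1020.0, 990.0]  # similar to the dimer standard
puncta_b = [510.0, 480.0, 505.0]    # about half the dimer standard

print(round(estimate_egfp_copies(puncta_a, kinesin_dimer), 2))
print(round(estimate_egfp_copies(puncta_b, kinesin_dimer), 2))
```

A ratio near 2 is consistent with two EGFP molecules per punctum and a ratio near 1 with a single molecule, provided the fluorophore behaves similarly in both fusions.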

      (4) Discussion of prior literature 

      Recent work visualizing the behavior of dyneins in HeLa cells (DOI:  10.1101/2021.04.05.438428), which shows results that do not align with observations in this manuscript, has not been discussed. These contradictory findings need to be discussed, and a more objective assessment of the literature in general needs to be undertaken.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public Review):

      Overall, it's a well-performed study, however, causality between Plscr1 and Ifnlr1 expression needs to be more firmly established. This is because two recent studies of PLSCR1 KO cells infected with different viruses found no major differences in gene expression levels compared with their WT controls (Xu et al. Nature, 2023; LePen et al. PLoS Biol, 2024). There were also defects in the expression of other cytokines (type I and II IFNs plus TNF-alpha) so a clear explanation of why Ifnlr1 was chosen should also be given.

      We appreciate the reviewer’s reference to the two recently published studies on PLSCR1’s role in SARS-CoV-2 infection. We have also discussed those studies in the Introduction and Discussion sections of this manuscript. Here, we would like to clarify our rationale for investigating Ifn-λr1 signaling.

      The reviewer mentioned “defects in the expression of other cytokines (type I and II IFNs plus TNF-alpha)” and requested a clearer explanation of why Ifnlr1 was chosen for study. In our investigation of IAV infection, we observed no defects in the expression of type I and II IFNs or TNF-α in Plscr1<sup>-/-</sup> mice; rather, these cytokines were expressed at even higher levels compared to WT controls (Figures 2D and 3A). This indicates that the type I and II IFN and TNF-α signaling pathways remain intact and are not negatively affected by the loss of Plscr1. Notably, Ifn-λr1 expression is the only one among all IFNs and their receptors that is significantly impaired in Plscr1<sup>-/-</sup> mice (Figure 3A), justifying our focused investigation of this receptor. To further clarify this point, we have expanded the explanation under the section titled “Plscr1 Binds to Ifn-λr1 Promoter and Activates Ifn-λr1 Transcription in IAV Infection” within the Results. The reviewer noted that previously published studies “found no major differences in gene expression levels compared with their WT controls”, but neither study examined Ifn-λr1 expression.

      (1) The authors propose that Plscr1 restricts IAV infection by regulating the type III IFN signaling pathway. While the data show a positive correlation between Ifnlr1 and Plscr1 levels in both mouse and cell culture models, additional evidence is needed to establish causality between the impaired type III IFN pathway, and the increased susceptibility observed in Plscr1-KO mice. To strengthen this conclusion, the following experiments could be undertaken: (i) Measure IAV titers in WT, Plscr1-KO, Ifnlr1-KO, and Plscr1/ Ifnlr1-double KO cells. If the antiviral activity of Plscr1 is highly dependent on Ifnlr1, there should be no further increase in IAV titers in double KO cells compared to single KO cells; (ii) over-express Plscr1 in Ifnlr1-KO cells to determine if it still inhibits IAV infection. If Plscr1's main action is to upregulate Ifnlr1, then it should not be able to rescue susceptibility since Ifnlr1 cannot be expressed in the KO background. If Plscr1 over-expression rescues viral susceptibility, then there are Ifnlr1-independent mechanisms involved. These experiments should help clarify the relative contribution of the type III IFN pathway to Plscr1-mediated antiviral immunity.

      We agree with the reviewer that additional evidence is necessary to establish causality between the impaired type III IFN pathway and the increased susceptibility observed in Plscr1-KO mice. As requested by the reviewer, and one step further, we have measured IAV titers in Wt, Plscr1<sup>-/-</sup>, Ifn-λr1<sup>-/-</sup>, and Plscr1<sup>-/-</sup>Ifn-λr1<sup>-/-</sup> mouse lungs, which provided us with more comprehensive information at the tissue and organismal level compared to cell culture models. Our results are detailed under “The Anti-Influenza Activity of Plscr1 Is Highly Dependent on Ifn-λr1” within “Results” section and in Supplemental Figure 5. Importantly, there was no further increase in weight loss (Supplemental Figure 5B), total BAL cell counts (Supplemental Figure 5C), neutrophil percentages (Supplemental Figure 5D), and IAV titers (Supplemental Figure 5E) in Plscr1<sup>-/-</sup>Ifn-λr1<sup>-/-</sup> mouse lungs compared to Ifn-λr1<sup>-/-</sup> mouse lungs. These findings indicate that the antiviral activity of Plscr1 is largely dependent on Ifn-λr1.

      We agree that overexpression of Plscr1 on an Ifn-λr1<sup>-/-</sup> background would provide additional evidence to support our conclusion from the Plscr1<sup>-/-</sup>Ifn-λr1<sup>-/-</sup> mice. In future studies, we plan to specifically overexpress Plscr1 in ciliated epithelial cells on the Ifn-λr1<sup>-/-</sup> background by breeding Plscr1<sup>floxStop</sup>Foxj1-Cre<sup>+</sup>Ifn-λr1<sup>-/-</sup> mice. In addition, ciliated epithelial cells isolated from Ifn-λr1<sup>-/-</sup> murine airways could be transduced with a Plscr1 construct for overexpression. We hypothesize that overexpression of Plscr1 in ciliated epithelial cells will not rescue susceptibility in Ifn-λr1<sup>-/-</sup> mice or cells, since our Plscr1<sup>-/-</sup>Ifn-λr1<sup>-/-</sup> mouse model suggests that Ifn-λr1-independent anti-influenza functions of Plscr1 are likely minor compared to its role in upregulating Ifn-λr1. These future plans have been added to the “Discussion” section, and we look forward to presenting our results in a forthcoming publication.

      (3) In Figure 4, the authors demonstrate the interaction between Plscr1 and Ifnlr1. They suggest that this interaction modulates IFN-λ signaling. However, Figures 5C-E show that the 5CA mutant, which lacks surface localization and the ability to bind Ifnlr1, exhibits similar anti-flu activity to WT Plscr1. Does this mean the interaction between Plscr1 and Ifnlr1 is dispensable for Plscr1-mediated antiviral function? Can the authors compare the activation of IFN-λ signaling pathway in Plscr1-KO cells expressing empty vector, WT Plscr1, and 5CA mutant? This could be done by measuring downstream ISG expression or using an ISRE-luciferase reporter assay upon IFN-λ treatment.

      We agree with the reviewer that downstream activation of the IFN-λ signaling pathway is a critical component of the proposed regulatory role of PLSCR1. As suggested, we attempted to perform an ISRE-luciferase reporter assay following IFN-λ treatment in PLSCR1 rescue cell lines by transfecting the cells with hGAPDH-rLuc (Addgene #82479) and pGL4.45 [luc2P/ISRE/Hygro] (Promega #E4041).

      Despite extensive efforts over several months, we were unable to achieve expression of pGL4.45 [luc2P/ISRE/Hygro] in PLSCR1 rescue cells using either Lipofectamine 3000 or electroporation, as no firefly luciferase activity was detected at baseline or following IFN-λ treatment. In contrast, hGAPDH-rLuc was robustly expressed in these cells.

      The pGL4.45 [luc2P/ISRE/Hygro] plasmid was obtained directly from Promega as a purified product, and its sequence was confirmed via whole plasmid sequencing. Additionally, both hGAPDH-rLuc and pGL4.45 [luc2P/ISRE/Hygro] were successfully expressed in 293T cells, indicating that neither the plasmids nor the transfection protocols are inherently faulty.

      We suspect that prior modifications to the PLSCR1 rescue cells, such as CRISPR-mediated knockout and lentiviral transduction, may interfere with successful transfection of pGL4.45 [luc2P/ISRE/Hygro] through an as-yet-unknown mechanism. Although these results are disappointing, we will continue troubleshooting and plan to report the findings in a separate manuscript once the luciferase assay is successfully established.

      Reviewer #1 (Recommendations):

      (1) In the introduction, the linkage between the paragraph discussing type III IFN and PLSCR1 needs to be better established. The mention of PLSCR1 being an ISG at the outset may help connect these two paragraphs and make the text appear more logical.

      We apologize for the lack of linkage and logic between type 3 IFN and PLSCR1. We have introduced PLSCR1 as an ISG at the beginning of its paragraph as recommended. 

      (2) The statement that, “Intriguingly, PLSCR1 is also an antiviral ISG, as its expression can be highly induced by type 1 and 2 interferons in various viral infections[15, 16]. However, whether its expression can be similarly induced by type 3 interferon has not been studied yet.” is incorrect. Xu et al. tested the role of PLSCR1 in type III IFN-induced control of SARS-CoV-2 (ref. 24). This needs to be revised.

      We apologize for the incorrect information in the introduction and have revised the paragraph with the proper citation.

      (3) In Figure 3B, can the authors provide a comprehensive heatmap that includes all ISGs above the threshold, rather than only a subset? This would offer a more complete overview of the changes in type I, II, and III IFN pathways in Plscr1-KO mice.

      As suggested by the reviewer, we have provided a comprehensive heatmap that includes all ISGs above the threshold in Figure 3C (previously Figure 3B). We identified a total of 1,113 ISGs in our dataset with a fold change ≥2. Enlarged heatmaps with gene names are provided in Supplemental Figure 1. Among those ISGs, 584 are regulated exclusively by type 1 IFNs, and 488 are regulated by both type 1 and type 2 interferons. Unfortunately, the Interferome database does not include information on type 3 IFN-inducible genes in mice[1]. Although many ISGs were robustly upregulated in Plscr1<sup>-/-</sup> infected lungs, consistent with inflammation data, a large subset of ISGs failed to be transcribed when Ifn-λr1 function was impaired, especially at 7 dpi. We suspect that those non-transcribed ISGs in Plscr1<sup>-/-</sup> mice may be specifically regulated by type 3 IFN and represent interesting targets for future research. These results have been added to “Plscr1 Binds to Ifn-λr1 Promoter and Activates Ifn-λr1 Transcription in IAV Infection” within “Results” section.

      (4) In Figure 3C, 5B and 7H, immunoblots should also be included to measure changes of Ifnlr1/IFNLR1 protein level.

      As requested by the reviewer, we have provided western blots measuring Ifn-λr1/IFN-λR1 protein levels in Figure 5B and 7I. Protein expression was consistent with the PCR results.

      (5) In Figure 3H, the amount of RPL30 is also low in the anti-PLSCR1-treated and IgG samples, making it difficult to estimate if ChIP binding is genuinely impacted.

      RPL30 Exon 3 serves as a negative control in the ChIP experiment and is not expected to bind either the anti-PLSCR1-treated or the IgG control samples. Anti-Histone H3 treatment is a positive control, with the treated sample expected to show binding to RPL30 Exon 3. We hope this clarification has addressed any further potential confusion from the reviewer.

      (6) In Figure 4A, can the authors show a larger slice of the gel with molecular weight markers for both Plscr1 and Ifnlr1. In the coIP, the binding may be indirect through intermediate partners. Proximity ligation assay is a more direct assay for interaction and can be stated as such.

      As suggested by the reviewer, we have included whole gel images of Figure 4A with molecular weight markers for both Plscr1 and Ifnlr1 in Supplemental Figure 3. We appreciate the reviewer’s affirmation of proximity ligation assay and have stated it as a more direct assay for interaction under “Plscr1 Interacts with Ifn-λr1 on Pulmonary Epithelial Cell Membrane in IAV Infection” in “Results” section.

      (7) In Figure 5A, how is the expression of PLSCR1 WT and mutants driven by an EF-1α promoter can be further upregulated by IAV infection? Can the authors also use immunoblots to examine the protein level of PLSCR1?

      We apologize for the confusion and appreciate the reviewer’s careful observation. We were initially surprised by this finding as well, but upon further investigation, we found that the human PLSCR1 primers used in our qRT-PCR assay can still amplify the undisturbed portion of the endogenous PLSCR1 mRNA, even in PLSCR1<sup>-/-</sup> cells. In the original Figure 5A, data for vector-transduced PLSCR1<sup>-/-</sup> cells were not included because PCR was not performed on those samples at the time. After conducting PCR for vector-transduced PLSCR1<sup>-/-</sup> cells, we detected transcription of PLSCR1, which confirms that the signal originates from endogenous DNA rather than from the EF-1α promoter-driven PLSCR1 plasmid. Please see Author response image 1 below.

      Author response image 1.

      The forward human PLSCR1 primer we used matches 15-34 nt of Wt PLSCR1, and the reverse primer matches 224-244 nt of Wt PLSCR1. CRISPR-Cas9 KO of PLSCR1 was mediated by sgRNAs in A549 cells and was performed by Xu et al[2]. sgRNA #1 matches 227-246 nt, sgRNA #2 matches 209-228 nt, and sgRNA #3 matches 689-708 nt of Wt PLSCR1. The sgRNAs likely introduced a short deletion or insertion that does not affect transcription. However, those endogenous mRNA transcripts cannot be translated into functional and detectable PLSCR1 proteins, as validated by our western blot (below), as well as western blots performed by Xu et al[2]. Therefore, our primers could amplify endogenous PLSCR1 transcripts upregulated by IAV infection, provided that 15-244 nt was not disrupted by CRISPR-Cas9 KO. By western blot, we confirmed that only endogenous PLSCR1 expression is upregulated by IAV infection, and exogenous protein expression from PLSCR1 plasmids driven by an EF-1α promoter is not upregulated by IAV infection.
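      The positional argument above can be checked with a short interval comparison. The following is only an illustrative sketch, not part of the authors' analysis: it assumes the quoted positions are 1-based, inclusive nucleotide coordinates on the Wt PLSCR1 transcript, and the names `overlap` and `report` are hypothetical helpers introduced here.

```python
# Coordinates are copied from the response text and ASSUMED to be
# 1-based, inclusive nucleotide positions on the Wt PLSCR1 transcript.
PRIMERS = {"forward": (15, 34), "reverse": (224, 244)}
SGRNAS = {
    "sgRNA #1": (227, 246),
    "sgRNA #2": (209, 228),
    "sgRNA #3": (689, 708),
}

# The amplicon runs from the start of the forward primer
# to the end of the reverse primer.
AMPLICON = (PRIMERS["forward"][0], PRIMERS["reverse"][1])


def overlap(a, b):
    """Return the shared (start, end) of two inclusive intervals, or None."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None


def report():
    """For each sgRNA target site, list its overlap with the amplicon and primers."""
    return {
        name: {
            "amplicon": overlap(site, AMPLICON),
            "forward": overlap(site, PRIMERS["forward"]),
            "reverse": overlap(site, PRIMERS["reverse"]),
        }
        for name, site in SGRNAS.items()
    }


if __name__ == "__main__":
    for name, hits in report().items():
        print(name, hits)
```

      Under these assumptions, the target sites of sgRNA #1 and sgRNA #2 overlap the reverse primer site, while sgRNA #3 lies outside the amplicon. Whether amplification is actually preserved depends on the specific edits introduced in the cells, which this comparison cannot determine.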

      Author response image 2.

      To avoid confusion, we have removed the original Figure 5A from the manuscript.

      (8) In Figure 5C, the loss of anti-flu activity with the H262Y mutant is modest, suggesting the loss of ifnlr1 transcription is only partly responsible for the susceptibility of Plscr1 KO cells. The anti-flu activity being independent of scramblase activity resembles the earlier discovery of SARS-CoV-2 (Xu et al., 2024). This could be stated in the results since it is an important point that scramblase activity is dispensable for several major human viruses and shifts the emphasis regarding mechanism. It has been appropriately noted in the discussion.

      We appreciate the comments and have acknowledged the consistency of our results with those of Xu et al. under “Both Cell Surface and Nuclear PLSCR1 Regulates IFN-λ Signaling and Limits IAV Infection Independent of Its Enzymatic Activity” in the “Results” section.

      Reviewer #2 (Recommendations):

      (1) The statement that type I interferons are expressed by “almost all cells” is inaccurate (line 61). Type I IFN production is also context-dependent and often restricted to specific cell types upon infection or stimulation.

      We apologize for the inaccurate description of the expression pattern of type 1 IFNs and have revised the “Introduction” to describe their restricted cellular sources.

      (2) The antiviral response is assessed solely through flu M gene expression. Incorporating infectious virus titers (e.g., TCID50 or plaque assay) would provide a more robust and direct measure of antiviral activity.

      As requested by the reviewer, we have performed plaque assays on all experiments where flu M gene expression levels were measured (Figure 1G, 5E and 7F, and Supplemental Figure 6E). The plaque assay results are consistent with the flu M gene expression levels.

      (3) While mRNA expression of interferons is measured, protein levels (e.g., through ELISA) should also be quantified to establish the functional relevance of IFN expression changes.

      As requested by the reviewer, we have quantified the protein level of IFN-λ in mouse BAL with ELISA (Figure 2E). The ELISA results are consistent with the mRNA expression of IFN-λ.

      (4) It is unclear whether reduced IFNLR1 expression translates to defective downstream signaling or antiviral responses after IFN-λ treatment in PLSCR1-deficient cells. This is particularly pertinent given the increase in IFN-λ ligand in vivo, which might compensate for receptor downregulation.

      We agree with the reviewer that downstream activation of the IFN-λ signaling pathway is a critical aspect of PLSCR1’s proposed regulatory role. To investigate this, we attempted an ISRE-luciferase reporter assay to assess downstream signaling following IFN-λ treatment in PLSCR1 rescue cells. Unfortunately, the experiment encountered unforeseen technical issues. For additional context, please refer to our response to Reviewer #1’s public review #3.

      (5) Detailed gating strategies for immune cell subsets are absent and should be included for clarity and reproducibility.

      We would like to clarify that the immune cell subsets in BAL fluids were counted manually following cytospin preparation and Diff-Quik staining (Figure 2B and 7H, and Supplemental Figures 2C, 5D, and 8D), rather than by flow cytometry. We hope this resolves the reviewer’s confusion.

      (6) The study does not definitively establish that reduced IFN-λ signaling causes the observed in vivo phenotype. Increased morbidity and mortality in PLSCR1-deficient mice could also stem from elevated TNF-α levels and lung damage, as proinflammatory cytokines and/or enhanced lung damage are known contributors to influenza morbidity and mortality. This point warrants detailed discussions.

      We agree with the reviewer that this study does not definitively establish causality between reduced IFN-λ signaling and the increased morbidity of Plscr1<sup>-/-</sup> mice, and that more experiments are needed to reach that conclusion. We have acknowledged this limitation of our study in the “Discussion”, as requested by the reviewer. We hope to fully eliminate the confounding elements and definitively establish the proposed causality in future studies.

      Reviewer #3 (Public review):

      Summary:

      Yang et al. have investigated the role of PLSCR1, an antiviral interferon-stimulated gene (ISG), in host protection against IAV infection. Although some antiviral effects of PLSCR1 have been described, its full activity remains incompletely understood.

      This study now shows that Plscr1 expression is induced by IAV infection in the respiratory epithelium, and Plscr1 acts to increase Ifn-λr1 expression and enhance IFN-λ signaling possibly through protein-protein interactions on the cell membrane.

      Strengths:

      The study sheds light on the way Ifnlr1 expression is regulated, an area of research where little is known. The study is extensive and well-performed with relevant genetically modified mouse models and tools.

      Weaknesses:

      There are some issues that need to be clarified/corrected in the results and figures as presented.

      Also, the study does not provide much information about the role of PLSCR1 in the regulation of Ifn-λr1 expression and function in immune cells. This would have been a plus.

      We would like to thank the reviewer for the positive feedback and insightful comment regarding the roles of PLSCR1 and IFN-λR1 in immune cells. It is important to note that IFN-λR1 expression is highly restricted in immune cells and is primarily limited to neutrophils and dendritic cells[3]. While dendritic cells were not the focus of this study, we did examine all immune cell subsets in our single cell RNA seq data and performed infection experiments in Plscr1<sup>floxStop</sup>/LysM-Cre<sup>+</sup> mice. We have not observed any significant findings in these populations. On the other hand, we do have some interesting preliminary data suggesting a role for PLSCR1 in regulating Ifn-λr1 expression and function in neutrophils. These findings are discussed in detail in our response to reviewer #3’s recommendation #12.

      Reviewer #3 (Recommendations):

      (1) In Figure 1B, the Plscr1 label should be moved to the y-axis so that readers don't confuse it with the Plscr1-/- mice used in the other figure panels. The fact that WT mice were used should be added in the figure legend.

      We apologize for the confusion in the figures. We have moved the Plscr1 label to the y-axis in Figure 1B and have noted in the figure legend that Wt mice were used.

      (2) In Figure 1C and D, the type of dose leading to the presented data should be added to help the reader. Also, shouldn't statistics be added?

      We appreciate the suggestion and have added doses to Figure 1C and 1D. We are unsure which additional statistics the reviewer is requesting, as two-way ANOVA tests were used to compare weight loss, and the significance is labeled on the figures.

      (3) In Figures 1, F, and G, it is not indicated whether sublethal or lethal dose was used for the IAV infection. This should be very clear in the figure and figure legend.

      We apologize for the confusion regarding the infection doses used in the figures. We have added doses to Figure 1F, 1G and 1H.

      (4) In Figure 1, the CTCF abbreviation should be explained in the Figure legend.

      We have explained CTCF in the figure legend as requested.

      (5) In Figure 2B, this is percentages of what?

      Figure 2B shows the percentages of each immune cell type within total BAL cells.

      (6) In Figures 3A and B, transcriptomes for each condition are from how many mice? Also, what do heatmaps show? Fold induction, differences, etc, and from what? What is compared with what? In addition, is there a discordance between the RNAseq data of Figure 3A and the qPCR data of Fig. 3C in terms of Ifnlr1 expression?

      In Figure 3A and 3C (previously 3B), RNA from the whole lungs of 9 mice per PBS-treated group and 4 mice per IAV-infected group was pooled for transcriptomic analysis. Figure 3A represents a heatmap of differential gene expression, while Figure 3C (previously 3B) represents fold changes in gene expression relative to uninfected controls. In both heatmaps, gene expression values are color-coded from row minimum (blue) to row maximum (red), enabling comparison across groups within each gene (row). The major comparison of interest in these heatmaps is between Wt infected mice and Plscr1<sup>-/-</sup> infected mice. We have added this information to the figure legend.
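      The row-wise color scaling described above (row minimum to row maximum) can be illustrated with a minimal sketch. This is not the authors' plotting pipeline: the function name `row_min_max`, the handling of flat rows, and the expression values are all hypothetical and chosen only for illustration.

```python
def row_min_max(matrix):
    """Rescale each row (gene) to [0, 1]: row minimum -> 0, row maximum -> 1."""
    scaled = []
    for row in matrix:
        lo, hi = min(row), max(row)
        span = hi - lo
        # A flat row has no contrast; mapping it to 0.5 is a choice made here.
        scaled.append([0.5 if span == 0 else (v - lo) / span for v in row])
    return scaled


# Hypothetical values: rows are genes, columns are experimental groups.
example = [
    [2.0, 4.0, 10.0, 6.0],          # low-abundance gene
    [100.0, 300.0, 200.0, 500.0],   # high-abundance gene
]

scaled = row_min_max(example)

if __name__ == "__main__":
    for row in scaled:
        print(row)
```

      Because each gene is scaled independently, colors can be compared across groups within a row but not between rows: a gene with low absolute expression and one with high absolute expression both span the full blue-to-red range.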

      We also acknowledge the reviewer’s observation regarding the discordance between the RNA seq data of Figure 3A and the qPCR data of Figure 3B (previously 3C) for Ifnlr1 expression. To address this, we have repeated the qRT-PCR experiment with additional samples at 7 dpi. In the updated results, Wt mice consistently show significantly higher Ifn-λr1 expression than Plscr1<sup>-/-</sup> infected mice at both 3 dpi and 7 dpi, consistent with the RNA seq data. However, a time-dependent discrepancy between the RNA-seq and qRT-PCR datasets remains: Ifn-λr1 expression continues to increase at 7 dpi in the RNA-seq data (Figure 3A), whereas it declines in the qRT-PCR results (Figure 3B). The reason for this discrepancy remains unclear and has been addressed in the Discussion section.

      (7) In Figure 3D, have the authors checked whether the Ifnlr1 antibody they use is indeed specific for Ifnlr1? Have they used any blocking peptide for the anti-mouse Ifn-λr1 polyclonal antibody they are using? Also, in Figure 3E, the marker used for staining should be indicated in the pictures of the lung section.

      Unfortunately, a blocking peptide is not available for the anti-mouse Ifn-λr1 polyclonal antibody used in our study. To assess antibody specificity, we have performed immunofluorescence staining of Ifn-λr1 on lung tissues from Ifn-λr1<sup>-/-</sup> mice using the same antibody. No signal was detected (Supplemental Figure 5A), supporting the specificity of the antibody for Ifn-λr1.

      As requested by the reviewer, we have added the marker (Ifn-λr1) to the pictures of the lung section in Figure 3E.

      (8) In Figure 5, it's better to move each graph's label that stands to the top (e.g. PLSCR1, IFN-λR1 etc) to the y-axis label so that it doesn't get confused with the mouse -/- label.

      We apologize for the confusion and have moved the top label to the y-axis in Figure 5.

      (9) In Figure 6A, it is claimed that the 'two-dimensional UMAP demonstrated that these main lung cell populations (epithelial, endothelial, mesenchymal, and immune) were dynamic over the course of infection.'. This is not clear by the data. The percentage of cells per cluster should be calculated.

      As requested by the reviewer, the proportion (Supplemental Figure 6A) and cell count (Supplemental Figure 6B) of each cluster have been calculated and included in “PLSCR1 Expression Is Upregulated in the Ciliated Airway Epithelial Compartment of Mice following Flu Infection” under “Results” section. Together with the two-dimensional UMAP (Figure 6A), these data demonstrate that the main lung cell populations (epithelial, endothelial, mesenchymal, and immune) were dynamic over the course of infection. Following infection, many populations emerged, particularly within the immune cell clusters. At the same time, some clusters were initially depleted and later restored, such as microvascular endothelial cells (cluster 2). Other populations, such as interferon-responsive fibroblasts (cluster 20), showed a dramatic yet transient expansion during acute infection and disappeared after infection resolved.

      (10) In Figure 6 B and C, the legend should indicate that these are Violin plots. Also, if AT2 cells don't express Plscr1, does that indicate that in these cells Plscr1 is not needed for IFN-λR1 expression?

      As requested, we have indicated in the legend of Figure 6B and 6C that these are violin plots. Plscr1 is expressed at low levels in AT2 cells. However, it is unclear whether Plscr1 is needed for Ifn-λr1 expression in AT2 cells, and it would be interesting to investigate further.

      (11) In lines 302-304, it is stated that 'Among the various epithelial populations, ciliated epithelial cells not only had the highest aggregated expression of Plscr1, but also were the only epithelial cell population in which significantly more Plscr1 was induced in response to IAV infection.'. Which data/ figure support this statement?

      Figure 6B shows that among the various epithelial populations, ciliated epithelial cells had the highest aggregated expression of Plscr1. To better illustrate this statement, we have rearranged the order of cell clusters from highest to lowest Plscr1 expression, and added red dots to indicate the mean expression levels for each cluster in Figure 6B.

      Ciliated epithelial cells also had the most significant increase in Plscr1 expression (p < 2.22e-16 and p = 6.7e-05) in early IAV infection at 3 dpi (Figure 6C and Supplemental Figure 7A-7K). In comparison, AT1 cells were the only other epithelial cluster to show Plscr1 upregulation at 3 dpi, but to a much lesser extent (p = 0.033, Supplemental Figure 7J). Supplemental Figure 7 was added to better support the statement, and the explanation was added to “PLSCR1 Expression Is Upregulated in the Ciliated Airway Epithelial Compartment of Mice following Flu Infection” under the “Results” section.

      (12) As earlier, if Plscr1 is not expressed in neutrophils (Figure 6F), does that mean IFN-λR1 expression does not require Plscr1 in these cells?

      Although Plscr1 is expressed at lower levels in neutrophils compared to epithelial cells, it is still detectable. In fact, our preliminary data suggest that IFN-λR1 expression in neutrophils is dependent on Plscr1. We have isolated neutrophils from peripheral blood and BAL of IAV-infected Wt and Plscr1<sup>-/-</sup> mice using a mouse neutrophil enrichment kit. Quantitative PCR results showed that Plscr1<sup>-/-</sup> neutrophils exhibit significantly lower expression of Ifn-λr1, alongside elevated levels of Il-1β, Il-6 and Tnf-α in IAV infection (see figures below). These findings suggest that Plscr1 may play an anti-inflammatory role in neutrophils by upregulating Ifn-λr1. These data were not included in the current manuscript because they are beyond the scope of the current study, but we hope to address the role of PLSCR1 in regulating IFN-λR1 expression and function in neutrophils in a future study.

      Author response image 3.

      (13) The Figure 7A legend is not well stated. Something like ' Schematic representation of the experimental design of...' should be included. Also, Figure 7J is not referenced in the text.

      We apologize for the unclear Figure 7A legend and have changed it to “Schematic representation of the experimental design of ciliated epithelial cell conditional Plscr1 KI mice.” Figure 8 (previously Figure 7J) has now been referenced in the text.

      (14) In the Methods, more specific information in some parts should be provided. For example, the clones of the antibodies used should be included.

      Apart from the 10x technology, the kits used and the type of the Illumina sequencing should be provided. Information on how the QC was performed (threshold for reads/cell, detected genes/per cells, and % of mitochondrial genes etc) should be added.

      We apologize for the missing information in the “Methods”. We have now provided the clones of the antibodies used, the kit used to generate single-cell transcriptomic libraries, the type of the Illumina sequencing, and the QC performance data.

      References

      (1) Rusinova, I., et al., Interferome v2.0: an updated database of annotated interferon-regulated genes. Nucleic Acids Res, 2013. 41(Database issue): p. D1040-6.

      (2) Xu, D., et al., PLSCR1 is a cell-autonomous defence factor against SARS-CoV-2 infection. Nature, 2023. 619(7971): p. 819-827.

      (3) Donnelly, R.P., et al., The expanded family of class II cytokines that share the IL-10 receptor-2 (IL-10R2) chain. J Leukoc Biol, 2004. 76(2): p. 314-21.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Here the authors discuss mechanisms of ligand binding and conformational changes in GlnBP (a small E Coli periplasmic binding protein, which binds and carries L-glutamine to the inner membrane ATP-binding cassette (ABC) transporter). The authors have distinguished records in this area and have published seminal works. They include experimentalists and computational scientists. Accordingly, they provide comprehensive, high-quality, experimental and computational work. They observe that apo- and holo- GlnBP does not generate detectable exchange between open and (semi-) closed conformations on timescales between 100 ns and 10 ms. Especially, the ligand binding and conformational changes in GlnBP that they observe are highly correlated. Their analysis of the results indicates a dominant induced-fit mechanism, where the ligand binds GlnBP prior to conformational rearrangements. They then suggest that an approach resembling the one they undertook can be applied to other protein systems where the coupling mechanism of conformational changes and ligand binding. They argue that the intuitive model where ligand binding triggers a functionally relevant conformational change was challenged by structural experiments and MD simulations revealing the existence of unliganded closed or semi-closed states and their dynamic exchange with open unbound conformations, discuss alternative mechanisms that were proposed, their merits and difficulties, concluding that the findings were controversial, which, they suggest is due to insufficient availability of experimental evidence to distinguish them. As to further specific conclusions they draw from their results, they determine that a conformational selection mechanism is incompatible with their results, but induced fit is. They thus propose induced fit as the dominant pathway for GlnBP, further supported by the notion that the open conformation is much more likely to bind substrate than the closed one based on steric arguments. 
Considering the landscape of substrate-free states, in my view, the closed state is likely to be the most stable and, thus most highly populated. As the authors note and I agree that state can be sterically infeasible for a deep-pocketed substrate. As indeed they also underscore, there is likely to be a range of open states. If the populations of certain states are extremely low, they may not be detected by the experimental (or computational) methods. The free energy landscape of the protein can populate all possible states, with the populations determined by their relative energies. In principle, the protein can visit all states. Whether a particular state is observed depends on the time the protein spends in that state. The frequencies, or propensities, of the visits can determine the protein function. As to a specific order of events, in my view, there isn't any. It is a matter of probabilities which depend on the populations (energies) of the states. The open conformation that is likely to bind is the most favorable, permitting substrate access, followed by minor, induced fit conformational changes. However, a key factor is the ligand concentration. Ligand binding requires overcoming barriers to sustain the equilibrium of the unliganded ensemble, thus time. If the population of the state is low, and ligand concentration is high (often the case in in vitro experiments, and high drug dosage scenarios) binding is likely to take place across a range of available states. This is however a personal interpretation of the data. The paper here, which clearly embodies massive careful, and high-quality work, is extensive, making use of a range of experimental approaches, including isothermal titration calorimetry, single-molecule Förster resonance energy transfer, and surface-plasmon resonance spectroscopy. The problem the authors undertake is of fundamental importance.

      Reviewer #2 (Public Review):

      The manuscript by Han et al and Cordes is a tour-de-force effort to distinguish between induced fit and conformational selection in glutamine binding protein (GlnBP). 

      We thank the referee for the recognition of the work and effort that has gone into this manuscript. 

      It is important to say that I don't agree that a decision needs to be made between these two limiting possibilities in the sense that whether a minor population can be observed depends on the experiment and the energy difference between the states. That said, the authors make an important distinction which is that it is not sufficient to observe both states in the ligand-free solution because it is likely that the ligand will not bind to the already closed state. The ligand binds to the open state and the question then is whether the ligand sufficiently changes the energy of the open state to effectively cause it to close. The authors point out that this question requires both a kinetic and a thermodynamic answer. Their "method" combines isothermal titration calorimetry, single-molecule FRET including key results from multi-parameter photon-by-photon hidden Markov modelling (mpH2MM), and SPR. The authors present this "method" of combination of experiments as an approach to definitively differentiate between induced fit and conformational selection. I applaud the rigor with which they perform all of the experiments and agree that others who want to understand the exact mechanism of protein conformational changes connected to ligand binding need to do such a multitude of different experiments to fully characterize the process. However, the situation of GlnBP is somewhat unique in the high affinity for Gln (slow off-rate) as compared to many small molecule binding situations such as enzyme-substrate complexes. It is therefore not surprising that the kinetics result in an induced fit situation.

      For us these comments are an essential part of the conceptual aspects of our work and the resulting research. From a descriptive viewpoint, it is essential for us (and we tried to further highlight and stress this in the updated version of our paper) that IF and CS are two kinetic mechanisms of ligand binding. They imply – if active in a biomolecular system – a temporal order and timescale separation of ligand binding and conformational changes. Since we found many conflicting results for the binding mechanism of GlnBP, but also other SPBs, we decided to assess the situation in GlnBP. 

      In the case of the E-S complexes I am familiar with, the dissociation is much more rapid because the substrate binding affinity is in the micromolar range and therefore the re-equilibration of the apo state is much faster. In this case, the rate of closing and opening doesn't change much whether ligand is present or not. Here, of course, once the ligand is bound the re-equilibration is slow. Therefore, I am not sure if the conclusions based on this single protein are transferrable to most other protein-small molecule systems. 

      We do not argue that our results and interpretations are valid for most other protein-ligand systems, be they enzymes or simple ligand binders. Yet, based on the conservation of ABC-related SBPs and the fact that quite a few of them show sub-µM Kds, we consider it likely that many situations analogous to GlnBP exist, also based on our previous results, e.g., from de Boer et al., eLife (2019).

      I am also not sure if they are transferrable to protein-protein systems where both molecules the ligand and the receptor are expected to have multiscale dynamics that change upon binding.

      As we argue above, the two mechanisms IF/CS imply a clear temporal order and separation of timescales for ligand binding and conformational changes. These mechanisms are simple and extreme cases that we tested before inferring more complex kinetic schemes for the description of ligand binding and conformational changes (which might not be necessary).

      Strengths:

      The authors provide beautiful ITC data and smFRET data to explore the conformational changes that occur upon Gln binding. Figure 3D and Figure 4 provide the really critical data: the multi-parameter photon-by-photon hidden Markov modelling (mpH2MM) data. In the presence of glutamine concentrations near the Kd, two FRET-active sub-populations are identified that appear to interconvert on timescales slower than 10 ms. They then do a whole bunch of control experiments to look for faster dynamics (Figure 5). They also do TIRF smFRET to try to compare their results to those of previous publications. Here, they find several artifacts are occurring including inactivation of ~50% of the proteins. They also perform SPR experiments to measure the association rate of Gln and obtain expectedly rapid association rates on the order of 10<sup>8</sup> M<sup>-1</sup>s<sup>-1</sup>.

      Thank you.  

      Weaknesses:

      Looking at the traces presented in the supplementary figures, one can see that several of the traces have more than one molecule present. The authors should make sure that they use only traces with a single photobleaching event for each fluorophore. One can see steps in some of the green traces that indicate two green fluorophores (likely from 2 different molecules) in the traces. This is one of the frequent problems with TIRF smFRET with proteins, that only some of the spots represent single molecules and the rest need to be filtered out of the analysis.

      We have inspected all TIRF data provided with the manuscript and assume that the referee refers to the data shown in the current Appendix Figure 4/5. We agree that those traces in which no photobleaching occurs could potentially be questioned, yet they would not change our interpretations, and we thus decided to leave the figure as is.

      The NMR experiments that the authors cite are not in disagreement with the work presented here. NMR is capable of detecting "invisible states" that occur in 1-5% of the population. SmFRET is not capable of detecting these very minor states. I am quite sure that if NMR spectroscopists could add very high concentrations of Gln they would also see a conversion to the closed population.

      We agree with the referee that NMR is capable of detecting invisible states that occur in 1-5% of the population (see e.g., the paper cited in our manuscript by Tang, C et al., Open-to-closed transition in apo maltose-binding protein observed by paramagnetic NMR. Nature 2007, 449, 1078). Yet, we see a strong disagreement between our work and papers on GlnBP, where a combination of NMR, FRET and MD was used (Feng, Y. et al., Conformational Dynamics of apo‐GlnBP Revealed by Experimental and Computational Analysis. Angewandte Chemie 2016, 55, 13990; Zhang, L. et al., Ligand-bound glutamine binding protein assumes multiple metastable binding sites with different binding affinities. Communications biology 2020, 3, 1). These inconsistencies were also noted by others in the field (Kooshapur, H. et al., NMR Analysis of Apo Glutamine‐Binding Protein Exposes Challenges in the Study of Interdomain Dynamics. Angewandte Chemie 2019, 58, 16899) and we reemphasize that this latest NMR publication comes to similar conclusions as we present in our manuscript.   

      Reviewer #1 (Recommendations For The Authors):

      The paper embodies massive careful and high-quality work, and is extensive, making use of a range of experimental approaches, including isothermal titration calorimetry, single-molecule Förster resonance energy transfer, and surface-plasmon resonance spectroscopy. Considering this extensiveness, I do not see what more the authors can do.

      We very much appreciate the assessment and positive comments of the referee, but still tried to incorporate simulation data to support our interpretations.

      Reviewer #2 (Recommendations For The Authors):

      (1) Looking at the traces presented in the supplementary figures, one can see that several of the traces have more than one molecule present. The authors should make sure that they use only traces with a single photobleaching event for each fluorophore. One can see steps in some of the green traces that indicate two green fluorophores (likely from 2 different molecules) in the traces. This is one of the frequent problems with TIRF smFRET with proteins, that only some of the spots represent single molecules and the rest need to be filtered out of the analysis.

      See response above for iteration of TIRF data selection and analysis.

      (2) The NMR experiments that the authors cite are not in disagreement with the work presented here. NMR is capable of detecting "invisible states" that occur in 1-5% of the population. SmFRET is not capable of detecting these very minor states. I am quite sure that if NMR spectroscopists could add very high concentrations of Gln they would also see a conversion to the closed population.

      See response above.

      Minor point:

      (1) It is difficult to see what is going on between apo and holo in Figure 1B. Could the authors make Figure 1a, 1b apo, and 1b holo in the same orientation (by aligning D2 or D1 to each other in all figures) so one can see which helices are in the same place and which have moved?

      We respectfully disagree and decided to keep this figure as it is.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This study focuses on the bacterial metabolite TMA, generated from dietary choline. These authors and others have previously generated foundational knowledge about the TMA metabolite TMAO, and its role in metabolic disease. This study extends those findings to test whether TMAO's precursor, TMA, and its receptor TAAR5 are also involved and necessary for some of these metabolic phenotypes. They find that mice lacking the host TMA receptor (Taar5-/-) have altered circadian rhythms in gene expression, metabolic hormones, gut microbiome composition, and olfactory and innate behavior. In parallel, mice lacking bacterial TMA production or host TMA oxidation have altered circadian rhythms.

      Strengths:

      These authors use state-of-the-art bacterial and murine genetics to dissect the roles of TMA, TMAO, and their receptor in various metabolic outcomes (primarily measuring plasma and tissue cytokine/gene expression). They also follow a unique and unexpected behavioral/olfactory phenotype. Statistics are impeccable.

      Weaknesses:

      Enthusiasm for the manuscript is dampened by some ambiguous writing and the presentation of ideas in the introduction, both of which could easily be improved upon revision.

      We apologize for the abbreviated and ambiguous writing style in our original submission. Given Reviewer 2 also suggested reorganizing and rewriting certain parts, we have spent time to remove ambiguity by adding additional points of clarification and adding more historical context to justify studying TMA-TAAR5 signaling in regulating host circadian rhythms. We have also reorganized the presentation of data aligned with this.

      Reviewer #2 (Public review):

      Summary:

      In the manuscript by Mahen et al., entitled "Gut Microbe-Derived Trimethylamine Shapes Circadian Rhythms Through the Host Receptor TAAR5," the authors investigate the interplay between a host G protein-coupled receptor (TAAR5), the gut microbiota-derived metabolite trimethylamine (TMA), and the host circadian system. Using a combination of genetically engineered mouse and bacterial models, the study demonstrates a link between microbial signaling and circadian regulation, particularly through effects observed in the olfactory system. Overall, this manuscript presents a novel and valuable contribution to our understanding of host-microbe interactions and circadian biology. However, several sections would benefit from improved clarity, organization, and mechanistic depth to fully support the authors' conclusions.

      Strengths:

      (1) The manuscript addresses an important and timely topic in host-microbe communication and circadian biology.

      (2) The studies employ multiple complementary models, e.g., Taar5 knockout mice, microbial mutants, which enhance the depth of the investigation.

      (3) The integration of behavioral, hormonal, microbial, and transcript-level data provides a multifaceted view of the observed phenotype.

      (4) The identification of olfactory-linked circadian changes in the context of gut microbes adds a novel perspective to the field.

      Weaknesses:

      While the manuscript presents compelling data, several weaknesses limit the clarity and strength of the conclusions.

      (1) The presentation of hormonal, cytokine, behavioral, and microbiome data would benefit from clearer organization, more detailed descriptions, and functional grouping to aid interpretation.

      We appreciate this comment and have reorganized the data to improve functional grouping and readability. We have also added additional detail to descriptions of the data in the revised figure legends and results.

      (2) Some transitions, particularly from behavioral to microbiome data, are abrupt and would benefit from better contextual framing.

      We agree with this comment, and have added additional language to provide smoother transitions. This in many cases brings in historical context of why we focused on both behavioral and microbiome alterations in this body of work.

      (3) The microbial rhythmicity analyses lack detail on methods and visualization, and the sequencing metadata (e.g., sample type, sex, method) are not clearly stated.

      We apologize for this, and have now added more detail in our methods, figures, and figure legends to ensure the reader can easily understand sample type, sex, and the methods used. 

      (4) Several figures are difficult to interpret due to dense layouts or vague legends, and key metabolites and gene expression comparisons are either underexplained or not consistently assessed across models.

      In line with the previous comment, we have now added more detail in our methods, figures, and figure legends to provide clear information. We have also provided additional data showing the same key metabolites, hormones, and gene expression alterations in each model where the same endpoints were measured.

      (5) Finally, while the authors suggest a causal role for TAAR5 and its ligand in circadian regulation, the current data remain correlative; mechanistic experiments or stronger disclaimers are needed to support these claims.

      We agree with this comment, and as a result have removed any language causally linking TMA and TAAR5 together in circadian regulation. Instead, we only state the findings in each model and refrain from overinterpreting.

      Reviewer #3 (Public review):

      Summary:

      Deletion of the TMA-sensor TAAR5 results in circadian alterations in gene expression, particularly in the olfactory bulb, plasma hormones, and neurobehaviors.

      Strengths:

      Genetic background was rigorously controlled.

      Comprehensive characterization.

      Weaknesses:

      The weaknesses identified by this reviewer are minor.

      Overall, the studies are very nicely done. However, despite careful experimentation, I note that even the controls vary considerably in their gene expression, etc., across time (e.g., compare control graphs for Cry1 in 1B, 4B). It makes me wonder how inherently noisy these measurements are. While I think that the overall point that the Taar5 KO shows circadian changes is robust, future studies to dissect which changes are reproducible over the noise would be helpful.

      We thank the reviewer for this insightful comment. We completely agree that there are clear differences between the circadian data from experiments in Taar5<sup>-/-</sup> mice and those from gnotobiotic mice in which we have genetically deleted CutC. Although the data from Taar5<sup>-/-</sup> mice show robust circadian rhythms, the data from mice in which microbial CutC is altered have inherently more “noise”. We attribute some of this to the fact that the mice in the Taar5<sup>-/-</sup> experiment have a fully intact and diverse gut microbiome, whereas the gnotobiotic study with CutC manipulation includes only a 6-member microbiome community that does not represent the normal microbiome diversity in the gut. This defined synthetic community was used as a rigorous reductionist approach, but likely affected the normal interactions between a complex intact gut microbiome and host circadian rhythms. We have added some additional discussion to indicate this in the limitations section of the manuscript.

      Impact:

      These data add to the growing literature pointing to a role for the TMA/TMAO pathway in olfaction and neurobehavior.

      Reviewer #1 (Recommendations for the authors):

      I suggest a revision of the writing and organization. The potential impact of the study after reading the introduction is unclear. One example, in the intro: "TMAO levels are associated with many human diseases including diverse forms of CVD<sup>5-12</sup>, obesity<sup>13,14</sup>, type 2 diabetes<sup>15,16</sup>, chronic kidney disease (CKD)<sup>17,18</sup>, neurodegenerative conditions including Parkinson's and Alzheimer's disease<sup>19,20</sup>, and several cancers<sup>21,22</sup>." It would be helpful to explain how the previous literature has distinguished that the driver of these phenotypes is TMA/TMAO and not increased choline intake. Basically, for a TMA/O novice reader, a more detailed intro would be helpful.

      We appreciate this insightful comment and have now provided a more expansive historical context for the reader regarding the effects of choline consumption (which impacts many things, including choline, acetylcholine, phosphatidylcholine, TMA, TMAO, etc) versus the primary effects of TMA and TMAO.

      There were also many uses of vague language (regulation/impact/etc). Directionality would be super helpful.

      We thank the reviewer for this recommendation and have improved the language as suggested to show the directionality of our findings. The terms regulation, impact, shape, etc. are used only when we describe multiple variables changing at the same time over the course of a 24-hour circadian period (some increased and some decreased).

      Reviewer #2 (Recommendations for the authors):

      In the manuscript by Mahen et al., entitled "Gut Microbe-Derived Trimethylamine Shapes Circadian Rhythms Through the Host Receptor TAAR5," the authors investigate the interplay between a host G protein-coupled receptor (TAAR5), the gut microbiota-derived metabolite trimethylamine (TMA), and the host circadian system. Using a combination of genetically engineered mouse and bacterial models, the study demonstrates a link between microbial signaling and circadian regulation, particularly through effects observed in the olfactory system. Overall, this manuscript presents a novel and valuable contribution to our understanding of host-microbe interactions and circadian biology. However, several sections would benefit from improved clarity, organization, and mechanistic depth to fully support the authors' conclusions. Below are specific major and minor suggestions intended to enhance the presentation and interpretation of the data.

      Major suggestions:

      (1) Consider adding a schematic/model figure as Panel A early in the manuscript to help readers understand the experimental conditions and major comparisons being made.

      We thank the reviewer for this recommendation and have added a graphical abstract figure to help the reader understand the major comparisons being made. 

      (2) Could the authors present body weight and food intake characteristics in Taar5 KO vs. WT animals?

      We have added body weight data as requested in Figure 1, Figure supplement 1. Although we have not stressed these mice with a high fat diet for these behavioral studies, under chow-fed conditions studied here we did not find any significant differences in body weight. Given no difference in body weight, we did not collect data on food consumption and have mentioned this as a limitation in the discussion.  

      (3) Several figures, especially Figures 3 and 4, and Supplemental Figures, would benefit from more structured organization and expanded legends. Grouping related data into thematic panels (e.g., satiety vs. appetite hormones, behavioral domains) may help improve readability.

      We appreciate the reviewer’s thoughtful comments and agree that reorganization would improve clarity. We have reorganized figures to improve clarity and have expanded the figure legends to provide more detail on experimental methods. 

      (4) Clarify and expand the description of hormonal and cytokine changes. For instance, the phrase "altered rhythmic levels" is vague - do the authors mean dampened, phase-shifted, enhanced, etc., relative to WT controls?

      Given a similar suggestion was made by Reviewer 1, we have provided more precise language focused on directionality and which specific endpoints we are referring to. For anything looking at circadian rhythms, the revised manuscript includes specific indications when we are discussing mesor, amplitude, and acrophase alterations. The terms regulation, impact, shape etc. are used only when we describe multiple complex variables changing at the same time over the time course of a 24-hour circadian period (some increased and some decreased).

      (5) Consider grouping hormones and cytokines functionally (e.g., satiety vs. appetite-stimulating, pro- vs. anti-inflammatory) to better interpret how these changes relate to the KO phenotype.

      We thank the reviewer for this recommendation, and have re-organized figure panels to reflect this.

      (6) Please provide a more detailed description of the behavioral results, particularly those in Supplemental Figure 2.

      We have both expanded the methods description in the revised figure legends and added a more detailed description of the behavioral results.

      (7) As with hormonal data, behavioral outcomes would be easier to follow if organized thematically (e.g., locomotor activity, anxiety-like behavior, circadian-related behavior), especially for readers less familiar with behavioral assays.

      We appreciate this reviewer’s comment and agree that we can better group our data to show how each test is associated with the type of behavior it assesses. As a result, we have reorganized the behavioral data into broad categories such as olfactory-related, innate, cognitive, depressive/anxiety-like, or social behaviors. We have also added new data in each of these behavioral categories to provide a more comprehensive understanding of behavioral alterations seen in Taar5<sup>-/-</sup> mice.

      (8) The following statement needs clarification: "Also, it is important to note that many behavioral phenotypes examined, including tests not shown, were unaltered in Taar5-/- mice (Figures S2G, S2H, and S2I)." Consider rephrasing to explicitly state the intended message: are the authors emphasizing a lack of behavioral phenotype, or highlighting specific unaltered aspects?

      We apologize for this confusing statement, and have changed the verbiage to improve readability. To expand the comprehensive nature of this study, we also now include the tests that were “not shown” in the original submission to provide a more comprehensive understanding of behavioral alterations seen in Taar5<sup>-/-</sup> mice. These new data are included as 6 different figure supplements to main Figure 2.

      (9) The transition from behavior to microbiome data feels abrupt. Can the authors better explain whether the behavioral changes are thought to result from gut microbial function, independent of TMA-Taar5 signaling?

      We apologize for the poor transitions in our writing. We have now explained the previous findings linking the TMA pathway to circadian reorganization of the gut microbiome (mostly from our original paper, Schugar R, et al. 2022, eLife) and how this correlates with behavioral phenotypes. At this point it is difficult to know whether the microbiome changes are driving the behavioral changes or, conversely, whether central TAAR5 signaling is altering oscillations in the gut microbiome; we present our findings here as a framework for follow-up studies to address these questions more precisely. It is important to note that our experiment using defined-community gnotobiotic mice with or without the capacity to produce TMA (i.e. CutC-null community) clearly shows that microbial TMA production can impact host circadian rhythms in the olfactory bulb. Additional experiments beyond the scope of this work will be required to test which phenotypes originate from TMA-TAAR5 signaling versus broader effects of the restructured gut microbiome.

      (10) For Figure 3A, please expand the microbiome results with more granularity:

      (a) Indicate in the Results section whether the sequencing method was 16S amplicon or metagenomic.

      Sequencing was done using 16S rRNA amplicon sequencing using methods published by our group (PMID: 36417437, PMID: 35448550).

      (b) State whether samples were from males, females, or a mix. 

      We have indicated that all mice from Figure 1 were male mice in the revised figure legend.

      (c) Clarify whether beta diversity is based on phylogenetic or non-phylogenetic metrics. Consider using both types if not already done.

      Beta diversity was analyzed using the Bray-Curtis dissimilarity index as the metric. Details have been included in the methods section.
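
      For readers unfamiliar with the metric, Bray-Curtis dissimilarity is an abundance-based, non-phylogenetic measure. The sketch below is illustrative only: the function name and the taxon counts are hypothetical and are not drawn from the study's sequencing data or analysis pipeline.

```python
def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two abundance vectors.

    Computed as sum(|a_i - b_i|) / sum(a_i + b_i); 0 means identical
    composition and 1 means no shared taxa.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("abundance vectors must have the same length")
    numerator = sum(abs(a - b) for a, b in zip(counts_a, counts_b))
    denominator = sum(a + b for a, b in zip(counts_a, counts_b))
    if denominator == 0:
        # Two empty samples: treat as identical by convention (an assumption).
        return 0.0
    return numerator / denominator


# Hypothetical read counts for three taxa in two samples.
sample_a = [10, 0, 5]
sample_b = [5, 5, 5]
dissimilarity = bray_curtis(sample_a, sample_b)
print(f"Bray-Curtis dissimilarity: {dissimilarity:.3f}")  # 10 / 30 = 0.333
```

      Because the calculation uses only per-taxon abundances, it carries no information about evolutionary relatedness between taxa, which is the distinction the reviewer's question turns on.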

      (d) Make lines partially transparent in the Beta-diversity plot so that individual points are visible.

      We have now updated the Beta-diversity plot with individual points visualized.

      (e) Clarify what percentage of variation in the Beta-diversity plot is explained by CCA1, and whether this low percentage suggests minimal community-level differences.

      We have updated the Beta-diversity plot to include the R<sup>2</sup> and p-values associated with these data.

      (f) Confirm if the y-axis on the Beta-diversity plot should be labeled CCA2 rather than "CCAA 1".

      We appreciate this comment, given that it identified a typographical error in the plot. The revised figure now includes the proper label of CCA2 instead of CCAA 1.

      (11) For Figure 3B:

      (a) Provide a description of the taxonomy plot in the results.

      We have added a description of the taxonomy plot in the revised results section.

      (b) Add phylum-level labels and enlarge the legend to improve the readability of genus-level data.

      We agree that this is a good suggestion, so we have enlarged the legend for the genus-level data and have also added phylum-level plots in the revised manuscript in Figure 3, figure supplement 1.

      (12) Rhythmicity of the microbiome is central to the manuscript. The current approach of comparing relative abundance at discrete time points is limiting.

      We thank the reviewer for this comment. We agree that discrete time points are not enough to describe circadian rhythmicity. In addition to comparing genotypes at discrete time points, we also used a rigorous cosinor analysis to plot the data over a 24-hour time period, and those differences are shown in the figure itself as well as in Table 1.

      (a) Please describe how rhythmicity was determined, e.g., what data or statistical method supports the statement: "Taar5-/- mice showed loss of the normal rhythmicity for Dubosiella and Odoribacter genera yet gained in amplitude of rhythmicity for Bacteroides genera (Figure 3 and S3)."

      We appreciate this reviewer comment. Rhythmicity was determined using a cosinor analysis by use of an R program. Cosinor analysis is a statistical method used to model and analyze rhythmic patterns in time-series data, typically assuming a sinusoidal (cosine) shape. It estimates key parameters like mesor (mean level), amplitude (height of oscillation), and acrophase (timing of the peak), making it especially useful in fields like chronobiology and circadian rhythm research. We have used this in previous research to describe circadian rhythms. We do plan to improve language considering directionality of these circadian changes. 
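
      The cosinor parameters described above can be estimated by ordinary least squares. The following is a minimal single-component sketch in Python, offered as an illustration rather than the R program used in the study: the function names, the 24-hour period, the sampling times, and the synthetic data are all assumptions, and the sketch does not include the significance test for rhythmicity.

```python
import math


def solve_3x3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [b] for row, b in zip(matrix, rhs)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        partial = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - partial) / m[i][i]
    return x


def cosinor_fit(times_h, values, period_h=24.0):
    """Fit y = M + beta*cos(wt) + gamma*sin(wt) and return (mesor, amplitude, peak_time_h)."""
    omega = 2.0 * math.pi / period_h
    rows = [(1.0, math.cos(omega * t), math.sin(omega * t)) for t in times_h]
    # Normal equations: (X^T X) b = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(3)]
    mesor, beta, gamma = solve_3x3(xtx, xty)
    amplitude = math.hypot(beta, gamma)
    # Acrophase expressed as the clock time (hours) at which the fitted curve peaks.
    peak_time_h = (math.atan2(gamma, beta) / omega) % period_h
    return mesor, amplitude, peak_time_h


# Synthetic, noise-free data (hypothetical): mesor 10, amplitude 3, peak at 6 h.
times = [0, 4, 8, 12, 16, 20]
data = [10 + 3 * math.cos(2 * math.pi * (t - 6) / 24) for t in times]
mesor, amplitude, peak_time = cosinor_fit(times, data)
print(f"mesor={mesor:.2f}, amplitude={amplitude:.2f}, peak at {peak_time:.2f} h")
```

      Reporting the acrophase as a peak time in hours sidesteps the differing sign conventions for the phase angle found across cosinor implementations.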

      (b) Supplemental Figure S3 needs reorganization to highlight key findings. It's not currently clear how taxa are arranged or what trends are being shown.

      The data in Figure S3 show the entire 24-hour time course of the cecal taxa that were significantly altered for at least one time point between Taar5<sup>+/+</sup> and Taar5<sup>-/-</sup> mice. Given that we showed time point-specific alterations in the Main Figure 3, we thought these more expansive plots would be important to depict how the circadian rhythms were altered.

      (c) Supplemental Table 1, which includes 16S features, should be referenced and discussed in the microbiome section.

      We have now referenced and discussed Supplemental Table 1 which includes all cosinor statistics for microbiome and other data presented in circadian time point studies.

      (13) Did the authors quantify the 16S rRNA gene via RT-PCR to determine if this was similar between KO and WT over the 24-hour period?

      We did not quantify 16S rRNA gene via RT-PCR, but do not think adding this will change our overall interpretations.

      (14) Reorganize Figure 4 to align with the order of results discussed, starting with TMA and TMAO, followed by related metabolites like choline, L-carnitine, and gamma-butyrobetaine.

      We thank the reviewer for this comment. We have chosen this organization because it is ordered from substrates (choline, L-carnitine, and betaine) to the microbe-associated products (TMA then TMAO). We will improve the writing associated with this figure to clearly explain this organization.

      (a) Although the changes in the latter metabolites are more modest, they may still have physiological relevance. Could the authors comment on their significance?

      We appreciate this reviewer comment and agree. We have expanded the results and discussion to address this.

      (15) The authors note similarities in circadian gene expression between Taar5 KO mice and Clostridium sporogenes WT vs. ΔcutC mice, but the gene patterns are not consistent.

      (a) Can the authors clarify what conclusions can reasonably be drawn from this comparison?

      We hesitate to make definitive conclusions in the manuscript on why the gene patterns are not consistent, because it would be speculation. However, one major factor likely driving the differences is the diversity of the gut microbiome in the different studies. For instance, in the studies using Taar5<sup>+/+</sup> and Taar5<sup>-/-</sup> mice there is a very diverse microbiome in these conventionally housed mice. In contrast, by design the experiment using Clostridium sporogenes WT vs. ΔcutC communities is a reductionist approach that allows us to genetically define TMA production. In these gnotobiotic mice, the simplified community has very limited diversity and this likely alters the host circadian rhythms in gene expression quite dramatically. Although it is impossible to directly compare the results between these experiments given the difference in microbiome diversity, there are clearly alterations in host gene expression when we manipulate TMA production (i.e. ΔcutC community) or TMA sensing (i.e. Taar5<sup>-/-</sup>).

      (16) Were circadian and metabolic genes (e.g., Arntl, Cry1, Per2, Pemt, Pdk4) also analyzed in brown adipose tissue of Taar5 KO mice, and how do these results compare to the Clostridium models?

      We thank the reviewer for this comment. Unfortunately, we did not collect brown adipose tissue in our original Taar5 study. We plan to do this in future follow-up studies of cold-induced thermogenesis that are beyond the scope of this manuscript. However, we have decided to include data from our two-time-point Taar5 study, which looks at ZT2 (9 am) and ZT14 (9 pm). There are clear differences in circadian genes between these time points.

      (17) To allow a more direct comparison, please ensure the same cytokines (e.g., IL-1β, IL-2, TNF-α, IFN-γ, IL6, IL-33) are reported for both the Taar5 KO and microbial models.

      We thank the reviewer for this comment and now include data from the same cytokines for each study.

      (18) What was the defined microbial community used to colonize germ-free mice with C. sporogenes strains? Did this community exhibit oscillatory behavior?

      To define TMA levels using a genetically-tractable model of a defined microbial community, we leveraged access to the community originally described by our collaborator Dr. Federico Rey (University of Wisconsin – Madison) (PMID: 25784704). We chose this community because it provides some functional metabolic diversity and is well known to allow for sufficient versus deficient TMA production. We are thankful for the reviewer comments about oscillatory behavior of this defined community, and to be responsive have performed sequencing to detect the species over time. These data are now included in the revised manuscript and show that there are clear differences in the oscillatory behavior of the defined community members. These data provide additional support that bacterial TMA production not only alters host circadian rhythms, but also the rhythmic behavior of gut bacteria themselves, which has never been described before.

      (19) Can the authors explain the rationale for measuring additional metabolites such as tryptophan, indole acetic acid, phenylacetic acid, and phenylacetylglycine? How are these linked to CutC gene function or Taar5 signaling?

      We appreciate that this could be confusing, but have included other gut microbial metabolites to be as comprehensive as possible. This is important to include because we have found in other gnotobiotic studies where we have genetically altered metabolite production, if we alter one gut microbe-derived metabolite there can be unexpected alterations in other distinct classes of microbe-derived metabolites (PMID: 37352836). This is likely due to the fact that complex microbe-microbe and microbe-host interactions work together to define systemic levels of circulating metabolites, influencing both the production and turnover of distinct and unrelated metabolites.

      (20) The authors make several strong claims suggesting that loss of Taar5 or disruption of its ligand directly alters the circadian gene network. However, the current data are correlative. The authors should clarify that these findings demonstrate associations rather than direct causal effects, unless additional mechanistic evidence is provided. Approaches such as studies conducted in constant darkness, measurements of wheel-running behavior, or analyses that control for potential confounding factors, e.g., inflammation or metabolic disruption, would help establish whether the observed changes in clock gene expression are primary or secondary effects. The authors are encouraged to either soften these causal claims or acknowledge this limitation explicitly in the discussion.

      We thank the reviewer for this comment. We agree and have softened our language about direct effects of TMA via TAAR5 because we agree the data presented here are correlative only. 

      Minor suggestions:

      (1) Avoid repetitive phrases such as "it is important to note..." for improved flow. Rephrasing these instances will enhance readability.

      We thank the reviewer for this suggestion and have deleted such repetitive phrases.  

      (2) For Figure 2, remove interpretations above the graphs and use simple, descriptive panel labels, similar to those in Supplemental Figure 2.

      We have removed these interpretations as suggested, but have retained descriptive panel labels to help the reader understand what type of data are being presented.

      Reviewer #3 (Recommendations for the authors):

      Minor:

      In Figure 1D, UCP1 does not appear to be significantly changed.

      We thank the reviewer for this comment and agree that UCP1 gene expression is not significantly altered. However, given the key role that UCP1 plays in white adipose tissue beiging, which is suppressed by the TMAO pathway, we think it is critical to show that this effect appears unaffected by perturbed TMA-TAAR5 signaling.

      It would be helpful, in the discussion, to summarize any consistent changes across Taar5 KO, CutC deletion, and FMO3 deletion.

      We have added this to the discussion, but as discussed above we hesitate to make strong interpretations about consistency between the models because the microbiome diversity is so different between the studies, and we did not measure all endpoints in both models.

      For the Cosinor analysis, it may be helpful to remove the p-values that are >0.05 from the figures.

      We have now removed any non-significant p-values that are associated with our figures. 

      For Figure 2, Supplement 1E, what are the two bars for each genotype?

      We appreciate the reviewer pointing this out and will further explain this test in the figure with labels and in the legend.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Editor's comments:

      I would encourage you to submit a revised version that addresses the following two points:

      [a] The point from Reviewer #1 about a possible major confounding factor. The following article might be germane here: Baas and Fennell, 2019: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3339568

      I don’t believe that the point raised by reviewer 1 is a confounder, see my response below.

      The highlighted article was on my reading list, but I did not cite it because I was confused by its methods.

      [b] The point from Reviewer #4 about the abstract. It is important that the abstract says something about how reviewers reacted to the original versions of articles in which they were cited (ie, the odds ratio = 0.84, etc result), before going on to discuss how they reacted to revised articles (ie, the odds ratio = 1.61, etc result). I would suggest doing this along the following lines - but please feel free to reword the passage "but this effect was not strong/conclusive":

      When reviewers were cited in the original version of the article under review, they were less likely to approve the article compared with reviewers who were not cited, but this effect was not strong/conclusive (odds ratio = 0.84; adjusted 99.4% CI: 0.69-1.03). However, when reviewers were cited in the revised version of the article, they were more likely to approve compared with reviewers who were not cited (odds ratio = 1.61; adjusted 99.4% CI: 1.16-2.23).

      I have changed the abstract to include the odds ratios for version 1 and have used the same wording as from the main text.
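
      As a quick consistency check on the two intervals quoted above, the following is a minimal sketch. It assumes that each adjusted 99.4% interval is symmetric on the log-odds scale (as it would be for a Wald-type interval from a logistic-type model); the article's actual estimation method is not restated here, and the function name `log_scale_summary` is hypothetical. Under that assumption, the geometric midpoint of each interval should sit close to the reported odds ratio, and the implied standard error of the log odds ratio can be recovered from the interval width.

```python
import math
from statistics import NormalDist


def log_scale_summary(odds_ratio, lower, upper, level=0.994):
    """Summarise a reported odds-ratio interval.

    Assumption (not taken from the article): the interval is symmetric
    around the estimate on the log scale, i.e. exp(log(OR) +/- z * SE).
    """
    # Two-sided critical value of the standard normal for the given level.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    log_lower, log_upper = math.log(lower), math.log(upper)
    return {
        "reported": odds_ratio,
        "z": z,
        # Half the log-scale width divided by z gives the implied SE.
        "se_log_or": (log_upper - log_lower) / (2 * z),
        # Geometric midpoint; should be near the reported odds ratio.
        "midpoint": math.exp((log_lower + log_upper) / 2),
        # True when the interval does not contain an odds ratio of 1.
        "excludes_one": lower > 1 or upper < 1,
    }


version1 = log_scale_summary(0.84, 0.69, 1.03)  # reviewer cited in version 1
version2 = log_scale_summary(1.61, 1.16, 2.23)  # reviewer cited in revised version

for label, result in (("version 1", version1), ("revised", version2)):
    print(label, {k: round(v, 3) if isinstance(v, float) else v
                  for k, v in result.items()})
```

      Under the stated assumption, the version 1 interval includes 1 while the revised-version interval does not, which matches the "not strong/conclusive" versus "more likely to approve" wording suggested for the abstract.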

      Reviewer #1 (Public review):

      Summary:

      The work used open peer reviews and followed them through a succession of reviews and author revisions. It assessed whether a reviewer had requested the author include additional citations and references to the reviewers' work. It then assessed whether the author had followed these suggestions and what the probability of acceptance was based on the author's decision. Reviewers who were cited were more likely to recommend the article for publication when compared with reviewers that were not cited. Reviewers who requested and received a citation were much more likely to accept than reviewers that requested and did not receive a citation.

      Strengths and weaknesses:

      The work's strengths are the in-depth and thorough statistical analysis it contains and the very large dataset it uses. The methods are robust and reported in detail.

      I am still concerned that there is a major confounding factor: if you ignore the reviewers' requests for citations are you more likely to have ignored all their other suggestions too? This has now been mentioned briefly and slightly circuitously in the limitations section. I would still like this (I think) major limitation to be given more consideration and discussion, although I am happy that it cannot be addressed directly in the analysis.

      This is likely to happen, but I do not think it’s a confounder. A confounder needs to be associated with both the outcome and the exposure of interest. If we consider forthright authors who are more likely to rebuff all suggestions, then they would receive just as many citation and self-citation requests as authors who were more compliant. The behaviour of forthright authors would likely only reduce the association seen in most authors, and this reduction would be reflected in the odds ratios.

      Reviewer #2 (Public review):

      Summary:

      This article examines reviewer coercion in the form of requesting citations to the reviewer's own work as a possible trade for acceptance and shows that, under certain conditions, this happens.

      Strengths:

      The methods are well done and the results support the conclusions that some reviewers "request" self-citations and may be making acceptance decisions based on whether an author fulfills that request.

      Weakness:

      I thank the author for addressing my comments about the original version.

      Reviewer #3 (Public review):

      Summary:

      In this article, Barnett examines a pressing question regarding citing behavior of authors during the peer review process. In particular, the author studies the interaction between reviewers and authors, focusing on the odds of acceptance, and how this may be affected by whether or not the authors cited the reviewers' prior work, whether the reviewer requested such citations be added, and whether the authors complied/how that affected the reviewer decision-making.

      Strengths:

      The author uses a clever analytical design, examining four journals that use the same open peer review system, in which the identities of the authors and reviewers are both available and linkable to structured data. Categorical information about the approval is also available as structured data. This design allows a large-scale investigation of this question.

      Weaknesses:

      My original concerns have been largely addressed. Much more detail is provided about the number of documents under consideration for each analysis, which clarifies a great deal.

      Much of the observed reviewer behavior disappears or has much lower effect sizes depending on whether "Accept with Reservations" is considered an Accept or a Reject. This is acknowledged in the results text. Language has been toned down in the revised version.

      The conditional analysis on the 441 reviews (lines 224-228) does support the revised interpretation as presented.

      No additional concerns are noted.

      Reviewer #4 (Public review):

      Summary:

      This work investigates whether a citation to a referee made by a paper is associated with a more positive evaluation by that referee for that paper. It provides evidence supporting this hypothesis. The work also investigates the role of self-citations by referees where the referee would ask authors to cite the referee's paper.

      Strengths:

      This is an important problem: referees for scientific papers must provide their impartial opinions rooted in core scientific principles. Any undue influence due to the role of citations breaks this requirement. This work studies the possible presence and extent of this.

      The methods are solid and well done. The work uses a matched pair design which controls for article-level confounding and further investigates robustness to other potential confounds.

      Weaknesses:

      The authors have addressed most concerns in the initial review. The only remaining concern is the asymmetric reporting and highlighting of version 1 (null result) versus version 2 (rejecting null). For example the abstract says "We find that reviewers who were cited in the article under review were more likely to recommend approval, but only after the first version (odds ratio = 1.61; adjusted 99.4% CI: 1.16 to 2.23)" instead of a symmetric sentence "We find ... in version 1 and ... in version 2".

      The latest version now includes the results for both versions.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #2 (Public review):

      (1) Why would BPS not reduce RLS in WT cells? The authors could test whether OE of FIT2 reduces RLS in WT cells.  

      Our data indicate that the iron regulon gets turned on naturally in old cells, presumably due to reduced iron sensing, limiting their lifespan. Although we haven’t tested it experimentally, BPS would presumably also turn on the iron regulon in wild-type cells and would therefore have a redundant effect with the activation of the iron regulon that occurs naturally during normal aging. It may be interesting in the future to see if higher levels of BPS can shorten the lifespan of wild-type cells. Similarly, we would predict that overexpression of FIT2 may reduce the lifespan, as its deletion has been shown to extend RLS.

      (2) The authors should add a brief explanation for why the GDP1 promoter was chosen for Ssd1 OE.

      We used the same promoter that was used to overexpress Ssd1 in all previous studies. This is now stated in the text along with the relevant citations. 

      (3) On page 12, growth to saturation was described as glucose starvation. This is more accurately described as nutrient deprivation. Referring to it as glucose starvation is akin to CR, which growing to saturation is not. Ssd1 OE formed condensates upon saturation but not in CR. Why do the authors think Ssd1 OE did not form condensates upon CR?

      Too mild a stress?

      This is a fair comment, and we have now changed glucose starvation to nutrient deprivation, as it is more accurate. The effects of nutrient starvation are profound: the cell cycle stops, autophagy is induced, cells undergo the diauxic shift, and metabolism changes. None of these changes occur during calorie restriction (0.05% glucose), so it is not too surprising that Ssd1 does not form condensates during CR. We speculate that the stress is just too mild.

      (4) The authors conclude that the main mechanism for RLS extension in CR and Ssd1 OE is the inhibition of the iron regulon in aging cells. The data certainly supports this. However, this may be an overstatement as other mutations block CR, such as mutations that impair respiration. The authors do note that induction of the iron regulon in aging cells could be a response to impaired mitochondrial function. Thus, it seems that the main goal of CR and Ssd1 OE may be to restore mitochondrial function in aging cells, one way being inactivation of the iron regulon. A discussion of how other mutations impact CR would be of benefit.

      While some labs have shown that respiration impacts CR, this is not the case in other studies. For example, an impactful paper by Kaeberlein et al., PLOS Genetics 2005 showed that CR does extend lifespan in respiratory deficient strains using many different strain backgrounds.

      (5) The cell cycle regulation of Ssd1 OE condensates is very interesting. There does not appear to be literature linking Ssd1 with proteasome-dependent protein turnover. Many proteins involved in cell cycle regulation and genome stability are regulated through ubiquitination. It is not necessary to do anything here about it, but it would be interesting to address how Ssd1 condensates may be regulated with such precision.

      We see no evidence of changes in Ssd1 protein intensity during the cell cycle. We therefore speculate that the difference is at the post-translational level rather than in Ssd1 degradation. Known cell cycle-regulated phosphatase and kinase activities regulate the Ssd1 phosphorylation and condensation state, and the timing of their function matches when the Ssd1 condensates appear and dissolve in the cell cycle. We have now discussed this and allude to it in the model.

      (6) While reading the draft, I kept asking myself what the relevance to human biology was. I was very impressed with the extensive literature review at the end of the discussion, going over how well conserved this strategy is in yeast with humans. I suggest referring to this earlier, perhaps even in the abstract. This would nail down how relevant this model is for understanding human longevity regulation.

      Thank you, we have now mentioned in the abstract the relevance to human work. 

      In conclusion, I enjoyed reading this manuscript, describing how Ssd1 OE and CR lead to RLS increases, using different mechanisms. However, since the 2 strategies appear to be using redundant mechanisms, I was surprised that synergism was not observed.

      We thank the reviewer for their kind comment. We propose that Ssd1 overexpression impacts the levels of the iron regulon transcripts, which would be downstream of the point in the pathway that is affected by CR, i.e., nuclear localization of Aft1. The lack of synergy fits with this model, as Ssd1 overexpression cannot impact the iron regulon transcripts if they are not induced due to CR. We have now improved the model to make the impact of these different anti-aging interventions on activation of the iron regulon more clear.

      Reviewer #3 (Public review):

      My main concern is that the central reasoning of the paper, that Ssd1 overexpression and CR prevent the activation of the iron regulon, appears to be contradicted by previous findings, and the authors may actually be misrepresenting these studies, unless I am mistaken. In the manuscript, the authors state on two occasions:

      "Intriguingly, transcripts that had altered abundance in CR vs control media and in SSD1 vs ssd1∆ yeast included the FIT1, FIT2, FIT3, and ARN1 genes of the iron regulon (8)"

      "Ssd1 and CR both reduce the levels of mRNAs of genes within the iron regulon: FIT1, FIT2, FIT3 and ARN1 (8)"

      However, reference (8) by Kaeberlein et al. actually says the opposite:

      "Using RNA derived from three independent experiments, a total of 97 genes were observed to undergo a change in expression >1.5-fold in SSD1-V cells relative to ssd1d cells (supplemental Table 1 at http://www.genetics.org/supplemental/). Of these 97 genes, only 6 underwent similar transcriptional changes in calorically restricted cells (Table 2). This is only slightly greater than the number of genes expected to overlap between the SSD1-V and CR datasets by chance and is in contrast to the highly significant overlap in transcriptional changes observed between CR and HAP4 overexpression (Lin et al. 2002) or between CR and high external osmolarity (Kaeberlein et al. 2002). Intriguingly, of the 6 genes that show similar transcriptional changes in calorically restricted cells and SSD1-V cells, 4 are involved in iron-siderochrome transport: FIT1, FIT2, FIT3, and ARN1 (supplemental Table 1 at http://www.genetics.org/supplemental/)."

      Although the phrasing might be ambiguous at first reading, this interpretation is confirmed upon reviewing Matt Kaeberlein's PhD thesis: https://dspace.mit.edu/handle/1721.1/8318 (page 264 and so on).

      Moreover, consistent with this, activation of the iron regulon during calorie restriction (or the diauxic shift) has also been observed in two other articles:

      https://doi.org/10.1016/S1016-8478(23)13999-9

      https://doi.org/10.1074/jbc.M307447200

      Taken together, these contradictory data might blur the proposed model and make it unclear how to reconcile the results.

      We thank the reviewer for pointing this out. Upon further consideration, we have now removed all mention of this paper from our manuscript, as it is irrelevant to our situation: the mRNA abundance studies during CR, or with and without Ssd1, were not performed in situations in which the iron regulon is even activated, such as aging, so there would not be any opportunity to detect reduced transcript levels due to CR or Ssd1 presence. Also, none of these studies were performed with Ssd1 overexpression, which is the situation we are examining. Our data clearly show that Ssd1 overexpression and CR reduced and prevented, respectively, production of proteins from the iron regulon during aging.

      We do not feel that the iron regulon being activated by nutrient depletion at the diauxic shift is a fair comparison to the situation in cells happily dividing during CR. The levels of nutrient deprivation used in those studies have profound effects, including arresting cell growth, activating autophagy, and altering metabolism. The level of CR that we use (0.05% glucose) does not activate any of these changes, nor the iron regulon, in young or old cells (Fig. 4).

      Reviewer #1 (Recommendations for the authors):

      (1) The role of Ssd1 condensate formation in mRNA sequestration and lifespan expansion remains unclear. Thus, the study involves two parts (Ssd1 condensate formation and lifespan expansion via limiting Fe2+ accumulation), which are poorly linked. The study will therefore benefit from further data linking the two aspects.

      Future experiments are planned to determine what mRNAs reside in the age-induced Ssd1 overexpression condensates, to determine if they include the iron regulon transcripts. This will require us to optimize isolation of old cells and isolation of the Ssd1 condensates from them, and is beyond the scope of the present study.

      (2) The beneficial effects of Ssd1 overexpression and calorie restriction (CR) on lifespan are epistatic, yet the claim that both experimental conditions act via the same pathway should be further documented. It is recommended to combine Ssd1 overexpression with a well-defined condition that expands lifespan through a mechanism not involving changes in Fe2+ levels. A further increase in lifespan upon combining such conditions would at least indirectly support the authors' claim.

      We have more than epistatic evidence to indicate that Ssd1 overexpression and CR are in the same pathway. Ssd1 overexpression and CR result in failure to properly induce the iron regulon during aging and subsequent reduced levels of iron, resulting in lifespan extension, supporting that they act via the same pathway. We do appreciate the point though and epistasis analyses are on our list for future studies.

      (3) It is highly recommended to analyze ssd1 knockout cells: Is the shortened lifespan caused by intracellular Fe2+ accumulation, as predicted by the model? Does the knockout lead to an overactivation of the iron regulon? Such analysis will also document the physiological relevance of authentic Ssd1 levels in controlling yeast lifespan. The authors could test this possibility by determining intracellular Fe2+ levels (as done in Figure 5) and testing whether the mutant cells are partially rescued by the presence of an iron chelator (as done in Figure 5C).

      We don’t think the normal role of Ssd1 is to sequester the iron regulon mRNAs to prevent its activation, given that wild-type yeast with endogenous Ssd1 activate the iron regulon during aging. Rather, the failure to activate the iron regulon during aging is unique to Ssd1 overexpression and does not occur at endogenous Ssd1 levels. As such, it may not be the case that the short lifespan of ssd1 yeast is due to iron accumulation (if that happens); yeast lacking SSD1 also have cell wall biogenesis problems, and the defects in cell wall biogenesis shorten the replicative lifespan (Molon et al., Biogerontology 2018, PMID 29189912).

      (4) Figure 4: The authors could not analyze the impact of Ssd1 overexpression on the localization of GFP-Aft1 due to synthetic sickness. This was not observed under calorie restriction (CR) conditions and is therefore unexpected. Why should Ssd1 overexpression and CR have such diverse impacts on cellular physiology when combined with GFP-Aft1? Isn't that observation arguing against CR and increased Ssd1 levels acting through the same pathway? A further clarification of this point is necessary.

      Without further experimentation, we can only speculate that cellular changes that are unique to overexpression of Ssd1 and not shared with CR cause a negative interaction with GFP-Aft1. Of note, Aft1 has functions in addition to its role in activating the iron regulon (aft1∆ strains have a growth defect independent from its role in iron regulon activation [27]), and we have shown previously that Ssd1 overexpression reduces global protein translation. Future experiments would be necessary to delineate the basis for this synthetic sickness.

      (5) Lowering Fe2+ levels upon Ssd1 overexpression is predicted to reduce oxidative stress. It is suggested to determine ROS levels upon Ssd1 overexpression to bolster that point.

      This is a great suggestion. The lowering of Fe2+ in the Ssd1 mutants is something that happens at the end of the lifespan and therefore we would need to do experiments to detect reduced ROS using a live dye on our microfluidics platform. We are not aware of any live fluorescent reporters of ROS.  

      Reviewer #2 (Recommendations for the authors):

      (1) Page 6, 7th line of Replicative lifespan analyses, there is a double bracket.

      This has been corrected. Thank you.

      (2) Page 18, line 6 of "failure to activate..." section, "revered" should be replaced with "reversed".

      This has been corrected. Thank you.

      (3) Page 23, fix writing on line 2 of "Effects of CR..." section.

      This has been corrected. Thank you.

      (4) Page 24, Author contributions section, replace "performed devised" with "designed".

      This has been corrected. Thank you.

      Reviewer #3 (Recommendations for the authors):

      (1) Figure 3C: The panel legend is somewhat confusing due to the color scheme and the scattering of labels across panels. A more consistent labeling strategy would help readability.

      We agree, and the labelling has now been improved. Thank you. 

      (2) Figure 3D vs Figure 3B: it appears that Fit2 activation occurs substantially earlier than Aft1 translocation, which reduces the predictive value of Fit2 compared to Aft1. This is puzzling given that Fit2 is expected to be a direct target of Aft1. Could this discrepancy be related to the thresholding used for Fit2-mCherry display? The color scale in Figure 3D is also somewhat misleading, as most of the segments appear greenish. A continuous color gradient, perhaps restricted to the [10-120] interval, might give a clearer picture of iron regulon activation.

      For the Aft1-mCherry experiment, we are only able to accurately annotate nuclear localization when Aft1 has been fully (or mostly) translocated into the nucleus from the cytoplasm, so these data are likely to be on the conservative side. However, activation of the iron regulon likely occurs as Aft1 is translocated into the nucleus, so a minimal initial amount of Aft1 (which we don’t have enough resolution in this system to detect) could be enough for FIT2 and ARN1 induction. By contrast, the Fit2 and Arn1 signal measures an increase over a background of nothing, so it is very easy to detect even at low-level induction. To allow readers to see all our data without over-thresholding, we prefer to present the induction of Fit2 and Arn1 at all intensity levels, even the very low-level induction (green).

      (3) "In control strains, expression of Fit2 and Arn1 varied across the population, but generally increased with age": for the right panel, normalization might be more appropriate. What is the fold change in fluorescence during lifespan? Reporting ΔmCherry intensity alone does not provide a quantitative measure of induction.

      We have changed the figure to show quantitation as fold change, as suggested.

      (4) Figure 6 (model): The model figure is conceptually useful but not easy to follow in its current form; a revised schematic with a clearer depiction of the pathway activations at different replicative ages would be helpful.

      We have changed the figure to make the model more clear, as suggested.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      Ravichandran et al investigate the regulatory panels that determine the polarization state of macrophages. They identify regulatory factors involved in M1 and M2 polarization states by using their network analysis pipeline. They demonstrate that a set of three regulatory factors (RFs), i.e., CEBPB, NFE2L2, and BCL3, can change macrophage polarization from the M1 state to the M2 state. They also show that siRNA-mediated knockdown of those three RFs in THP1-derived M0 cells, in the presence of an M1 stimulant, increases the expression of M2 markers and decreases the bactericidal effect. This study provides an elegant computational framework to explore the macrophage heterogeneity upon different external stimuli and adds an interesting approach to understanding the dynamics of macrophage phenotypes after pathogen challenge.

      Strengths:

      This study identified new regulatory factors involved in M1 to M2 macrophage polarization. The authors used their own network analysis pipeline to analyze the available datasets. The authors showed 13 different clusters of macrophages that encounter different external stimuli, which is interesting and could be translationally relevant as in physiological conditions after pathogen challenge, the body shows dynamic changes in different cytokines/chemokines that could lead to different polarization states of macrophages. The authors validated their primary computational findings with in vitro assays by knocking down the three regulatory factors (NCB).

      We thank the reviewer for reading our manuscript and for the encouraging comments.

      Weaknesses:

      One weakness of the paper is the insufficient analysis performed on all the clusters. They used macrophages treated with 28 distinct stimuli, which included a very interesting combination of pro- and anti-inflammatory cytokines/factors that can be very important in the context of in vivo pathogen challenge, but they did not characterize the full spectrum of clusters. 

      We have performed a functional enrichment analysis of all the clusters and added a section describing the results (Fig 1B). We believe this work will provide a basis for future experiments to characterize other clusters.

      We have also performed a Principal Component Analysis (PCA) using hallmark genes of inflammation and the NCB panel alone to show the relative position of all clusters with respect to each other.

      Although they mentioned that their identified regulatory panels could determine the precise polarization state, they restricted their analysis to only the two well-established macrophage polarization states, M1 and M2. Analyzing the other states beyond M1 and M2 could substantially advance the field. They mentioned the regulatory factors involved in individual clusters but did not study the potential pathway involving the target genes of these regulatory factors, which can show the importance of different macrophage polarization states. Importantly, these findings were not validated in primary cells or using in vivo models.

      We agree it would be useful to demonstrate the polarization switch in other systems as well. However, it is currently infeasible for us to perform these experiments. 

      Reviewer #2 (Public Review):

      Summary:

      The authors of this manuscript address an important question regarding how macrophages respond to external stimuli to create different functional phenotypes, also known as macrophage polarization. Although this has been studied extensively, the authors argue that the transcription factors that mediate the change in state in response to a specific trigger remain unknown. They create a "master" human gene regulatory network and then analyze existing gene expression data consisting of PBMC-derived macrophage response to 28 stimuli, which they sort into thirteen different states defined by perturbed gene expression networks. They then identify the top transcription factors involved in each response that have the strongest predicted association with the perturbation patterns they identify. Finally, using S. aureus infection as one example of a stimulus that macrophages respond to, they infect THP-1 cells while perturbing regulatory factors that they have identified and show that these factors have a functional effect on the macrophage response.

      Strengths:

      The computational work done to create a "master" hGRN, response networks for each of the 28 stimuli studied, and the clustering of stimuli into 13 macrophage states is useful. The data generated will be a helpful resource for researchers who want to determine the regulatory factors involved in response to a particular stimulus and could serve as a hypothesis generator for future studies.

      The streamlined system used here - macrophages in culture responding to a single stimulus - is useful for removing confounding factors and studying the elements involved in response to each stimulus.

      The use of a functional study with S. aureus infection is helpful to provide proof of principle that the authors' computational analysis generates data that is testable and valid for in vitro analysis.

      We thank the reviewer for reading our manuscript and for the encouraging comments.

      Weaknesses:

      Although a streamlined system is helpful for interrogating responses to a stimulus without the confounding effects of other factors, the reality is that macrophages respond to these stimuli within a niche and while interacting with other cell types. The functional analysis shown is just the first step in testing a hypothesis generated from this data and should be followed with analysis in primary human cells or in an in vivo model system if possible.

      It would be helpful for the authors to determine whether the effects they see in the THP-1 immortalized cell line are reproduced in another macrophage cell line, or ideally in PBMC-derived macrophages.

      We agree; it would be useful in the future to demonstrate the polarization switch in other systems as well. We believe the results we provide here will inform future studies on other systems.

      The paper would benefit from an expanded explanation of the network mining approach used, as well as the cluster stability analysis and the Epitracer analysis. Although these approaches may be published elsewhere, readers with a non-computational background would benefit from additional descriptions.

      We have elaborated on the network mining approach and added a schematic diagram (Fig S13) to describe the EpiTracer algorithm.

      Although the authors identify 13 different polarization states, they return to the iM0/M1/M2 paradigm for their validation and functional assays. It would be useful to comment on the broader applications of a 13-state model.

      We have included a new figure panel describing the functional enrichment analysis of all the clusters (Fig 1B) and added a section describing the results. We have also performed a Principal Component Analysis (PCA) using hallmark genes of inflammation and the NCB panel alone to show the relative position of all clusters with respect to each other. The PCA plot shows that C11 (M1) and C3 (M2) lie roughly at two extreme ends, with the other clusters between them, forming something resembling a punctuated continuum of states.

      The relative contributions of each "switching factor" to the phenotype remain unclear, especially as knocking out each individual factor changes different aspects of the model (Fig. S5).

      Fig S5 shows the effect on phenotype upon individual knockdown of the switching factors, from which we deduce that CEBPB has the largest contribution in determining the phenotype. However, we maintain that all three genes are necessary as a panel for M1/M2 switching. 

      Reviewer #1 (Recommendations For The Authors):

      The manuscript by Ravichandran et al. describes the networks of genes that they named "RF" associated with M1 to M2 polarization of macrophages by using their computational pipelines. They have shown 13 clusters of human macrophage polarization state by using an available database of different combinatorial treatments with cytokines, endotoxin, or growth factors, which is interesting and could be useful in the research field. However, there are a few comments which will help to understand the subject more precisely.

      (1,2) The authors claimed to identify key regulatory factors involved in the human macrophage polarization from M1 to M2. However, recent advances suggest that macrophage polarization cannot be restricted to M1 and M2 only, which is also supported by the authors' data that shows 13 clusters of macrophages. However, they only focused on the difference between clusters 11 and 3 considering conventional M1 and M2. It will be more interesting to analyze the other clusters and how they relate to the established and simplistic M1 and M2 paradigms.

      It will be interesting to know if they found any difference in the enriched pathways among these different clusters considering the exclusive regulatory factors and their targets.

      We appreciate the point and have addressed it as follows. In the revised manuscript, we have discussed the clusters in detail and have provided the key regulatory factor (RF) combinations and target genes that define distinct macrophage population states (please refer to Data files S2 and S3). We have also discussed the immunological processes associated with each cluster, particularly in relation to the C11 and C3 clusters. We have added a new panel to Fig 1 showing a heatmap of the enrichment of inflammation-relevant pathways in each of the clusters (Fig 1B). Indeed, there is a substantial difference in the enrichment terms between the extreme ends (M1, M2) and significant differences in some of the pathways between clusters.

      (3) The authors have shown the involvement of NCB at 72h post LPS treatment. Are these RF involved in late response genes or act at the earlier time point of LPS treatment? Understanding the RF involvement in the dynamic response of macrophages to any stimulant will be important.

      Using the data available for different time points (30 minutes to 72 hours), we plotted the fold change (with respect to unstimulated cells) in the M1 and M2 clusters for each of the NCB genes and observed a clear divergence in the trend at 24 hours. These plots are provided as the newly added Supplementary Figure 9A-C.
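
      The fold-change comparison described in this response can be sketched as follows. This is only an illustrative outline: the expression values, the divergence threshold, and the function names are hypothetical and are not taken from the manuscript or its data.

```python
import math

def log2_fold_changes(stimulated_by_time, unstimulated):
    """log2 fold change at each time point relative to unstimulated cells."""
    return {t: math.log2(value / unstimulated) for t, value in stimulated_by_time.items()}

def first_divergence(fc_a, fc_b, threshold=1.0):
    """Earliest time point at which two fold-change curves differ by >= threshold."""
    for t in sorted(fc_a):
        if abs(fc_a[t] - fc_b[t]) >= threshold:
            return t
    return None

# Hypothetical expression values for one gene (time in hours); not real data.
unstimulated = 10.0
m1_expression = {0.5: 10.0, 4: 20.0, 24: 80.0, 72: 80.0}
m2_expression = {0.5: 10.0, 4: 20.0, 24: 10.0, 72: 5.0}

m1_fc = log2_fold_changes(m1_expression, unstimulated)
m2_fc = log2_fold_changes(m2_expression, unstimulated)
divergence_time = first_divergence(m1_fc, m2_fc)
print("first divergence at", divergence_time, "hours")
```

      In this toy series the two curves track each other until the 24-hour point, which is the kind of pattern the response describes.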

      (4) The authors showed that the knockdown of RF- NCB can switch the M1 to M2. However, they showed a few conventional markers known to be M2 markers. What happens if NCB is overexpressed or knocked down in other treatment conditions/other clusters? Is the RF-NCB only involved in these two specific stimulations or their overexpression can promote M2 polarization in any given stimuli?

      This is an interesting question. For practical reasons, however, the experimental work was limited to the M1 and M2 clusters, as the aim was to establish proof of concept. Scaling it up to all clusters would require a large amount of work and possibly a separate study. We believe the description of the clusters that we have provided will enable the design of future experiments that shed light on the significance of the intermediate clusters.

      (5) The authors have shown that knockdown of RF- NCB decreases pathogen clearance, but what are their altered functions? Are they more efficient in cellular debris clearance or resolution of inflammation? The authors can check the mRNA expression of markers/cytokines involved in those processes, in the NCB knockdown condition.

      Indeed. Expression levels were measured for the following genes: CXCL2, IL1B, iNOS, SOCS3 (which are pro-inflammatory markers), as well as MRC1, ARG1, TGFB, IL10 (anti-inflammatory markers), as shown in Fig 4B.  

      Minor comments:

      (1, 2). How the authors evaluate the performance of their knowledge-based gene network. The authors should write the methods in detail, how they generated the simulated network, and evaluated the simulated dataset.

      Gene network construction and module detection have many tools available. The authors need to mention which one they used. The authors should show whether their findings are consistent with at least another two module-detection methods (eg; "RedeR") to strengthen their claim.

      We have added a schematic figure (Supplementary Fig S11) and detailed description of network construction and mining in the Methods section, as follows: We have reconstructed a comprehensive knowledge-based human Gene Regulatory Network (hGRN), which consists of Regulatory Factors (RF) to Target Gene (TG) and RF to RF interactions. To achieve this, we curated experimentally determined regulatory interactions (RF-TG, RF-RF) associated with human regulatory factors (Wingender et al., 2013). These interactions were sourced from several resources, including: (a) literature-curated resources like the Human Transcriptional Regulation Interactions database (HTRIdb) (Bovolenta et al., 2012), Regulatory Network Repository (RegNetwork) (Liu et al., 2015), Transcriptional Regulatory Relationships Unraveled by Sentence-based Text-mining (TRRUST) (Han et al., 2015), and the TRANSFAC resource from Harmonizome (Rouillard et al., 2016);  (b) ChEA3, which contains ChIP-seq determined interactions (Keenan et al., 2019); and (c) high-confidence protein-protein binding interactions (RF-RF) from the human protein-protein interaction network-2 (hPPiN2) (Ravichandran et al., 2021). As a result, our hGRN comprises 27,702 nodes and 890,991 interactions.  It is important to note that none of the edges/interactions in the hGRN are data-driven. We utilized this extensive hGRN, which encompasses the experimentally determined interactions/edges, to infer stimulant-specific hGRNs and top paths using our in-house network mining algorithm, ResponseNet. We have previously demonstrated that ResponseNet, which utilizes a knowledge-based network and a sensitive interrogation algorithm, outperformed data-driven network inference methods in capturing biologically relevant processes and genes, whose validation is reported earlier (Ravichandran and Chandra, 2019; Sambaturu et al., 2021).
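
      As a rough illustration of how a knowledge-based network can be assembled from several curated resources, the sketch below takes the union of directed regulator-to-target edges while recording which resource contributed each edge. The source names and edges are hypothetical placeholders, not entries from HTRIdb, RegNetwork, TRRUST, TRANSFAC, ChEA3, or hPPiN2, and this is not the authors' curation pipeline.

```python
def build_master_network(sources):
    """Union directed (regulator, target) edges across sources, keeping provenance."""
    edges = {}
    for source_name, edge_list in sources.items():
        for regulator, target in edge_list:
            edges.setdefault((regulator, target), set()).add(source_name)
    nodes = {gene for edge in edges for gene in edge}
    return nodes, edges

# Hypothetical placeholder edges; real resources would be parsed from files.
toy_sources = {
    "literature_db": [("RF_A", "TG_1"), ("RF_A", "RF_B")],
    "chipseq_db": [("RF_A", "TG_1"), ("RF_B", "TG_2")],
    "ppi_db": [("RF_B", "RF_C")],
}

nodes, edges = build_master_network(toy_sources)
print(len(nodes), "nodes;", len(edges), "interactions")
```

      Because every edge comes from a curated resource rather than from expression data, none of the edges in such a network is data-driven, which is the property the response emphasizes.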

      We utilized our in-house response network approach to identify the stimulant-specific top active and repressed perturbations (Ravichandran and Chandra, 2019; Sambaturu et al., 2021). This is clearly described in the revised manuscript. To summarize, we generated stimulant-specific Gene Regulatory Networks (GRNs) by applying weights to the master human Gene Regulatory Network (hGRN) based on differential transcriptomic responses to stimulants (i.e., comparing stimulant-treated conditions to baseline). We then produced individually weighted networks for each stimulant and implemented a refined network mining technique to extract the most significant pathways. Furthermore, we have previously conducted a systematic comparison of our network mining strategy with other data-driven module detection methods, including jActiveModules (Ideker et al, 2002), WGCNA (Langfelder et al, 2008), and ARACNE (Margolin et al, 2006). Our findings demonstrated that our approach outperformed conventional data-driven network inference methods in capturing the biologically pertinent processes and genes (Ravichandran and Chandra, 2019). Since we have experimentally validated what we predicted from the network analysis, we do not see a need for performing the computational analysis with another algorithm. Moreover, different network analyses identify functionally relevant genes or subnetworks based on different criteria. While each of them outputs useful information, an unbiased network of this scale may contain many different biologically significant subnetworks and genes. The outputs of different methods therefore need not agree with each other, as they may capture different aspects altogether, and such a comparison is not guaranteed to be informative.
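
      The weighting step summarized above can be illustrated with a deliberately simplified sketch. It is not the ResponseNet algorithm: the node-weight definition (absolute log2 fold change with a pseudocount), the edge score (product of endpoint weights), and all gene names and values are assumptions chosen only to show the general idea of overlaying differential expression on a fixed knowledge-based network.

```python
import math

def node_weights(treated, baseline, pseudocount=1.0):
    """Absolute log2 fold change per gene (stimulated vs. baseline)."""
    return {
        gene: abs(math.log2((treated[gene] + pseudocount) / (baseline[gene] + pseudocount)))
        for gene in treated
    }

def weight_edges(edges, weights):
    """Score each directed edge by the product of its endpoint weights."""
    return {(u, v): weights.get(u, 0.0) * weights.get(v, 0.0) for u, v in edges}

def top_edges(edge_scores, k):
    """Return the k highest-scoring edges."""
    return sorted(edge_scores, key=edge_scores.get, reverse=True)[:k]

# Hypothetical master network and expression values; not real data.
master_edges = [("RF_A", "TG_1"), ("RF_A", "RF_B"), ("RF_B", "TG_2")]
baseline = {"RF_A": 7.0, "RF_B": 7.0, "TG_1": 7.0, "TG_2": 7.0}
treated = {"RF_A": 31.0, "RF_B": 7.0, "TG_1": 63.0, "TG_2": 15.0}

scores = weight_edges(master_edges, node_weights(treated, baseline))
best = top_edges(scores, k=1)
print(best)
```

      The same master edge list is reused for every stimulant; only the weights change, so each stimulant yields its own ranked view of one shared network.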

      (3) Representation of Fig 2B is difficult to understand the authors' interpretation of 'the 3-RF combination has 1293 targets, 359 covering about 53% of the top-perturbed network' for general readers. If the authors can simplify the interpretation will be helpful for the readers.

      This is replaced with clearer figures in the revised manuscript (Figure 2A, 2B), and the associated text is also rephrased for clarity.

      Reviewer #2 (Recommendations For The Authors):

      Major comments:

      (1) It would be helpful for the authors to determine whether the effects they see in the THP-1 immortalized cell line are reproduced in another macrophage cell line, or ideally in PBMC-derived macrophages if this is feasible. If using PBMC- or bone marrow-derived macrophages is beyond the scope of what the authors can reasonably perform, they could consider using another macrophage cell line such as RAW 264.7 cells, which would also provide orthogonal validation from a mouse model.

      At this point in time, it is unfortunately infeasible for us to perform these experiments due to resource limitations; they would also require a lot of time. We hope that our work provides pointers for anyone working on mouse models or other model systems to design their studies on regulatory controls, and that the generalizability of our findings in THP-1 cell lines to other systems will eventually emerge.

      (2) It would be helpful for the authors to provide an expanded explanation of the network mining approach used, as well as the cluster stability analysis and the Epitracer analysis. Although these approaches may be published elsewhere, readers with a non-computational background would benefit from additional descriptions. A schematic figure would also be helpful to clarify their approach.

      We have added a new schematic diagram to the Supplementary figures (S13) and detailed text to the Methods section describing the network mining analysis and EpiTracer identification in the revised manuscript.

      (3) It would be helpful for the authors to comment on whether the thirteen polarization states that they identify align with other analyses that have been performed using data collected from stimulated macrophages, or whether this is a novel finding, especially as the original paper from which the primary data are derived identified 9 clusters. More broadly, since the authors eventually return to the M1-M2 paradigm, it is unclear whether there is any functional support for a 13-state model - it is also possible that macrophages exist along a continuum of stimulation states rather than in discrete clusters. This at least merits further discussion, which could focus on different axes of polarization as discussed and shown in the original paper.

      As described in the manuscript, clustering based on the differential transcriptome profile of RF-set1, which contains 265 transcription factors (TFs), in response to 28 stimulants, resulted in 13 distinct clusters. The cluster member associations inferred from RF-set1 were similar in number and pattern to those inferred from the entire differential transcriptome (n=12,164; Fig. S2, cophenetic coefficient = 0.68; p-value = 1.25e−51). Furthermore, the inferred cluster pattern largely matched the clustering pattern previously described for the same dataset (Xue et al., 2014). Our contribution: The pattern we observed from the top-ranked epicenters in each cluster suggests that a subset of differentially expressed genes (DEGs) present in our top networks is sufficient for achieving differentiation. Our gene-regulatory models suggest that saturated (SA and PA) and unsaturated (LA, LiA, and OA) fatty acids, which were previously grouped together, mediate distinct modes of resolution and are now separated into two sub-branches. Similarly, the effects of IFNγ and sLPS, previously combined, are now distinctly resolved, aligning with known regulatory differences (Hoeksema et al., 2015; Kang et al., 2019).

      The principal takeaway from this analysis is not the exact number of clusters but rather the molecular basis it provides for the differentiation of functional states, with M1 and M2 representing two ends of the spectrum. Several other states are dispersed within the polarization spectrum, which we describe as a punctuated continuum. For our switching studies, we focused on clusters C11 (M1-like) and C3 (M2-like) due to their established functional relevance. However, future studies are required to explore the functional relevance of other clusters. We have added a discussion on this aspect as suggested.

      (4) It would be helpful to define the contribution of each component of the NCB group to M1 polarization.

      We assessed the impact of CEBPB, NFE2L2, and BCL3 on the C11 (M1-like) polarization state by quantifying the expression levels of M1 and M2 markers. Our findings indicate that knocking down CEBPB led to a significant downregulation in the expression of M1 markers and an increase in M2 marker expression. In contrast, NFE2L2 and BCL3 knockdown resulted in decreased expression of M1 markers without a corresponding significant increase in M2 markers. These results suggest that CEBPB is crucial for the M1-to-M2 transition. We have added a note on pg 22 to emphasize this better.

      (5) NRF2, CEBPb, and BCL3 all have well-described roles in macrophage polarization. To add clarity to their discussion, the authors should cite relevant literature (eg PMIDs 15465827, 27211851, and others) and discuss how their findings extend what is currently known about the contribution of these individual proteins to macrophage responses.

      The roles of NFE2L2, CEBPB and BCL3 in macrophage polarization and state transition are described in the discussion section. The PMIDs mentioned by the reviewer have been added as well.

      (6) The effect size of NCB knockdown in the in vitro Staph aureus model shown in 4C is fairly small - bacterial killing assays typically require at least a log of difference to demonstrate a convincing effect. It would be helpful for the authors to include a positive control for this experiment (for example, STAT4) to frame the magnitude of their effect.

      We thank the reviewer for the comment. However, we would like to point out that the difference in CFU is plotted on a log10 scale, as per common practice. On an absolute scale, the CFUs are therefore almost halved by the knockdown, and this result was reproduced multiple times with statistical significance (p-value < 0.01). We feel this is sufficient to demonstrate that the NCB gene set by itself brings about a change in polarization and hence in the killing effect. We have used STAT4 as a control for marker measurements, as shown in Fig 3C. While carrying out the CFU assay with siSTAT4 might add information, we performed the infection experiments with and without the NCB knockdown, as that remains the main focus of the study.
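
      The relationship between a difference on a log10 axis and the corresponding change on an absolute scale can be checked with a short calculation. The sketch below is generic arithmetic and uses no values from the study; the function names are illustrative.

```python
import math

def fold_change_from_log10_difference(delta_log10):
    """Absolute-scale ratio implied by a difference on a log10 axis."""
    return 10 ** delta_log10

def log10_difference_for_fold_change(fold):
    """Difference on a log10 axis implied by an absolute-scale ratio."""
    return math.log10(fold)

# A two-fold change (halving or doubling) spans about 0.3 log10 units,
# whereas "a log of difference" corresponds to a ten-fold change.
two_fold_in_logs = log10_difference_for_fold_change(2)
one_log_as_fold = fold_change_from_log10_difference(1)
print(round(two_fold_in_logs, 3), one_log_as_fold)
```

      This makes the two positions comparable: a near-halving of CFU appears as a shift of roughly 0.3 units on a log10 plot, well short of the one-log difference the reviewer mentions.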

      Minor recommendations:

      (1) Is there a difference between the data represented in Figure 1A-B and Figure S1? If this is the same data, there is no need to repeat it, and Figure 1 could be composed only of the current panels C and D.

      We have removed Figure 1A and B, as they illustrate the same point as Figure S1. We have retained panels C and D and renamed them as the new Figure 1A and C. In addition, we have added a new panel, Fig 1B (in response to earlier points).

      (2) Could Figure 2B be represented in a different way? The circles do not contain any readable information about the genes, and it may be less visually overwhelming to represent this with just the large and small triangles. Perhaps the individual genes represented by the circles could be listed in a supplemental table or Excel file.

      We have provided new Figure 2A and 2B panels for the M1 and M2 clusters, respectively, which show only the barcode genes along with a functional annotation. The full network is already provided in the supplementary data.

      (3) When indicating the N for all experiments performed in the figure legends, the authors should indicate whether these were technical or biological replicates.

      We thank the reviewer for the suggestion. We have indicated what N is in all figure legends.

      (4) Fig 3B: the y-axis is confusing - it appears that normalization is actually to the untreated cells.

      Yes indeed. The normalization is with respect to the untreated cells as per standard practice. We have indicated this clearly in the legend.

      (5) The 72-hour time point in Fig S8 shows unexpected results. Could the authors explain or propose a hypothesis for why CXCL2 and IL1b abruptly decrease while iNOS and MRC1 abruptly increase?

      The purpose of this experiment was to standardize the time point of M1 polarization post S. aureus infection. To this end, we profiled the expression levels of markers at various time points. We chose the 24-hour time point for all subsequent experiments based on the significant upregulation of NCB seen in the macrophages. We believe that the 72-hour time point may show different effects because the initial immune response would have waned, leading to differences in cytokine dynamics. However, as this is not the focus of our study, we do not discuss this aspect further.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      Crohn's disease is a prevalent inflammatory bowel disease that often results in patient relapse post anti-TNF blockades. This study employs a multifaceted approach utilizing single-cell RNA sequencing, flow cytometry, and histological analyses to elucidate the cellular alterations in pediatric Crohn's disease patients pre and post-anti-TNF treatment and comparing them with non-inflamed pediatric controls. Utilizing an innovative clustering approach, the research distinguishes distinct cellular states that signify the disease's progression and response to treatment. Notably, the study suggests that the anti-TNF treatment pushes pediatric patients towards a cellular state resembling adult patients with persistent relapses. This study's depth offers a nuanced understanding of cell states in CD progression that might forecast the disease trajectory and therapy response.

      Robust Data Integration: The authors adeptly integrate diverse data types: scRNA-seq, histological images, flow cytometry, and clinical metadata, providing a holistic view of the disease mechanism and response to treatment.

      Novel Clustering Approach: The introduction and utilization of ARBOL, a tiered clustering approach, enhances the granularity and reliability of cell type identification from scRNA-seq data.

      Clinical Relevance: By associating scRNA-seq findings with clinical metadata, the study offers potentially significant insights into the trajectory of disease severity and anti-TNF response; which might help with the personalized treatment regimens.

      Treatment Dynamics: The transition of the pediatric cellular ecosystem towards an adult, more treatment-refractory state upon anti-TNF treatment is a significant finding. It would be beneficial to probe deeper into the temporal dynamics and the mechanisms underlying this transition.

      Comparative Analysis with Adult CD: The positioning of on-treatment biopsies between treatment-naïve pediCD and on-treatment adult CD is intriguing. A more in-depth exploration comparing pediatric and adult cellular ecosystems could provide valuable insights into disease evolution.

      Areas of improvement:

      (1) The legends accompanying the figures are quite concise. It would be beneficial to provide a more detailed description within the legends, incorporating specifics about the experiments conducted and a clearer representation of the data points. 

      We agree that it is beneficial to have descriptive figure legends that balance elements of experimental design, methodology, and statistical analyses employed in order to have a clear understanding throughout the manuscript. We have gone through and clarified areas throughout.  

      (2) Statistical significance is missing from Fig. 1c WBC count plot, Fig. 2 b-e panels. Please provide it even if it's not significant. Also, the legend should have the details of stat test used.

      We have now added details of statistical significance to the Figure 1 legends. Please note that the Mann-Whitney U-test was used for clinical categorical data.
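
      For readers unfamiliar with the test named here, the sketch below computes the Mann-Whitney U statistic directly from its pairwise definition. The two samples are hypothetical numbers, not clinical values from the study, and the sketch stops at the statistic itself without deriving a p-value.

```python
def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: number of (a, b) pairs with a > b, ties counted as 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# Hypothetical measurements for two patient groups; not real data.
group_1 = [12.1, 9.8, 14.3, 11.0]
group_2 = [7.2, 8.9, 10.5]

u1 = mann_whitney_u(group_1, group_2)
u2 = mann_whitney_u(group_2, group_1)
print(u1, u2)
```

      The two statistics always sum to the number of pairs (here 4 × 3 = 12), and the test is rank-based, so it makes no assumption that the measurements are normally distributed.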

      (3) In the study, the NOA group is characterized by patients who, after thorough clinical evaluations, were deemed to exhibit milder symptoms, negating the need for anti-TNF prescriptions. This mild nature could potentially align the NOA group closer to FGID-a condition intrinsically defined by its low to non-inflammatory characteristics. Such an alignment sparks curiosity: is there a marked correlation between these two groups? A preliminary observation suggesting such a relationship can be spotted in Figure 6, particularly panels A and B. Given the prevalence of FGID among the pediatric population, it might be prudent for the authors to delve deeper into this potential overlap, as insights gained from mild-CD cases could provide valuable information for managing FGID.

      Thank you for this insightful point. On histopathology and endoscopy, the NOA group exhibited microscopic and macroscopic inflammation, which led to the CD diagnosis in these patients, albeit mild on both microscopic and macroscopic accounts. By contrast, the FGID group by definition does not show inflammation on microscopic or macroscopic evaluation. There is great interest in the field of adult and pediatric gastroenterology in understanding why patients develop symptoms without evidence of inflammation. However, as of 2023, the diagnostic tools of endoscopy with biopsy and histopathology are not sensitive enough to detect transcript-level inflammation, positioning single-cell technology to reveal further information in both disease processes.

      Based on the reviewer’s suggestions, we have calculated a heatmap of overlapping NOA and FGID cell states along the Figure 6a joint-PC1, showing where NOA CD patients and FGID patients overlap in terms of cell states. This is displayed in Supplemental Figure 15d. This revealed a set of T, Myeloid, and Epithelial cell states that were most important in describing variance along the FGID-CD axis, allowing us to home in on similarities at the boundary between FGID and CD. By comparing the joint cell states with CD atlas curated cluster names, we identified CCR7-expressing T cell states and GSTA2-expressing epithelial states associated with this overlap.

      (4) Furthermore, Figure 7 employs multi-dimensional immunofluorescence to compare CD, encompassing all its subtypes, with FGID. If the data permits, subdividing CD into PR, FR, and NOA for this comparison could offer a more nuanced understanding of the disease spectrum. Such a granular perspective is invaluable for clinical assessments. The key question then remains: do the sample categorizations for the immunofluorescence study accommodate this proposed stratification?

      Thank you for the thoughtful discussion. We agree that stratifying Crohn’s disease by PR, FR, and NOA would provide valuable clinical insight. Unfortunately, our multiplex IF cohort was designed to maximize overall CD versus FGID comparisons and does not contain enough samples in patient subgroups to power such an analysis. We have highlighted this limitation in the text.

      (5) The study's most captivating revelation is the proximity of anti-TNF-treated pediatric CD (pediCD) biopsies to adult treatment-refractory CD. Such an observation naturally raises the question: How does this alignment compare to a standard adult colon, and what proportion of this similarity is genuinely disease-specific versus reflective of an adult state? To what degree does the similarity highlight disease-specific traits?

      Delving deeper, it will be of interest to see whether anti-TNF treatment is nudging the transcriptional state of the cells towards a more mature adult stage or veering them into a treatment-resistant trajectory. If anti-TNF therapy is indeed steering cells toward a more adult-like state, it might signify a natural maturation process; however, if it's directing them toward a treatment-refractory state, the long-term therapeutic strategies for pediatric patients might need reconsideration.

      We thank the reviewer for another insightful point. We agree that age-matched samples are critical for evaluating disease cell states, and hence we have age-matched controls in our pediatric cohort. Our follow-up timeline spans only 3 years, so patients remain in the pediatric age range at the time of follow-up endoscopy and biopsy and would not reflect an adult GI state. We believe that the cellular behavior from the treatment-naïve biopsy to the on-treatment biopsy reflects the disease state rather than movement towards an adult-like state. We would also like to point out that pediatric-onset IBD (Crohn’s disease and ulcerative colitis) has traditionally been harder to treat and presents with more extensive disease (PMID: 22643596), so the ability to detect the need for therapy escalation or change would be an invaluable tool for clinicians.

      We share the reviewer’s interest in disentangling a natural maturation process from disease- and treatment-specific changes. Because the patients who were not given treatment did not move towards the adult-like phenotype, the shift could point to a push towards a treatment-resistant trajectory. To further support these findings, we generated a new disease-pseudotime figure (Supplemental Figure 17) using cross-validation methods and the tradeSeq package. This figure was designed to track how each pediatric sample shifts from the treatment-naïve state through anti-TNF therapy and to test the robustness of these shifts across samples. The new visualizations show patterns that do not recapitulate natural aging processes but rather shifts across all cell types associated with anti-TNF treatment.

      Reviewer #2 (Public Review):

      Summary:

      Through this study, the authors combine a number of innovative technologies including scRNAseq to provide insight into Crohn's disease. Importantly samples from pediatric patients are included. The authors develop a principled and unbiased tiered clustering approach, termed ARBOL. Through high-resolution scRNAseq analysis the authors identify differences in cell subsets and states during pediCD relative to FGID. The authors provide histology data demonstrating T cell localisation within the epithelium. Importantly, the authors find anti-TNF treatment pushes the pediatric cellular ecosystem toward an adult state.

      Strengths:

      This study is well presented. The introduction clearly explains the important knowledge gaps in the field, the importance of this research, the samples that are used, and study design.

      The results clearly explain the data, without overstating any findings. The data is well presented. The discussion expands on key findings and any limitations to the study are clearly explained.

      I think the biological findings from, and bioinformatic approach used in this study, will be of interest to many and significantly add to the field.

      Weaknesses:

      (1) The ARBOL approach for iterative tiered clustering on a specific disease condition was demonstrated to work very well on the datasets generated in this study where there were no obvious batch effects across patients. What if strong batch effects are present across donors where PCA fails to mitigate such effects? Are there any batch correction tools implemented in ARBOL for such cases?

      We thank the reviewer for their insightful point. The full extent to which ARBOL can address batch effects requires further study. To this end, we integrated Harmony into the ARBOL architecture and used it in the paper to integrate a previous study with the data presented (Figure 8). We have added instructions to ARBOL’s GitHub README on how to use Harmony with the automated clustering method. With ARBOL, as with traditional clustering methods, batch effects can cause artifactual clustering at any tier of clustering. Because the method is iterative, batch effects can present themselves in a single round of clustering, followed by further rounds of clustering that appear highly similar within each batch subset. Harmony addresses this issue by removing these batch-related clustering rounds. The later arrangement of fine-grained clusters using the bottom-up approach can use the batch-corrected latent space to calculate relationships between cell states, removing the effects from both sides of the algorithm. As stated, the extent to which ARBOL can be used to systematically address these batch effects requires further research, but the algorithmic architecture of ARBOL is well suited to address these effects.

      (2) The authors mentioned that the clustering tree from the recursive sub-clustering contained too much noise, and they therefore used another approach to build a hierarchical clustering tree for the bottom-level clusters based on unified gene space. But in general, how consistent are these two trees?

      Thank you for this thoughtful question. The two tree methodologies are not consistent due to their algorithmic differences, but both are important for several reasons: 

      (1) The clustering tree is top-down, meaning low-resolution lineage-related clusters are calculated first. Doublets and quality differences can cause very small clusters of different lineages (endothelial vs fibroblast) to fall under the incorrect lineage at first in the sub-clustering tree, but these are recaptured during further sub-clustering rounds, and then disentangled by the cluster-centroid tree.

      (2) The hierarchical tree is a rose tree, meaning each branching point can contain several daughter branches, while taxonomies based on distances between species (or cell types in this case) are binary trees with only 2 branches per branching point, because distances between each cluster are unique. Because this taxonomy, the bottom-up approach, differs from the top-down approach, it is useful to then look at how these bottom-level clusters are similar. To that end, we performed pairwise differential expression between all end clusters and clustered based on those genes.

      (3) Calculation of a binary tree represents a quantitative basis for comparing the transcriptomic distance between clusters as opposed to relying on distances calculated within a heuristic manifold such as UMAP or algorithmic similarity space such as cluster definitions based on KNN graphs.

      In practice, this dual view rescues small clusters that may have been mis-grouped by technical artifacts and gives a quantitative, distance-based hierarchy that can be compared across metadata covariates.
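      The bottom-up step described above can be illustrated with a minimal sketch. It is not ARBOL's implementation: it assumes Euclidean distance between cluster centroids and a size-weighted centroid merge, whereas the response describes clustering on genes from pairwise differential expression. The cluster names, the toy expression values, and the helper names `centroid` and `binary_tree` are hypothetical.

```python
from itertools import combinations
from math import dist


def centroid(rows):
    """Mean expression vector of the cells (rows) in one end cluster."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def binary_tree(centroids):
    """Greedy agglomerative merge: repeatedly join the two closest nodes.

    `centroids` maps cluster name -> centroid vector. The result is a nested
    tuple in which every internal node has exactly two children.
    """
    # Each node key is a leaf name or a (left, right) tuple;
    # each value is (centroid vector, number of leaves under the node).
    nodes = {name: (vec, 1) for name, vec in centroids.items()}
    while len(nodes) > 1:
        a, b = min(
            combinations(nodes, 2),
            key=lambda pair: dist(nodes[pair[0]][0], nodes[pair[1]][0]),
        )
        (va, na), (vb, nb) = nodes.pop(a), nodes.pop(b)
        # Assumption: the merged node sits at the size-weighted mean.
        merged = [(x * na + y * nb) / (na + nb) for x, y in zip(va, vb)]
        nodes[(a, b)] = (merged, na + nb)
    return next(iter(nodes))


# Hypothetical end clusters: two cells each, two genes per cell.
toy_cells = {
    "endothelial_1": [[1.0, 0.0], [1.2, 0.2]],
    "endothelial_2": [[1.0, 1.0], [1.2, 0.8]],
    "fibroblast_1": [[9.0, 9.0], [9.2, 8.8]],
}
tree = binary_tree({name: centroid(rows) for name, rows in toy_cells.items()})
print(tree)
```

      In this toy example the two endothelial clusters are joined first and the fibroblast cluster attaches at the root, which is the kind of lineage-level regrouping the centroid tree is meant to expose.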

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary:

      In their previous publication (Dong et al. Cell Reports 2024), the authors showed that citalopram treatment resulted in reduced tumor size by binding to the E380 site of GLUT1 and inhibiting the glycolytic metabolism of HCC cells, instead of the classical citalopram receptor. Given that C5aR1 was also identified as the potential receptor of citalopram in the previous report, the authors focused on exploring the potential of the immune-dependent anti-tumor effect of citalopram via C5aR1. C5aR1 was found to be expressed on tumor-associated macrophages (TAMs) and citalopram administration showed potential to improve the stability of C5aR1 in vitro. Through macrophage depletion and adoptive transfer approaches in HCC mouse models, the data demonstrated the potential importance of C5aR1-expressing macrophages in the anti-tumor effect of citalopram in vivo. Mechanistically, their in vitro data suggested that citalopram may regulate the phagocytosis potential and polarization of macrophages through C5aR1. Next, they tried to investigate the direct link between citalopram and CD8+ T cells by including an additional MASH-associated HCC mouse model. Their data suggest that citalopram may upregulate the glycolytic metabolism of CD8+ T cells, probably via GLUT3 but not GLUT1-mediated glucose uptake. Lastly, as the systemic 5-HT level is down-regulated by citalopram, the authors analyzed the association between a low 5-HT level and a superior CD8+ T cell function against a tumor. Although the data are informative, the rationale for working on additional mechanisms and the logical links among the different parts are not clear. In addition, some of the conclusions are also not fully supported by the current data.

      We thank the reviewer for their comprehensive summary of our study and appreciate the valuable feedback. We have made improvements based on these comments, and a detailed response addressing each point is presented below.

      Strengths: 

      The idea of repurposing drugs already in clinical use shows great potential for immediate clinical translation. The data here suggest that the antidepressant citalopram plays an immune regulatory role on TAMs via a new target, C5aR1, in HCC.

      We thank the reviewer for recognizing the strengths of our study.

      Weaknesses: 

      (1) The authors concluded that citalopram had a 'potential immune-dependent effect' based on the tumor weight difference between Rag-/- and C57 mice in Figure 1. However, tumor weight differences may also be attributed to a non-immune regulatory pathway. In addition, how do the authors calculate relative tumor weight? What is the rationale for using relative one but not absolute tumor weight to reflect the anti-tumor effect? 

      We appreciate your insights into the potential contributions of non-immune regulatory pathways to the observed tumor weight differences between Rag1<sup>-/-</sup> and wild-type C57BL/6 mice. Indeed, the anti-tumor effects of citalopram involve non-immune mechanisms. Previously, we have demonstrated the direct effects of citalopram on cancer cell proliferation, apoptosis, and metabolic processes (PMID: 39388353). In this study, we focused on immune-dependent mechanisms, utilizing Rag1<sup>-/-</sup> mice to investigate a potential immune-mediated effect. The relative tumor weight was calculated by assigning an arbitrary value of 1 to the Rag1<sup>-/-</sup> mice in the DMSO treatment group, with all other tumor weights expressed relative to this baseline. As suggested, we have included absolute tumor weight data in the revised Figure 1B, 1E, 1F, and 3B.
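      The normalization described above can be sketched as follows. This is a minimal illustration, assuming that the value of 1 is assigned to the mean weight of the baseline group; the group labels, the weights, and the function name `relative_weights` are hypothetical and are not taken from the study.

```python
from statistics import mean


def relative_weights(weights_by_group, baseline):
    """Express every tumor weight relative to the mean of the baseline group.

    Assumption: the baseline group's mean is the value set to 1.
    """
    base = mean(weights_by_group[baseline])
    return {
        group: [w / base for w in weights]
        for group, weights in weights_by_group.items()
    }


# Hypothetical tumor weights in grams (illustrative only).
toy_weights = {
    "rag1_ko_dmso": [0.8, 1.0, 1.2],
    "rag1_ko_citalopram": [0.5, 0.7],
    "wt_citalopram": [0.2, 0.4],
}
relative = relative_weights(toy_weights, baseline="rag1_ko_dmso")
for group, values in relative.items():
    print(group, [round(v, 2) for v in values])
```

      Because every group is divided by the same baseline mean, relative values preserve the ratios between groups but not the absolute scale, which is why reporting absolute weights alongside them is informative.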

      (2) The authors used shSlc6a4 tumor cell lines to demonstrate that citalopram's effects are independent of the conventional SERT receptor (Figure 1C-F). However, this does not entirely exclude the possibility that SERT may still play a role in this context, as it can be expressed in other cells within the tumor microenvironment. What is the expression profiling of Slc6a4 in the HCC tumor microenvironment? In addition, in Figure 1F, the tumor growth of shSlc6a4 in C57 mice displayed a decreased trend, suggesting a possible role of Slc6a4. 

      As suggested, we probed the expression pattern of SERT in HCC and its tumor microenvironment. Using a single-cell sequencing dataset of HCC (GSE125449), we revealed that SERT is also expressed by T cells, tumor-associated endothelial cells, and cancer-associated fibroblasts (see revised Figure S2G). Therefore, we cannot fully rule out the possibility that citalopram may influence these cellular components within the TME and contribute to its therapeutic effects. In the revised manuscript, we have included and discussed this result. In Figure 1F, SERT knockdown led to a 9% reduction in tumor growth; however, this difference was not statistically significant (0.619 ± 0.099 g vs. 0.594 ± 0.129 g; p = 0.75).

      (3) Why did the authors choose to study phagocytosis in Figures 3G-H? As an important player, TAM regulates tumor growth via various mechanisms. 

      We chose to investigate phagocytosis because citalopram targets C5aR1-expressing TAMs. C5aR1 is a receptor for the complement component C5a, which plays a crucial role in mediating the phagocytosis process in macrophages. In the revised manuscript, we have highlighted this rationale.

      (4) The information on unchanged deposition of C5a has been mentioned in this manuscript (Figures 3D and 3F), the authors should explain further in the manuscript, for example, C5a could bind to receptors other than C5aR1 and/or C5a bind to C5aR1 by different docking anchors compared with citalopram.

      Thank you for your insightful comment. In Figure 3D, tumor growth was attenuated in C5ar1<sup>-/-</sup> recipients compared with C5ar1<sup>-/-</sup> recipients, whereas C5a deposition remained unchanged. This suggests that while C5a is still present, its interaction with C5aR1 is critical for influencing tumor growth dynamics. In Figure 3F, C5a deposition was not affected by citalopram treatment. Indeed, docking analysis and the DARTS assay revealed that citalopram binds to the D282 site of C5aR1. A previous report has shown that mutations at E199 and D282 reduce C5a binding affinity to C5aR1 (PMID: 37169960). Therefore, the impact of citalopram is primarily on C5a/C5aR1 interactions and downstream signaling pathways, rather than on altering C5a levels. In the revised manuscript, we have included this interpretation.

      (5) Figure 3I-M - the flow cytometry data suggested that citalopram treatment altered the proportions of total TAM, M1 and M2 subsets, CD4<sup>+</sup> and CD8<sup>+</sup>T cells, DCs, and B cells. Why does the author conclude that the enhanced phagocytosis of TAM was one of the major mechanisms of citalopram? As the overall TAM number was regulated, the contribution of phagocytosis to tumor growth may be limited. 

      We thank the reviewer for the valuable input. Indeed, recent studies have demonstrated that targeting C5aR1<sup>+</sup> TAMs can induce many anti-tumor effects, such as macrophage polarization and CD8<sup>+</sup> T cell infiltration (PMID: 30300579, PMID: 38331868, and PMID: 38098230). In the revised manuscript, we have clarified our conclusion to better articulate the relationship between citalopram treatment, TAM populations, and their phagocytic activity, with particular emphasis on the role of CD8<sup>+</sup> T cells. For macrophage phagocytosis, one possible explanation is that citalopram targets C5aR1 to enhance macrophage phagocytosis and subsequent antigen presentation and/or cytokine production, which promotes T cell recruitment and activity and modulates other aspects of tumor immunity. Given that the anti-tumor effects of citalopram are largely dependent on CD8<sup>+</sup> T cells, we conclude that CD8<sup>+</sup> T cells are essential for the effector mechanisms of citalopram.

      (6) Figure 4 - what is the rationale for using the MASH-associated HCC mouse model to study metabolic regulation in CD8<sup>+</sup> T cells? The tumor microenvironment and tumor growth would be quite different. In addition, how does this part link up with the mechanisms related to C5aR1 and TAM? The authors also brought GLUT1 back in the last part and focused on CD8<sup>+</sup> T cell metabolism, which was totally separated from previous data. 

      We chose the MASH-associated HCC mouse model because it closely mimics the etiology of metabolic-associated fatty liver disease (MAFLD), which is a significant contributor to the development of cirrhosis and HCC. In addition to the MASH-associated HCC mouse model, the study also incorporated the orthotopic Hepa1-6 tumor model. In our previous publication (Dong et al., Cell Reports 2024), we employed both of these HCC models. Therefore, we utilized the same two mouse models in this study. The inclusion of CD8<sup>+</sup> T cells in our study is based on the understanding that citalopram targets GLUT1, which plays a crucial role in glucose uptake (PMID: 39388353). CD8<sup>+</sup> T cell function is heavily reliant on glycolytic metabolism, making it essential to investigate how citalopram’s effects on GLUT1 influence the metabolic pathways and functionality of these immune cells. In this study, we identified that the primary glucose transporter in CD8<sup>+</sup> T cells is GLUT3, rather than GLUT1. The data presented in Figure 4 aim to illustrate the additional effect of citalopram on peripheral 5-HT levels, which, in turn, influences CD8<sup>+</sup> T cell functionality. By linking these findings, we clarify how citalopram impacts both TAMs and CD8<sup>+</sup> T cells. CD8<sup>+</sup> T cells can be influenced by citalopram through various mechanisms, including TAM-dependent mechanisms, reduced systemic serum 5-HT concentrations, and unidentified direct effects. In the revised manuscript, we have enhanced the background information to avoid any gaps.

      (7) Figure 5, the authors illustrated their mechanism that citalopram regulates CD8<sup>+</sup> T cell anti-tumor immunity through proinflammatory TAM with no experimental evidence. Using only CD206 and MHCII to represent TAM subsets obviously is not sufficient. 

      Thank you for your valuable comments. As noted by the reviewer, TAMs can influence CD8<sup>+</sup> T cell anti-tumor immunity through various mechanisms. In this study, we focused on elucidating the impact of citalopram on pro-inflammatory TAMs, which in turn affect CD8<sup>+</sup> T cell anti-tumor immunity and ultimately influence tumor outcomes. Therefore, in the mechanistic diagram, we highlighted the effect of citalopram on pro-inflammatory TAMs, while the causal relationship between TAMs and CD8<sup>+</sup> T cell anti-tumor immunity was indicated with a dotted line due to the limited evidence presented in this study. Additionally, we have expanded our discussion on how citalopram regulates CD8<sup>+</sup> T cell anti-tumor immunity through pro-inflammatory TAMs.

      For the analysis of TAMs, we initially sorted CD45<sup>+</sup>F4/80<sup>+</sup>CD11b<sup>+</sup> cells and assessed M1/M2 polarization by measuring CD206 and MHCII expression. As an added strength, we isolated TAMs from the orthotopic GLUT1<sup>KD</sup> Hepa1-6 model using CD11b microbeads and conducted real-time qPCR analysis of M1-oriented (Il6, Ifnb1, and Nos2) and M2-oriented (Mrc1, Il10, and Arg1) markers. Consistent with our flow cytometry data, the qPCR results confirmed that citalopram induces a pro-inflammatory TAM phenotype (revised Figure S9A).

      Reviewer #2 (Public review): Summary: 

      Dong et al. present a thorough investigation into the potential of repurposing citalopram, an SSRI, for hepatocellular carcinoma (HCC) therapy. The study highlights the dual mechanisms by which citalopram exerts anti-tumor effects: reprogramming tumor-associated macrophages (TAMs) toward an anti-tumor phenotype via C5aR1 modulation and suppressing cancer cell metabolism through GLUT1 inhibition while enhancing CD8+ T cell activation. The findings emphasize the potential of drug repurposing strategies and position C5aR1 as a promising immunotherapeutic target. However, certain aspects of experimental design and clinical relevance could be further developed to strengthen the study's impact. 

      We thank the reviewer for the thoughtful review and constructive feedback. As suggested, we have made improvements based on the feedback provided.

      Strength: 

      It provides detailed evidence of citalopram's non-canonical action on C5aR1, demonstrating its ability to modulate macrophage behavior and enhance CD8+ T cell cytotoxicity. The use of DARTS assays, in silico docking, and gene signature network analyses offers robust validation of drug-target interactions. Additionally, the dual focus on immune cell reprogramming and metabolic suppression presents a thorough strategy for HCC therapy. By emphasizing the potential for existing drugs like citalopram to be repurposed, the study also underscores the feasibility of translational applications. 

      We sincerely appreciate the reviewer’s recognition of the detailed evidence supporting citalopram’s non-canonical action on C5aR1, along with the innovative methodologies employed and the promising potential for repurposing existing drugs in HCC therapy.

      Major weaknesses/suggestions: 

      The dataset and signature database used for GSEA analyses are not clearly specified, limiting reproducibility. The manuscript does not fully explore the potential promiscuity of citalopram's interactions across GLUT1, C5aR1, and SERT1, which could provide a deeper understanding of binding selectivity. The absence of GLUT1 knockdown or knockout experiments in macrophages prevents a complete assessment of GLUT1's role in macrophage versus tumor cell metabolism. Furthermore, there is minimal discussion of clinical data on SSRI use in HCC patients. Incorporating survival outcomes based on SSRI treatment could strengthen the study's translational relevance. 

      By addressing these limitations, the manuscript could make an even stronger contribution to the fields of cancer immunotherapy and drug repurposing. 

      We appreciate the reviewer’s valuable suggestions. As suggested, we have included the following revisions:

      (a) GSEA analyses: For GSEA analyses, we conducted RNA sequencing (RNA-seq) analysis on HCC-LM3 cells treated with citalopram or fluvoxamine, which led to the identification of 114 differentially expressed genes (DEGs; 80 co-upregulated and 34 co-downregulated), as reported previously (PMID: 39388353). These DEGs were then utilized to create an SSRI-related gene signature. Subsequently, we analyzed RNA-seq data from liver HCC (LIHC) samples in The Cancer Genome Atlas (TCGA) cohort, comprising 371 samples, categorizing them into high and low expression groups based on the median expression levels of each candidate target gene (such as C5AR1). Finally, we performed GSEA on the grouped samples (C5AR1-high versus C5AR1-low) using the SSRI-related gene signature. In the revised manuscript, we have included this information in the “Materials and Methods” section.

      (b) Exploration of binding selectivity: We acknowledge the importance of exploring the potential promiscuity of citalopram’s interactions across GLUT1, C5aR1, and SERT1. While we cannot provide further experimental data to support this aspect, we have included the following points in the revised manuscript: 1) We emphasize the significance of exploring the relative binding affinities of citalopram to GLUT1, C5aR1, and SERT, as varying affinities could influence the drug’s overall efficacy. As highlighted in the current manuscript and our previous publication (PMID: 39388353), citalopram interacts with C5aR1 and GLUT1 through distinct binding sites and mechanisms, whereas its interaction with SERT is characterized by a more direct inhibition of serotonin binding (PMID: 27049939). To gain deeper insights into these interactions, employing techniques such as surface plasmon resonance or biolayer interferometry could provide valuable quantitative data on binding kinetics and affinities for each target. 2) We discuss how citalopram’s interactions with multiple targets may contribute to its therapeutic effects, particularly in the context of immune modulation and tumor progression. The potential for citalopram to exhibit diverse mechanisms of action through its interactions with these proteins warrants further investigation. A comprehensive understanding of these pathways could lead to the development of improved therapeutic strategies.

      (c) GLUT1 knockdown in macrophages: In the revised manuscript, we revealed that TAMs predominantly express GLUT3 but not GLUT1 (Figures S8B and S8C). GLUT1 knockdown in THP-1 cells did not significantly impact their glycolytic metabolism (Figure S8D), whereas GLUT3 knockdown led to a marked reduction in glycolysis in THP-1 cells.

      (d) Clinical data on SSRI use in HCC patients: We have previously reported that SSRI use is associated with reduced disease progression in HCC patients (PMID: 39388353; Cell Rep. 2024 Oct 22;43(10):114818), as detailed below:

      “We determined whether SSRIs for alleviating HCC are supported by real-world data. A total of 3061 patients with liver cancer were extracted from the Swedish Cancer Register. Among them, 695 patients had been administrated with post-diagnostic SSRIs. The Kaplan-Meier survival analysis suggested that patients who utilized SSRIs exhibited a significantly improved metastasis-free survival compared to those who did not use SSRIs, with a P value of log-rank test at 0.0002. Cox regression analysis showed that SSRI use was associated with a lower risk of metastasis (HR = 0.78; 95% CI, 0.62-0.99)”.

      Reviewer #1 (Recommendations for the authors):

      (1) Add experiments to address the questions listed in the weaknesses.

      As suggested, related experiments have been performed to strengthen the conclusions.

      (2) It would be appreciated to show the expression profile of SERT or employ KO mouse models to eliminate the effect of SERT.

      As suggested, analysis of a single-cell sequencing dataset of HCC (GSE125449) revealed that SERT is expressed not only in HCC cells but also in T cells, tumor-associated endothelial cells, and cancer-associated fibroblasts (Figure S2G). Consistently, SERT has been reported as an immune checkpoint restricting CD8 T cell antitumor immunity (PMID: 40403728). Furthermore, SERT KO mice (Cyagen Biosciences, S-KO-02549) were employed to investigate the effects of citalopram. However, the Slc6a4 gene knockout in mice resulted in a significant decrease in 5-HT levels in the brain and a lack of cortical columnar structures. Importantly, the mice exhibited an intolerance to citalopram treatment. Therefore, we did not pursue further investigation into the effects of citalopram in SERT KO mice.

      (3) Due to the concern of specificity and animal health, it would be more direct if the authors could use, for example, C5ar1-fl/fl x Adgre1-Cre mouse models.

      Thank you for your valuable suggestion. We fully agree with your comment regarding the value of introducing C5ar1-fl/fl and Adgre1-Cre mouse models, along with the necessary experimental setups, to substantiate this point. However, in our study, the C5ar1 KO mice exhibited normal overall appearance and viability, indicating that the model is generally healthy. Furthermore, we have validated the specific role of C5aR1 in macrophages through bone marrow reconstitution experiments, reinforcing the importance of C5aR1 in these cells. Therefore, we chose the current model to balance experimental effectiveness with considerations for animal health.

      (4) For example, a GSEA or GO analysis of comparison of macrophages from C5ar1-/- or C5ar1+/- mice may point to the enriched pathway of phagocytosis in macrophages derived from C5ar1-/- rather than C5ar1+/- mice, and this information is helpful for the integrity of this work. Besides, it would be more reliable if a nucleus staining is included in Figures 3G and 3H.

      As suggested, macrophages were isolated from tumor-bearing C5ar1<sup>-/-</sup> and C5ar1<sup>+/-</sup> mice and subsequently analyzed using RNA sequencing. The Gene Set Enrichment Analysis (GSEA) revealed a significant enrichment of the phagocytosis pathway in macrophages derived from C5ar1<sup>-/-</sup> mice compared to those from C5ar1<sup>+/-</sup> mice (see revised Figure S6A). While we acknowledge that adding nuclear staining would enhance reliability, we would like to point out that this style of presentation is also commonly found in articles related to phagocytosis. Furthermore, this experiment involved a significant number of experimental mice, and in accordance with the 3Rs principle for animal experiments, we did not obtain additional sorted TAMs to perform the phagocytosis assay. Thank you for your understanding.

      (5) In line 122, there is a typo, and it should be 'analysis'.

      Thank you for pointing out the typo. It has been corrected to "analysis" in the revised manuscript.

      (6) In line 217, there is no causal relationship between the contexts, and using 'as a result' may lead to misunderstanding.

      As suggested, ‘as a result’ has been removed to avoid any misunderstanding.

      (7) In line 322, please make sure if it should be HBS or PBS.

      It is PBS, and revisions have been made.

      (8) Figure S7, the calculation of cell proportions needs to use a consistent denominator.

      As suggested, we calculated cell proportions using a consistent denominator (CD45<sup>+</sup> cells).
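      A minimal sketch of such a calculation follows, assuming each population is reported as a percentage of the CD45<sup>+</sup> parent gate. The event counts, the population labels, and the function name `percent_of_parent` are hypothetical.

```python
def percent_of_parent(counts, parent="CD45+"):
    """Report every population as a percentage of one shared parent gate."""
    total = counts[parent]
    if total <= 0:
        raise ValueError("parent gate contains no events")
    return {
        population: 100 * n / total
        for population, n in counts.items()
        if population != parent
    }


# Hypothetical flow cytometry event counts for one tumor sample.
toy_counts = {"CD45+": 10000, "TAM": 2500, "CD8+ T": 1200, "B": 300}
percentages = percent_of_parent(toy_counts)
print(percentages)
```

      Using one parent gate for every subset keeps the reported percentages directly comparable across populations and treatment groups.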

      (9) Figure 4C, label error.

      Thanks for your careful review. It has been corrected to "MASH".

      Reviewer #2 (Recommendations for the authors):

      Dong et al. present compelling evidence for repurposing citalopram, a selective serotonin reuptake inhibitor (SSRI), as a potential therapeutic for hepatocellular carcinoma (HCC). While the concept of SSRI repurposing is not novel, this manuscript provides valuable insights into the drug's dual mechanisms: targeting tumor-associated macrophages (TAMs) via C5aR1 modulation and enhancing CD8+ T cell activity, alongside inhibiting cancer cell metabolism through GLUT1 suppression. The findings underscore the promise of drug repurposing strategies and identify C5aR1 as a noteworthy immunotherapeutic target. Addressing the following points will enhance the manuscript's impact and relevance to cancer immunotherapy.

      Specific Comments:

      (1) The authors identify C5aR1 on TAMs as a direct target of citalopram, independent of its classical SERT target, using drug-induced gene signature network analysis and co-immunofluorescence of CD163+ macrophages with C5aR1. The DARTS assay further supports the binding of C5aR1 to citalopram, complemented by in silico docking analysis adapted from their previous GLUT1 study. Since GLUT1 and SERT1 are transporter proteins while C5aR1 is a GPCR, these heterogeneous binding interactions suggest potential promiscuity in SSRI-target engagement.

      (a) Figure 2A: The authors identify C5aR1 as a target using GSEA but do not specify the dataset used (e.g., cancer or immune cells) or the signature database consulted. Providing this context would enhance reproducibility.

      For GSEA, we performed RNA sequencing (RNA-seq) on HCC-LM3 cells treated with citalopram or fluvoxamine and identified 114 differentially expressed genes (DEGs), which included 80 genes that were co-upregulated and 34 that were co-downregulated, as previously documented (PMID: 39388353). These DEGs were subsequently used to develop an SSRI-related gene signature. We then employed the RNA-seq data from liver hepatocellular carcinoma (LIHC) samples within The Cancer Genome Atlas (TCGA) cohort, which included 371 samples. HCC samples in the TCGA cohort were categorized into high and low expression groups based on the median expression levels of each candidate target gene, such as C5AR1. Finally, we conducted GSEA on the grouped samples (such as C5AR1-high versus C5AR1-low) using the SSRI-related gene signature. For reproducibility, detailed information has been added to the “Materials and Methods” section of the revised manuscript.
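      The median-based grouping step can be sketched as follows. The sketch covers only the high/low split, not the GSEA itself, and it assumes that samples exactly at the median are placed in the low group, a detail the response does not specify. The sample identifiers, the expression values, and the function name `split_by_median` are hypothetical.

```python
from statistics import median


def split_by_median(expression):
    """Split samples into high and low groups at the median expression.

    `expression` maps sample ID -> expression of one candidate gene.
    Assumption: samples exactly at the median go to the low group.
    """
    cutoff = median(expression.values())
    high = sorted(s for s, value in expression.items() if value > cutoff)
    low = sorted(s for s, value in expression.items() if value <= cutoff)
    return high, low


# Hypothetical C5AR1 expression values for four tumor samples.
toy_expression = {
    "sample_a": 1.0,
    "sample_b": 2.0,
    "sample_c": 3.0,
    "sample_d": 4.0,
}
high_group, low_group = split_by_median(toy_expression)
print("high:", high_group)
print("low:", low_group)
```

      The two groups would then serve as the phenotype labels for an enrichment test of the SSRI-related gene signature.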

      (b) Figure 2F: Given citalopram's reported role in inhibiting GLUT1, a comparative discussion on the relative contributions of GLUT1 inhibition versus C5aR1 modulation in tumor suppression is warranted. Performing a DARTS assay for GLUT1 in THP-1 cells, which express high GLUT1 levels and exhibit upregulation in M1 macrophages (https://doi.org/10.1038/s41467-022-33526-z), would clarify SSRI interactions with macrophage metabolism.

      As suggested, we first investigated citalopram treatment in THP-1 cells. The results showed that the glycolytic metabolism of THP-1 cells remained largely unaffected following citalopram treatment, as evidenced by glucose uptake, lactate release, and extracellular acidification rate (ECAR) (Figure S8A). Next, we mined a single-cell sequencing dataset of HCC and revealed that TAMs predominantly express GLUT3 but not GLUT1 (Figure S8B). Consistently, Western blotting analysis showed a higher expression of GLUT3 and minimal levels of GLUT1 in THP-1 cells (Figure S8C). In addition, it has been well documented that GLUT1 expression increases after M1 polarization stimuli and GLUT3 expression increases after M2 stimulation in macrophages (PMID: 37721853, PMID: 36216803). GLUT1 knockdown in THP-1 cells did not significantly impact their glycolytic metabolism (Figure S8D), whereas GLUT3 knockdown led to a marked reduction in glycolysis in THP-1 cells. Based on these findings, we conclude that the effects of citalopram on macrophages are primarily mediated through targeting C5aR1 rather than GLUT1.

      (c) Figures 2H-I: A comparison of drug-protein interactions across GLUT1, C5aR1, and SERT1 would be valuable to identify potential shared or distinct binding features.

      Citalopram exhibits distinct binding characteristics across its various targets, including GLUT1, C5aR1, and its classical target, SERT. In the case of C5aR1, our in silico docking analysis identified two key binding conformations at the orthosteric site. The interactions involved significant electrostatic contacts between citalopram’s amino group and negatively charged residues like E199 and D282. Notably, D282’s accessibility and orientation towards the binding cavity suggest it plays a crucial role in citalopram binding, highlighting the importance of specific amino acid interactions at this site. For GLUT1 (PMID: 39388353), citalopram’s interaction also demonstrated notable hydrophobic contacts, particularly through the fluorophenyl group with residues V328, P385, and L325. The cyanophthalane group penetrated the substrate-binding cavity, indicating that citalopram could occupy a binding site similar to that of glucose, which is distinct from the binding mechanism observed in C5aR1. The involvement of E380 in both poses for GLUT1 further emphasizes the role of electrostatic interactions in mediating citalopram’s binding to this transporter. In contrast, for SERT (PMID: 27049939), citalopram locks the transporter in an outward-open conformation by occupying the central binding site, which is located between transmembrane helices 1, 3, 6, 8 and 10. This binding directly obstructs serotonin from accessing its binding site, illustrating a more definitive blockade mechanism. Additionally, the allosteric site at SERT, positioned between extracellular loops 4 and 6 and transmembrane helices 1, 6, 10, and 11, enhances this blockade by sterically hindering ligand unbinding, thus providing a clear explanation for the allosteric modulation of serotonin transport. In summary, while citalopram interacts with C5aR1 and GLUT1 through distinct binding sites and mechanisms, its interaction with SERT is characterized by a more straightforward blockade of serotonin binding. The unique structural and functional attributes of each target highlight the versatility of citalopram and suggest that its pharmacological effects may vary significantly depending on the specific protein being targeted. We have included this detailed information in the revised manuscript.

      (2) The manuscript presents evidence that citalopram reprograms TAMs to an anti-tumor phenotype, enhancing their phagocytic capacity.

      (a) Bone Marrow Reconstitution Experiments (Figure 3): The use of donor (dC5aR1) and recipient (rC5aR1) mice is significant but requires clarification. Explicitly defining donor and recipient terminology and including a schematic of the experimental design would improve reader comprehension.

      We appreciate your valuable feedback. As suggested, the terminology for donor (dC5aR1) and recipient (rC5aR1) mice was defined: “we injected GLUT1<sup>KD</sup> Hepa1-6 cells into syngeneic recipient C5ar1<sup>-/-</sup> (rC5ar1<sup>-/-</sup> ) mice that had been reconstituted with donor C5ar1<sup>+/-</sup> (dC5ar1<sup>+/-</sup>) or C5ar1<sup>-/-</sup> (dC5ar1<sup>-/-</sup>) bone marrow (BM) cells to analyze the therapeutic effect of citalopram”. Additionally, we have included a schematic of the experimental design to enhance reader comprehension (see revised Figure 3E).

      (b) GLUT1 Knockdown (KD) Tumor Cells: While GLUT1 KD tumor cells are utilized, the authors do not assess GLUT1 KD or knockout (KO) in macrophages. Testing the effect of citalopram on macrophages with GLUT1 KO/KD would help determine the relative importance of C5aR1 versus GLUT1 in mediating SSRI effects.

      As noted above, GLUT1 knockdown in THP-1 cells did not significantly alter their glycolytic metabolism (Figure S8D). This observation can be explained by the predominant expression of GLUT3 in TAMs rather than GLUT1 (Figures S8B and S8C). Indeed, knockdown of GLUT3 led to a significant reduction in glycolysis in THP-1 cells (Figure S8C).

      (c) C5aR1's Pro-Tumoral Role: The authors state that C5aR1 fosters an immunosuppressive microenvironment but omit a discussion of current literature on C5aR1's pro-tumoral role (e.g., https://doi.org/10.1038/s41467-024-48637-y, https://www.nature.com/articles/s41419-024-06500-4, https://doi.org/10.1016/j.ymthe.2023.12.010). Including this background in both the introduction and discussion would contextualize their findings.

      Thanks for your valuable feedback. As suggested, we have revised the manuscript to include discussions on C5aR1’s pro-tumoral role, referencing the suggested studies in both the introduction and discussion sections for better context. As detailed below:

      (1) Targeting C5aR1<sup>+</sup> TAMs effectively reverses tumor progression and enhances anti-tumor response;

      (2) Targeting C5aR1 reprograms TAMs from a protumor state to an antitumor state, promoting the secretion of CXCL9 and CXCL10 while facilitating the recruitment of cytotoxic CD8<sup>+</sup> T cells;

      (3) Moreover, citalopram induces TAM phenotypic polarization towards an M1 proinflammatory state, which supports anti-tumor immune response within the TME.

      (d) C5aR1 Expression in TAMs: Is C5aR1 expression constitutive in TAMs? Further details on C5aR1 expression dynamics in TAMs under different conditions could strengthen the discussion. Public datasets on TAMs in various states (e.g., https://www.nature.com/articles/s41586-023-06682-5, https://www.cell.com/cell/abstract/S0092-8674(19)31119-5, https://pubmed.ncbi.nlm.nih.gov/36657444/) may offer useful insights.

      Thank you for your valuable suggestions. As suggested, we investigated the expression patterns of C5aR1 in TAMs using a HCC cohort (http://cancer-pku.cn:3838/HCC/). In the study conducted by Qiming Zhang et al. (PMID: 31675496), six distinct macrophage subclusters were identified, with M4-c1-THBS1 and M4-c2-C1QA showing significant enrichment in tumor tissues. M4-c1-THBS1 was enriched with signatures indicative of myeloid-derived suppressor cells (MDSCs), while M4-c2-C1QA exhibited characteristics that resembled those of TAMs as well as M1 and M2 macrophages. Our subsequent analysis revealed that C5aR1 is highly expressed in these two clusters, while expression levels in the other macrophage clusters were notably lower (see revised Figure S3).

      (3) The manuscript shows that citalopram-induced reductions in systemic serotonin levels enhance CD8+ T cell activation and cytotoxicity, as evidenced by increased glycolytic metabolism and elevated IFN-γ, TNF-α, and GZMB expression.

      (a) How is CD8+ T cell activation achieved in serotonin-deficient environments?

      As reported (PMID: 34524861), one possible explanation is that serotonin may enhance PD-L1 expression on cancer cells, thereby impairing CD8<sup>+</sup> T cell function. A deficiency of serotonin in the tumor microenvironment can delay tumor growth by promoting the accumulation and effector functions of CD8<sup>+</sup> T cells while reducing PD-L1 expression. In addition to the SERT-mediated transport and 5-HT receptor signaling, CD8<sup>+</sup> T cells can express TPH1 (PMID: 38215751, PMID: 40403728), enabling them to synthesize endogenous 5-HT, which enhances their activity through serotonylation-dependent mechanisms (PMID: 38215751). In the revised manuscript, we have incorporated these interpretations.

      (4) Suggestions for the model figure revision-C5aR1 in TAMs without Citalopram (Figure 5).

      (a) Including a control scenario depicting receptor status and function in TAMs without citalopram treatment would provide a clearer baseline for understanding citalopram's effects.

      Thank you for your valuable input regarding the model figure revision. We have included a revised mechanism model that depicts the receptor status and function of C5aR1 in TAMs without citalopram treatment, as you suggested.

      (5) Suggestions for addressing clinical relevance.

      The study predominantly uses preclinical mouse models, although some human HCC data is analyzed (Figures 2B and 3O). However, there is no discussion of clinical data on SSRI use in HCC patients.

      Incorporating an analysis of patient survival outcomes based on SSRI treatment (e.g., https://pmc.ncbi.nlm.nih.gov/articles/PMC5444756/, https://pmc.ncbi.nlm.nih.gov/articles/PMC10483320/) would enhance the translational relevance of the findings.

      Previously, we reported that the use of SSRIs is associated with reduced disease progression in HCC patients, based on real-world data from the Swedish Cancer Register (PMID: 39388353). As suggested, we have further discussed the clinical relevance of SSRIs in the revised manuscript. As detailed below:

      “In a study involving 308,938 participants with HCC, findings indicated that the use of antidepressants following an HCC diagnosis was linked to a decreased risk of both overall mortality and cancer-specific mortality (PMID: 37672269). These associations were consistently observed across various subgroups, including different classes of antidepressants and patients with comorbidities such as hepatitis B or C infections, liver cirrhosis, and alcohol use disorders. Similarly, our analysis of real-world data from the Swedish Cancer Register demonstrated that SSRIs are correlated with slower disease progression in HCC patients (PMID: 39388353). Given these insights, antidepressants, especially SSRIs, show significant potential as anticancer therapies for individuals diagnosed with HCC”.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      The authors examine the neural correlates of face recognition deficits in individuals with Developmental Prosopagnosia (DP; 'face blindness'). Contrary to theories that poor face recognition is driven by reduced spatial integration (via smaller receptive fields), here the authors find that the properties of receptive fields in face-selective brain regions are the same in typical individuals vs. those with DP. The main analysis technique is population Receptive Field (pRF) mapping, with a wide range of measures considered. The authors report that there are no differences in goodness-of-fit (R2), the properties of the pRFs (neither size, location, nor the gain and exponent of the Compressive Spatial Summation model), nor their coverage of the visual field. The relationship of these properties to the visual field (notably the increase in pRF size with eccentricity) is also similar between the groups. Eye movements do not differ between the groups.

      Strengths:

      Although this is a null result, the large number of null results gives confidence that there are unlikely to be differences between the two groups. Together, this makes a compelling case that DP is not driven by differences in the spatial selectivity of face-selective brain regions, an important finding that directly informs theories of face recognition. The paper is well written and enjoyable to read, the studies have clearly been carefully conducted with clear justification for design decisions, and the analyses are thorough.

      Weaknesses:

      One potential issue relates to the localisation of face-selective regions in the two groups. As in most studies of the neural basis of face recognition, localisers are used to find the face-selective Regions of Interest (ROIs) - OFA, mFus, and pFus, with comparison to the scene-selective PPA. To do so, faces are contrasted against other objects to find these regions (or scenes vs. others for the PPA). The one consistent difference that does emerge between groups in the paper is in the selectivity of these regions, which are less selective for faces in DP than in typical individuals (e.g., Figure 1B), as one might expect. 6/20 prosopagnosic individuals are also missing mFus, relative to only 2/20 typical individuals. This, to me, raises the question of whether the two groups are being compared fairly. If the localised regions were smaller and/or displaced in the DPs, this might select only a subset of the neural populations typically involved in face recognition. Perhaps the difference between groups lies outside this region. In other words, it could be that the differences in prosopagnosic face recognition lie in the neurons that are not able to be localised by this approach. The authors consider in the discussion whether their DPs may not have been 'true DPs', which is convincing (p. 12). The question here is whether the regions selected are truly the 'prosopagnosic brain areas' or whether there is a kind of survivor bias (i.e., the regions selected are normal, but perhaps the difference lies in the nature/extent of the regions). At present, the only consideration given to explain the differences in prosopagnosia is that there may be 'qualitative' differences between the two (which may be true), but I would give more thought to this.

      We acknowledge that face-selective ROIs in DPs, relative to controls, may be smaller, less selective, or altogether missing when traditional methods of localization with fixed thresholds are used (Furl et al, 2011). For this reason - to circumvent potential survivor bias and ensure ROI voxel counts across participants are equated - we used a method of ROI definition whereby each subject’s individual statistical map from the localizer was intersected with a generously-sized group mask for each ROI and the top 20% most category-selective voxels were retained for the pRF analysis (Norman-Haignere et al., 2013; Jiahui et al., 2018). This means that the raw number of voxels per ROI was equal across all participants with respect to the common group space, thereby ensuring a fair comparison even in cases where one group shows diminished category-selectivity. The details of the ROI definition are provided in the Methods at the end of the manuscript. To ensure readers understand our approach, we will also make more explicit mention of this in the main body of the manuscript. 
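
      The ROI definition described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `select_top_voxels`, the flat per-voxel lists, and the toy values are all hypothetical. It shows why ranking within a fixed group mask yields the same voxel count for every participant, even when one participant's selectivity is uniformly weaker.

```python
def select_top_voxels(selectivity, group_mask, fraction=0.2):
    """Indices of the most category-selective voxels inside a group mask.

    selectivity: per-voxel localizer statistic (e.g. a faces > objects t-value)
    group_mask:  per-voxel booleans marking the group-defined ROI mask
    fraction:    share of in-mask voxels to keep (0.2 = top 20%)
    """
    in_mask = [i for i, inside in enumerate(group_mask) if inside]
    n_keep = max(1, round(fraction * len(in_mask)))
    # Rank in-mask voxels from most to least selective and keep the top share.
    ranked = sorted(in_mask, key=lambda i: selectivity[i], reverse=True)
    return sorted(ranked[:n_keep])


# Hypothetical toy data: 12 voxels, of which the first 10 lie inside the mask.
mask = [True] * 10 + [False] * 2
control = [5.1, 4.8, 0.3, 1.2, 3.9, 0.1, 2.2, 0.8, 1.5, 0.4, 9.0, 9.0]
dp = [v * 0.5 for v in control]  # uniformly weaker selectivity

print(select_top_voxels(control, mask))
print(select_top_voxels(dp, mask))
```

      Because the cut is a rank within the mask rather than a fixed statistical threshold, a global reduction in selectivity changes neither the number nor, in this toy case, the identity of the selected voxels.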

      With regard to the question of whether face-selective ROIs may be displaced in DPs compared to controls, previous work from the senior author’s lab (Jiahui et al., 2018) shows that, despite exhibiting weaker activations, the peak coordinates of significant clusters in DPs occupy very similar locations to those of controls. And, even if there were indeed slight displacements of face-selective ROIs for some subjects, the group-defined masks used in the present analysis were large enough to capture the majority of the top voxels. In the supplemental materials section, we will include a diagram of the group masks used in our study.

      The reviewer here also points out that more DPs than controls were missing the mFUS region (6/20 DPs vs 2/20 controls; Figure 1C). However, ‘missing’ in this context was not based on face-selectivity but rather a lack of retinotopic tuning. pRFs were fit to all voxels within each ROI - with all subjects starting out with equal voxel counts - and thereafter, voxels for which the variance explained by the pRF model was below 20% were excluded from subsequent analysis. We decided that any ROI with fewer than 10 voxels remaining after thresholding on the pRF fit should be deemed ‘missing’ since we considered the amount of data insufficient to reliably characterize the region’s retinotopic profile. While it may be somewhat interesting that four more DPs than controls were ‘missing’ left mFUS, using this particular set of decision criteria, it is important to keep in mind that left mFUS was just one of six face-selective regions under study. The other five regions, many of which evinced strong fits by the pRF model, were represented comparably in DPs and controls and showed high similarity in the pRF parameters. Furthermore, across most participants, mFUS exhibited a low proportion of retinotopically modulated voxels (defined as voxels with pRF R squared greater than 20%, see Figure 1D). A follow-up analysis showed that the count of voxels surviving pRF R squared thresholding in left mFUS was not significantly correlated with mean pRF size (r(30)=0.23, t=1.28, p=0.21), indicating that the greater exclusion of DPs in this region is unlikely to have biased the group’s average pRF size.
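
      The exclusion rule and the reported correlation statistic can be checked with a short sketch. The helper names and toy R² values below are hypothetical; only the 20% variance-explained cutoff, the 10-voxel minimum, and the values r = 0.23 with 30 degrees of freedom come from the text. Whether a voxel at exactly 20% is kept is an assumption here (it is kept).

```python
import math


def surviving_voxels(r2_values, r2_threshold=0.20):
    """Indices of voxels whose pRF model fit meets the variance-explained cutoff."""
    return [i for i, r2 in enumerate(r2_values) if r2 >= r2_threshold]


def roi_is_missing(r2_values, r2_threshold=0.20, min_voxels=10):
    """An ROI counts as 'missing' when fewer than min_voxels survive thresholding."""
    return len(surviving_voxels(r2_values, r2_threshold)) < min_voxels


def t_from_r(r, df):
    """Convert a Pearson correlation to its t statistic: t = r * sqrt(df / (1 - r^2))."""
    return r * math.sqrt(df / (1.0 - r ** 2))


# Hypothetical ROIs: 9 versus 10 well-fit voxels among otherwise poorly fit ones.
print(roi_is_missing([0.25] * 9 + [0.05] * 30))
print(roi_is_missing([0.25] * 10 + [0.05] * 30))

# Reported correlation between surviving voxel count and mean pRF size.
print(round(t_from_r(0.23, 30), 2))
```

      With the rounded value r = 0.23, the conversion gives t of about 1.29, which agrees with the reported t = 1.28 up to rounding of r.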

      The discussion considers the differences between the current study and an unpublished preprint (Witthoft et al, 2016), where DPs were found to have smaller pRFs than typical individuals. The discussion presents the argument that the current results are likely more robust, given the use of images within the pRF mapping stimuli here (faces, objects, etc) as opposed to checkerboards in the prior work, and the use of the CSS model here as opposed to a linear Gaussian model previously. This is convincing, but fails to address why there is a lack of difference in the control vs. DP group here. If anything, I would have imagined that the use of faces in mapping stimuli would have promoted differences between the groups (given the apparent difference in selectivity in DPs vs. controls seen here), which adds to the reliability of the present result. Greater consideration of why this should have led to a lack of difference would be ideal. The latter point about pRF models (Gaussian vs. CSS) does seem pertinent, for instance - could the 'qualitative' difference lead to changes in the shape of these pRFs in prosopagnosia that are better characterised by the CSS model, perhaps? Perhaps more straightforwardly, and related to the above, could differences in the localisation of face-selective regions have driven the difference in prior work compared to here?

      We agree that the use of high-level mapping stimuli (including faces) adds to the reliability of the present results for DPs and could have further emphasized differences between the groups if true differences did, in fact, exist. We speculate on the extent to which the type of mapping stimuli and various other methodological factors (e.g. stimulus size, aperture design, pRF model) could have explained the divergent findings in our study versus that of Witthoft et al. (2016) in the section of the Discussion titled, “What factors may have contributed to the different results for the present study and Witthoft et al. (2016)”. In brief, our use of more colorful, naturalistic stimuli targeting higher-level visual areas elicited better model fits than the black and white checkerboard pattern used by Witthoft et al. (2016). The CSS model we used is better suited for higher-level regions and makes fewer assumptions than the linear pRF model. The field of view of our stimulus was smaller but still relevant for real-world perception of faces. Finally, our aperture design and longer run length likely also improved reliability. Overall, these methodological improvements, along with our larger sample size, provide stronger evidence for our findings. These are our best attempts to make sense of the divergent findings, but it is not possible to come to a definitive explanation. Examples abound of exaggerated or spurious effects from small-scale studies that ultimately fail to replicate in the related field of dyslexia research (Jednorog et al., 2015; Ramus et al., 2018) and neuroimaging research more generally (Turner et al., 2018; Poldrack et al., 2017). Sometimes there are clear explanations for a lack of replicability (e.g. software bugs, overly flexible preprocessing methods, etc.), but many times the real reason cannot be determined.

      Regarding the type of pRF model deployed, our use of a non-linear exponent (versus a linear model as in the Witthoft et al. (2016) preprint) is unlikely to explain the similarity we observed between the groups in terms of pRF size. Specifically, the groups did not show substantial differences in the exponent by ROI, as seen in Figure 1E, so the use of a linear model should, in theory, produce similar outcomes for the two groups. We will mention this point in the main text.
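
      For readers unfamiliar with the two model classes, the following sketch contrasts a linear pRF with the Compressive Spatial Summation form. It assumes the standard CSS formulation (an isotropic 2D Gaussian weighting of the stimulus followed by a power-law exponent, with effective pRF size taken as sigma divided by the square root of the exponent). The function names, the sparse stimulus representation, and all numbers are hypothetical and are not taken from the authors' code.

```python
import math


def gaussian_weight(x, y, x0, y0, sigma):
    """Unnormalised isotropic 2D Gaussian centred on (x0, y0)."""
    return math.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2 * sigma ** 2))


def css_response(stimulus, x0, y0, sigma, gain, exponent):
    """Predicted response: gain * (sum of stimulus * Gaussian) ** exponent.

    stimulus maps (x, y) positions to contrast values; exponent = 1 gives
    the linear pRF model, exponent < 1 gives compressive summation.
    """
    drive = sum(c * gaussian_weight(x, y, x0, y0, sigma)
                for (x, y), c in stimulus.items())
    return gain * drive ** exponent


def effective_size(sigma, exponent):
    """Effective pRF size under the CSS model: sigma / sqrt(exponent)."""
    return sigma / math.sqrt(exponent)


# Hypothetical stimuli: one patch at the pRF centre, then a second patch added.
small = {(0, 0): 1.0}
large = {(0, 0): 1.0, (1, 0): 1.0}

linear_ratio = css_response(large, 0, 0, 1.0, 1.0, 1.0) / css_response(small, 0, 0, 1.0, 1.0, 1.0)
compressive_ratio = css_response(large, 0, 0, 1.0, 1.0, 0.5) / css_response(small, 0, 0, 1.0, 1.0, 0.5)
print(linear_ratio, compressive_ratio)
print(effective_size(2.0, 0.25))
```

      If two groups share similar exponents, as the response states for Figure 1E, the same scaling applies to both groups, which is the intuition behind the argument that the choice of model should not by itself create or hide a group difference in pRF size.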

      Finally, the lack of variations in the spatial properties of these brain regions is interesting in light of the theories that spatial integration is a key aspect of effective face recognition. In this context, it is interesting to note the marked drop in R2 values in face-selective regions like mFus relative to earlier cortex. The authors note in some sense that this is related to the larger receptive field size, but is there a broader point here that perhaps the receptive field model (even with Compressive Spatial Summation) is simply a poor fit for the function of these areas? Could it be that these areas are simply not spatial at all? A broader link between the null results presented here and their implications for theories of face recognition would be ideal.

      The weaker pRF fits found in mFUS, to us, raise the question of whether there is a more effective pRF stimulus for these more anterior regions. For example, it might be possible to obtain higher and more reliable responses there using single isolated faces (Cf. Kay, Weiner, Grill-Spector, 2015). More broadly, though, we agree that it is important to acknowledge that the receptive field model might ultimately be a coarse and incomplete characterization of neural function in these areas. As the other reviewer suggests, one possibility is that other brain processes (e.g. functional or structural connectivity between ROIs) may give rise to holistic face processing in ways that are not captured by pRF properties.

      Reviewer #2 (Public review):

      Summary:

      This is a well-conducted and clearly written manuscript addressing the link between population receptive fields (pRFs) and visual behavior. The authors test whether developmental prosopagnosia (DP) involves atypical pRFs in face-selective regions, a hypothesis suggested by prior work with a small DP sample. Using a larger cohort of DPs and controls, robust pRF mapping with appropriate stimuli and CSS modeling, and careful in-scanner eye tracking, the authors report no group differences in pRF properties across the visual processing hierarchy. These results suggest that reduced spatial integration is unlikely to account for holistic face processing deficits in DP.

      Strengths:

      The dataset quality, sample size, and methodological rigor are notable strengths.

      Weaknesses:

      The primary concern is the interpretation of the results.

      (1) Relationship between pRFs and spatial integration

      While atypical pRF properties could contribute to deficits in spatial integration, impairments in holistic processing in DPs are not necessarily caused by pRF abnormalities. The discussion could be strengthened by considering alternative explanations for reduced spatial integration, such as altered structural or functional connectivity in the face network, which has been reported to underlie DP's difficulties in integrating facial features.

      We agree the Discussion section could benefit from mentioning that alterations to other neural mechanisms, besides pRF organization, could produce deficits in holistic processing. This could take the form of altered functional connectivity (Rosenthal et al., 2017; Lohse et al., 2016; Avidan et al., 2014) or altered structural connectivity (Gomez et al., 2015; Song et al., 2015).

      (2) Beyond the null hypothesis testing framework

      The title claims "normal spatial integration," yet this conclusion is based on a failure to reject the null hypothesis, which does not justify accepting the alternative hypothesis. To substantiate a claim of "normal," the authors would need to provide analyses quantifying evidence for the absence of effects, e.g., using a Bayesian framework.

      We acknowledge that, using frequentist statistical methods, failing to reject the null hypothesis is not sufficient to claim equivalence. For the revision, we will look into additional analyses that could quantify evidence for the null hypothesis. And we will adjust the wording of the title in this regard.
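
      One way such evidence could be quantified is sketched below. The response does not say which analysis will be used, so this is purely an illustrative assumption: it applies the BIC approximation to the Bayes factor for a two-group t-test, BF01 ≈ sqrt(N) * (1 + t²/df)^(-N/2), where N is the total sample size and df = N - 2. The function name and the example t value are hypothetical and are not results from the study.

```python
import math


def bf01_from_t(t, n_total):
    """BIC-approximate Bayes factor in favour of the null for a two-group t-test.

    Values above 1 favour the null hypothesis (no group difference);
    values below 1 favour the alternative.
    """
    df = n_total - 2
    return math.sqrt(n_total) * (1.0 + t ** 2 / df) ** (-n_total / 2.0)


# Hypothetical example: two groups of 20 participants and a small t statistic.
print(bf01_from_t(0.5, 40))
```

      Unlike a non-significant p-value, a Bayes factor of this kind expresses how much more likely the data are under the null than under the alternative. The approximation is coarse, and a default-prior Bayesian t-test or an equivalence test would be reasonable alternatives.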

      (3) Face-specific or broader visual processing

      Prior work from the senior author's lab (Jiahui et al., 2018) reported pronounced reductions in scene selectivity and marginal reductions in body selectivity in DPs, suggesting that visual processing deficits in DPs may extend beyond faces. While the manuscript includes PPA as a high-level control region for scene perception, scene selectivity was not directly reported. The authors could also consider individual differences and potential data-quality confounds (tSNR difference between and within groups, several obvious outliers in the figures, etc). For instance, examining whether reduced tSNR in DPs contributed to lower face selectivity in the DP group in this dataset.

      Thank you for this suggestion - we will compare tSNR between the groups as a measure of data quality and we will include these comparisons. A preliminary look indicates that both groups possessed similar distributions of tSNR across many of the face-selective regions investigated here.

      (4) Linking pRF properties to behavior

      The manuscript aims to examine the relationship between pRF properties and behavior, but currently reports only one aspect of pRF (size) in relation to a single behavioral measure (CFMT), without full statistical reporting:

      "We found no significant association between participants' CFMT scores and mean pRF size in OFA, pFUS, or mFUS."

      For comprehensive reporting, the authors could examine additional pRF properties (e.g., center, eccentricity, scaling between eccentricity and pRF size, shape of visual field coverage, etc), additional ROIs (early, intermediate, and category-selective areas), and relate them to multiple behavioral measures (e.g., HEVA, PI20, FFT). This would provide a full picture of how pRF characteristics relate to behavioral performance in DP.

      We will report the full statistical values (r, p) for the (albeit non-significant) relationship between CFMT score and pRF size - thank you for bringing that to our attention. Additionally, we will add other analyses assessing the relationship between a wider array of pRF measures and the other behavioral tests administered to provide a more comprehensive picture of the relation between pRFs and behavior.

      References:

      Avidan, G., Tanzer, M., Hadj-Bouziane, F., Liu, N., Ungerleider, L. G., & Behrmann, M. (2014). Selective Dissociation Between Core and Extended Regions of the Face Processing Network in Congenital Prosopagnosia. Cerebral Cortex, 24(6), 1565–1578. https://doi.org/10.1093/cercor/bht007

      Furl, N., Garrido, L., Dolan, R. J., Driver, J., & Duchaine, B. (2011). Fusiform gyrus face selectivity relates to individual differences in facial recognition ability. Journal of Cognitive Neuroscience, 23(7), 1723–1740. https://doi.org/10.1162/jocn.2010.21545

      Gomez, J., Pestilli, F., Witthoft, N., Golarai, G., Liberman, A., Poltoratski, S., Yoon, J., & Grill-Spector, K. (2015). Functionally Defined White Matter Reveals Segregated Pathways in Human Ventral Temporal Cortex Associated with Category-Specific Processing. Neuron, 85(1), 216–227. https://doi.org/10.1016/j.neuron.2014.12.027

      Jednoróg, K., Marchewka, A., Altarelli, I., Monzalvo Lopez, A. K., van Ermingen-Marbach, M., Grande, M., Grabowska, A., Heim, S., & Ramus, F. (2015). How reliable are gray matter disruptions in specific reading disability across multiple countries and languages? Insights from a large-scale voxel-based morphometry study. Human Brain Mapping, 36(5), 1741–1754. https://doi.org/10.1002/hbm.22734

      Jiahui, G., Yang, H., & Duchaine, B. (2018). Developmental prosopagnosics have widespread selectivity reductions across category-selective visual cortex. Proceedings of the National Academy of Sciences of the United States of America, 115(28), E6418–E6427. https://doi.org/10.1073/pnas.1802246115

      Kay, K. N., Weiner, K. S., & Grill-Spector, K. (2015). Attention Reduces Spatial Uncertainty in Human Ventral Temporal Cortex. Current Biology, 25(5), 595–600. https://doi.org/10.1016/j.cub.2014.12.050

      Lohse, M., Garrido, L., Driver, J., Dolan, R. J., Duchaine, B. C., & Furl, N. (2016). Effective connectivity from early visual cortex to posterior occipitotemporal face areas supports face selectivity and predicts developmental prosopagnosia. Journal of Neuroscience, 36(13), 3821–3828. https://doi.org/10.1523/JNEUROSCI.3621-15.2016

      Norman-Haignere, S., Kanwisher, N., & McDermott, J. H. (2013). Cortical pitch regions in humans respond primarily to resolved harmonics and are located in specific tonotopic regions of anterior auditory cortex. Journal of Neuroscience, 33(50), 19451–19469. https://doi.org/10.1523/JNEUROSCI.2880-13.2013

      Poldrack, R. A., Baker, C. I., Durnez, J., Gorgolewski, K. J., Matthews, P. M., Munafò, M. R., Nichols, T. E., Poline, J. B., Vul, E., & Yarkoni, T. (2017). Scanning the horizon: Towards transparent and reproducible neuroimaging research. Nature Reviews Neuroscience, 18(2), 115–126. https://doi.org/10.1038/nrn.2016.167

      Ramus, F., Altarelli, I., Jednoróg, K., Zhao, J., & Scotto di Covella, L. (2018). Neuroanatomy of developmental dyslexia: Pitfalls and promise. Neuroscience and Biobehavioral Reviews, 84(July 2017), 434–452. https://doi.org/10.1016/j.neubiorev.2017.08.001

      Rosenthal, G., Tanzer, M., Simony, E., Hasson, U., Behrmann, M., & Avidan, G. (2017). Altered topology of neural circuits in congenital prosopagnosia. ELife, 6, 1–20. https://doi.org/10.7554/eLife.25069

      Song, S., Garrido, L., Nagy, Z., Mohammadi, S., Steel, A., Driver, J., Dolan, R. J., Duchaine, B., & Furl, N. (2015). Local but not long-range microstructural differences of the ventral temporal cortex in developmental prosopagnosia. Neuropsychologia, 78, 195–206. https://doi.org/10.1016/j.neuropsychologia.2015.10.010

      Turner, B. O., Paul, E. J., Miller, M. B., & Barbey, A. K. (2018). Small sample sizes reduce the replicability of task-based fMRI studies. Communications Biology, 1(1). https://doi.org/10.1038/s42003-018-0073-z

      Witthoft, N., Poltoratski, S., Nguyen, M., Golarai, G., Liberman, A., LaRocque, K., Smith, M., & Grill-Spector, K. (2016). Reduced spatial integration in the ventral visual cortex underlies face recognition deficits in developmental prosopagnosia. BioRxiv, 1–26.

    1. Author response:

      We would like to thank the reviewers for their valuable feedback on this research.

      Based on the limitations identified across the reviews, we will make four major revisions to this work. We will: (1) run a multi-step experiment to better test the successor representation framework and the predictions made by our model simulations; (2) include a task to explicitly gauge participants’ judgements about the relatedness of the robot features; (3) test additional computational models that may better capture participants’ behavior; and (4) clarify and expand the definition of the inductive bias studied in this work.

      (1) The reviews raised the concern that while we frame our results as being about predictive learning within the successor representation framework, we investigated participants’ behavior on a one-step task that is not well suited to characterizing this form of predictive representation. Moreover, our simulations make predictions about how learning may differ in relatively more naturalistic environments, yet we do not test human participants in these more complex learning contexts. Finally, we found several null results for effects that were predicted by our simulations. This may be because the benefits of the bias are predicted to be more limited in simpler learning environments, and our experiment may not have been sufficiently powered to detect these smaller effects. To address these limitations, we will run a new experiment with a multi-step causal structure, allowing us to better test the SR framework while more comprehensively investigating the predictions of the simulations and improving our power to detect effects that were null in the one-step experiment.
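
      The difference between a one-step and a multi-step structure can be made concrete with a small sketch of the successor representation (SR). It assumes the standard definition, in which the SR is the discounted sum of powers of the transition matrix, M = Σ_t γ^t T^t. The toy environments, discount factor, and function names are hypothetical and unrelated to the robot task itself.

```python
def matmul(a, b):
    """Plain-Python matrix product of two square or compatible matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def successor_representation(T, gamma, horizon=50):
    """Discounted expected future state occupancy: sum over t of gamma^t * T^t."""
    n = len(T)
    M = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # t = 0 term
    power = [row[:] for row in M]
    for t in range(1, horizon + 1):
        power = matmul(power, T)
        for i in range(n):
            for j in range(n):
                M[i][j] += (gamma ** t) * power[i][j]
    return M


# One-step structure: state 0 leads to terminal state 1.
one_step = [[0.0, 1.0],
            [0.0, 0.0]]
# Multi-step structure: state 0 leads to state 1, which leads to terminal state 2.
two_step = [[0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0],
            [0.0, 0.0, 0.0]]

print(successor_representation(one_step, 0.9))
print(successor_representation(two_step, 0.9)[0])
```

      In the one-step case the SR reduces to the identity plus the discounted one-step transitions, so it carries no information beyond direct associations. Only in the multi-step case does state 0 acquire a discounted expectation of a state it never transitions to directly, which is the kind of prediction a multi-step experiment can test.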

      (2) We argued that the causal-bias parameter may capture idiosyncratic differences in participants’ semantic memory that had an ensuing effect on their learning. However, the reviews identified that we did not explicitly measure participants’ judgements about the relatedness of the robot features to verify that existing conceptual knowledge drove these individual differences. In the new experiment, we will therefore include a task to quantify participants’ individual judgements about the relatedness of the robot features.

      (3) The reviews questioned the suitability of the feature-based model for explaining behavior in the task given that only a subset of participants were best fit by the model, and not all of the model’s behavioral predictions were observed in the human subjects experiment. The reviews suggested alternative models could more validly capture behavior. In the revision, we will therefore consider alternative models (e.g., model-based planning, successor features with decay on weak associations).

      (4) The reviews requested some clarity around our conceptualization of the inductive bias studied in this work, and questioned whether the task sufficiently captured the richness of semantic knowledge that may be required for a “semantic bias.” We acknowledge that the term semantic bias may not be an accurate descriptor of the inductive bias we measured. Instead, a more general “conceptual bias” term may better capture how any hierarchical conceptual knowledge – semantic or otherwise – may drive the studied bias. We will clarify our terminology in the revision.

      In addition to these major revisions, we will address more minor critiques and suggestions raised by individual reviewers.

    1. Author response:

      We thank you and the reviewers for the careful assessment and for the thoughtful public reviews of our manuscript. We are encouraged that the novelty of the observations and the systematic nature of our approach are recognised, and we fully appreciate the concerns raised regarding potential artefacts and the incompletely defined mechanism.

      (1) Context for funding (Reviewer #2)

      In response to Reviewer #2’s note that this study is personally funded by one of the authors, we would like to provide some context. When we first observed that high-NaCl treatment caused a reversible loss of activation-loop phospho-signal for PKN1, we recognised its potential importance and submitted grant applications specifically to investigate this phenomenon. Unfortunately, these applications were not funded. As a result, as Reviewer #2 correctly points out, we have continued this work only modestly, using a personal donation from one of the authors to the university.

      Our initial view that this phenomenon merited detailed study was based mainly on three points:

      (i) Phosphorylation of the activation-loop threonine is critical for the catalytic activity of these kinases.

      (ii) In previous work on PKN, no stress signal had been identified that could induce such a prominent and rapid change in activation-loop threonine phosphorylation.

      (iii) Although the phenomenon was originally detected under high Na⁺ conditions, if it simply reflected the balance between phosphorylation and dephosphorylation, then it seemed plausible that more physiological changes in ion concentrations might drive signals in cells.

      To explore point (iii), we initially attempted to define the ion concentrations that trigger dephosphorylation under conditions where re-phosphorylation was blocked. However, even with potent kinase inhibitors, we were unable to prevent recovery of the phospho-signal. This unexpected result prompted us to investigate the underlying mechanism of this unusual behaviour in more depth.

      (2) Hidden artefacts and mass-spectrometric approaches

      We fully share the reviewers’ concern, expressed as “We remain concerned about hidden artifacts.” Throughout this work, we have repeatedly asked ourselves whether the phenomenon could arise from something as trivial as an artefact inherent to immunoblotting or from an unrecognised flaw in our experimental design, or whether it might ultimately be explainable in terms of the conventional rules of protein phosphorylation and dephosphorylation.

      To capture the phenomenon from an additional, independent angle, we agree with the reviewers’ suggestion to attempt mass spectrometry–based analysis. However, there are several substantial technical hurdles:

      (i) At present, the phenomenon strictly requires the presence of animal cell extracts; we have not been able to reproduce it in their absence.

      (ii) When we attempt to repurify the activation-loop fragments after ion treatment, the phosphate group is re-acquired during the wash steps, even when we use the same high-salt buffer employed for ion treatment.

      (iii) In global phosphoproteomic analyses, reliably detecting a specific change in phosphorylation at a defined site is technically demanding and costly.

      We therefore hope to identify conditions under which we can both (a) preserve the phosphorylation state established by the ion treatment during sample handling, and (b) achieve sufficient purification for informative mass spectrometric analysis. Reviewer #3 raised an important question regarding the origin of the two bands observed in Figure 6C. At present, we do not have data that would allow us to address this point in a well-founded manner. We hope that successful mass spectrometric analysis will also enable us to comment more concretely on this issue.

      (3) Role of PP2A and reconstitution experiments

      As emphasised by Reviewers #1 and #3, although PP2A appears to be essential for the phenomenon, we have not yet been able to formulate a mechanistically plausible model that incorporates PP2A in a satisfactory way, and we share the reviewers’ concern on this point. We performed preliminary in vitro reconstitution experiments using recombinant PP2A purified from Sf9 cells (comprising the catalytic C subunit, the scaffold A subunit, and GST-fused PR130 as a B subunit) together with purified PKN1 activation loop fragments, to test whether the phenomenon can be reconstituted under low- and high-KCl conditions. Under the conditions tested so far, we have not yet succeeded in reconstituting the salt-dependent loss and recovery of activation loop phosphorylation. In vivo, PP2A holoenzymes exhibit substantial diversity in their subunit composition, particularly in the B subunit, and it is therefore unclear whether the particular complex we used is the one responsible for the behaviour observed in lysates. We plan to test additional PP2A complexes and, in parallel, to examine the effect of adding bacterial cell extracts (which by themselves do not induce changes in activation-loop phosphorylation in our system) in order to determine whether additional eukaryotic factors are required for reconstitution.

      Through these experiments, we hope to move closer to constructing a mechanistic scheme that explicitly includes PP2A and clarifies its role in this unusual process of phosphate loss and reacquisition.

      We are grateful for the constructive feedback and believe these planned revisions will strengthen the clarity, balance, and rigour of our study.

    1. Author response:

      The following is the authors’ response to the current reviews.

      I thank the authors for their clarifications. The manuscript is much improved now, in my opinion. The new power spectral density plots and revised Figure 1 are much appreciated. However, there is one remaining point that I am unclear about. In the rebuttal, the authors state the following: "To directly address the question of whether the auditory signal was distracting, we conducted a follow-up MEG experiment. In this study, we observed a significant reduction in visual accuracy during the second block when the distractor was present (see Fig. 7B and Suppl. Fig. 1B), providing clear evidence of a distractor cost under conditions where performance was not saturated." 

      I am very confused by this statement, because both Fig. 7B and Suppl. Fig. 1B show that the visual- (i.e., visual target presented alone) has a lower accuracy and longer reaction time than visual+ (i.e., visual target presented with distractor). In fact, Suppl. Fig. 1B legend states the following: "accuracy: auditory- - auditory+: M = 7.2 %; SD = 7.5; p = .001; t(25) = 4.9; visual- - visual+: M = -7.6%; SD = 10.80; p < .01; t(25) = -3.59; Reaction time: auditory- - auditory +: M = -20.64 ms; SD = 57.6; n.s.: p = .08; t(25) = -1.83; visual- - visual+: M = 60.1 ms ; SD = 58.52; p < .001; t(25) = 5.23)." 

      These statements appear to directly contradict each other. I appreciate that the difficulty of auditory and visual trials in block 2 of MEG experiments are matched, but this does not address the question of whether the distractor was actually distracting (and thus needed to be inhibited by occipital alpha). Please clarify.

      We apologize for mixing up the visual and auditory distractor cost in our rebuttal. The reviewer is right in that our two statements contradict each other.

      To clarify: In the EEG experiment, we see significant distractor cost for auditory distractors in the accuracy (which can be seen in SUPPL Fig. 1A). We also see a faster reaction time with auditory distractors, which may speak to intersensory facilitation. As we used the same distractors for both experiments, it can be assumed that they were distracting in both experiments.

      In our follow-up MEG-experiment, as the reviewer stated, performance in block 2 was higher than in block 1, even though there were distractors present. In this experiment, distractor cost and learning effects are difficult to disentangle. It is possible that participants improved over time for the visual discrimination task in Block 1, as performance at the beginning was quite low. To illustrate this, we divided the trials of each condition into bins of 10 and plotted the mean accuracy in these bins over time (see Author response image 1). Here it can be seen that in Block 2, there is a more or less stable performance over time with a variation < 10 %. In Block 1, both for visual as well as auditory trials, an improvement over time can be seen. This is especially strong for visual trials, which span a difference of > 20%. Note that the mean performance for the 80-90 trial bin was higher than any mean performance observed in Block 2. 
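
      The binning described above can be sketched as follows. This is a minimal illustration rather than the analysis code used for Author response image 1: the function name `binned_accuracy`, its `bin_size` argument, and the toy trial outcomes are assumptions introduced for the example.

```python
def binned_accuracy(correct, bin_size=10):
    """Mean accuracy (in %) for consecutive, non-overlapping bins of trials.

    `correct` holds 0/1 outcomes in presentation order. Trailing trials
    that do not fill a complete bin are dropped.
    """
    n_bins = len(correct) // bin_size
    accuracies = []
    for i in range(n_bins):
        chunk = correct[i * bin_size:(i + 1) * bin_size]
        accuracies.append(100.0 * sum(chunk) / len(chunk))
    return accuracies


# Hypothetical outcomes for one participant in one condition (1 = correct).
trials = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1,   # trials 1-10
          1, 1, 0, 1, 1, 0, 1, 1, 1, 1,   # trials 11-20
          1, 1, 1, 1, 0, 1, 1, 1, 1, 1]   # trials 21-30
acc = binned_accuracy(trials)
```

      Averaging such per-participant curves within each block and condition would give a time course of the kind shown in the image; dropping incomplete bins is one reason a minimum trial count per condition is needed.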

      Additionally, the same paradigm has been applied in previous investigations, which also found distractor costs for the here-used auditory stimuli in blocked and non-blocked designs. See:

      Mazaheri, A., van Schouwenburg, M. R., Dimitrijevic, A., Denys, D., Cools, R., & Jensen, O. (2014). Region-specific modulations in oscillatory alpha activity serve to facilitate processing in the visual and auditory modalities. NeuroImage, 87, 356–362. https://doi.org/10.1016/j.neuroimage.2013.10.052

      Van Diepen, R., & Mazaheri, A. (2017). Cross-sensory modulation of alpha oscillatory activity: suppression, idling and default resource allocation. European Journal of Neuroscience, 45(11), 1431–1438. https://doi.org/10.1111/ejn.13570

      Author response image 1.

      Accuracy development over time in the MEG experiment. During block 1, a performance increase over time can be observed for visual as well as for auditory stimuli. During Block 2, performance is stable over time. Data are presented as mean ± SEM. N = 27 (one participant was excluded from this analysis, as their trial count in at least one condition was below 90 trials).


      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review):

      In this study, Brickwedde et al. leveraged a cross-modal task where visual cues indicated whether upcoming targets required visual or auditory discrimination. Visual and auditory targets were paired with auditory and visual distractors, respectively. The authors found that during the cue-to-target interval, posterior alpha activity increased along with auditory and visual frequency-tagged activity when subjects were anticipating auditory targets. The authors conclude that their results disprove the alpha inhibition hypothesis, and instead imply that alpha "regulates downstream information transfer." However, as I detail below, I do not think the presented data irrefutably disprove the alpha inhibition hypothesis. Moreover, the evidence for the alternative hypothesis of alpha as an orchestrator for downstream signal transmission is weak. Their data serve to refute only the most extreme and physiologically implausible version of the alpha inhibition hypothesis, which assumes that alpha completely disengages the entire brain area, inhibiting all neuronal activity.

      We thank the reviewer for taking the time to provide additional feedback and suggestions and we improved our manuscript accordingly.

      (1) Authors assign specific meanings to specific frequencies (8-12 Hz alpha, 4 Hz intermodulation frequency, 36 Hz visual tagging activity, 40 Hz auditory tagging activity), but the results show that spectral power increases in all of these frequencies towards the end of the cue-to-target interval. This result is consistent with a broadband increase, which could simply be due to additional attention required when anticipating auditory target (since behavioral performance was lower with auditory targets, we can say auditory discrimination was more difficult). To rule this out, authors will need to show a power spectral density curve with specific increases around each frequency band of interest. In addition, it would be more convincing if there was a bump in the alpha band, and distinct bumps for 4 vs 36 vs 40 Hz band.

      This is an interesting point with several aspects, which we will address separately.

      Broadband Increase vs. Frequency-Specific Effects:

      The suggestion that the observed spectral power increases may reflect a broadband effect rather than frequency-specific tagging is important. However, Supplementary Figure 11 shows no difference between expecting an auditory or visual target at 44 Hz. This demonstrates that (1) there is no uniform increase across all frequencies, and (2) the separation between our stimulation frequencies was sufficient to allow differentiation using our method.

      Task Difficulty and Performance Differences:

      The reviewer suggests that the observed effects may be due to differences in task difficulty, citing lower performance when anticipating auditory targets in the EEG study. This issue was explicitly addressed in our follow-up MEG study, where stimulus difficulty was calibrated. In the second block—used for analysis—accuracy between auditory and visual targets was matched (see Fig. 7B). The replication of our findings under these controlled conditions directly rules out task difficulty as the sole explanation. This point is clearly presented in the manuscript.

      Power Spectrum Analysis:

      The reviewer’s suggestion that our analysis lacks evidence of frequency-specific effects is addressed directly in the manuscript. While we initially used the Hilbert method to track the time course of power fluctuations, we also included spectral analyses to confirm distinct peaks at the stimulation frequencies. Specifically, when averaging over the alpha cluster, we observed a significant difference at 10 Hz between auditory and visual target expectation, with no significant differences at 36 or 40 Hz in that cluster. Conversely, in the sensor cluster showing significant 36 Hz activity, alpha power did not differ, but both 36 Hz and 40 Hz tagging frequencies showed significant effects. These findings clearly demonstrate frequency-specific modulation and are already presented in the manuscript.

      (2) For visual target discrimination, behavioral performance with and without the distractor is not statistically different. Moreover, the reaction time is faster with distractor. Is there any evidence that the added auditory signal was actually distracting?

      We appreciate the reviewer’s observation regarding the lack of a statistically significant difference in behavioral performance for visual target discrimination with and without the auditory distractor. While this was indeed the case in our EEG experiment, we believe the absence of an accuracy effect may be attributable to a ceiling effect, as overall visual performance approached 100%. This high baseline likely masked any subtle influence of the distractor.

      To directly address the question of whether the auditory signal was distracting, we conducted a follow-up MEG experiment. In this study, we observed a significant reduction in visual accuracy during the second block when the distractor was present (see Fig. 7B and Suppl. Fig. 1B), providing clear evidence of a distractor cost under conditions where performance was not saturated.

      Regarding the faster reaction times observed in the presence of the auditory distractor, this phenomenon is consistent with prior findings on intersensory facilitation. Auditory stimuli, which are processed more rapidly than visual stimuli, can enhance response speed to visual targets—even when the auditory input is non-informative or nominally distracting (Nickerson, 1973; Diederich & Colonius, 2008; Salagovic & Leonard, 2021). Thus, while the auditory signal may facilitate motor responses, it can simultaneously impair perceptual accuracy, depending on task demands and baseline performance levels.

      Taken together, our data suggest that the auditory signal does exert a distracting influence, particularly under conditions where visual performance is not at ceiling. The dual effect—facilitated reaction time but reduced accuracy—highlights the complexity of multisensory interactions and underscores the importance of considering both behavioral and neurophysiological measures.

      (3) It is possible that alpha does suppress task-irrelevant stimuli, but only when it is distracting. In other words, perhaps alpha only suppresses distractors that are presented simultaneously with the target. Since the authors did not test this, they cannot irrefutably reject the alpha inhibition hypothesis.

      The reviewer’s claim that we did not test whether alpha suppresses distractors presented simultaneously with the target is incorrect. As stated in the manuscript and supported by our data (see point 2), auditory distractors were indeed presented concurrently with visual targets, and they were demonstrably distracting. Therefore, the scenario the reviewer suggests was not only tested—it forms a core part of our design.

      Furthermore, it was never our intention to irrefutably reject the alpha inhibition hypothesis. Rather, our aim was to revise and expand it. If our phrasing implied otherwise, we have now clarified this in the manuscript. Specifically, we propose that alpha oscillations:

      (a) Exhibit cyclic inhibitory and excitatory dynamics;

      (b) Regulate processing by modulating transfer pathways, which can result in either inhibition or facilitation depending on the network context.

      In our study, we did not observe suppression of distractor transfer, likely due to the engagement of a supramodal system that enhances both auditory and visual excitability. This interpretation is supported by prior findings (e.g., Jacoby et al., 2012), which show increased visual SSEPs under auditory task load, and by Zhigalov et al. (2020), who found no trial-by-trial correlation between alpha power and visual tagging in early visual areas, despite a general association with attention.

      Recent evidence (Clausner et al., 2024; Yang et al., 2024) further supports the notion that alpha oscillations serve multiple functional roles depending on the network involved. These roles include intra- and inter-cortical signal transmission, distractor inhibition, and enhancement of downstream processing (Scheeringa et al., 2012; Bastos et al., 2015; Zumer et al., 2014). We believe the most plausible account is that alpha oscillations support both functions, depending on context.

      To reflect this more clearly, we have updated Figure 1 to present a broader signal-transfer framework for alpha oscillations, beyond the specific scenario tested in this study.

      We have now revised Figure 1 and several sentences in the introduction and discussion, to clarify this argument.

      L35-37: Previous research gave rise to the prominent alpha inhibition hypothesis, which suggests that oscillatory activity in the alpha range (~10 Hz) plays a mechanistic role in selective attention through functional inhibition of irrelevant cortical areas (see Fig. 1; Foxe et al., 1998; Jensen & Mazaheri, 2010; Klimesch et al., 2007).

      L60-65: In contrast, we propose that functional and inhibitory effects of alpha modulation, such as distractor inhibition, are exhibited through blocking or facilitating signal transmission to higher order areas (Peylo et al., 2021; Yang et al., 2023; Zhigalov & Jensen, 2020; Zumer et al., 2014), gating feedforward or feedback communication between sensory areas (see Fig. 1; Bauer et al., 2020; Haegens et al., 2015; Uemura et al., 2021).

      L482-485: This suggests that responsiveness of the visual stream was not inhibited when attention was directed to auditory processing and was not inhibited by occipital alpha activity, which directly contradicts the proposed mechanism behind the alpha inhibition hypothesis.

      L517-519: Top-down cued changes in alpha power have now been widely viewed to play a functional role in directing attention: the processing of irrelevant information is attenuated by increasing alpha power in areas involved with processing this information (Foxe, Simpson, & Ahlfors, 1998; Hanslmayr et al., 2007; Jensen & Mazaheri, 2010).

      L566-569: As such, it is conceivable that alpha oscillations can in some cases inhibit local transmission, while in other cases, depending on network location, connectivity and demand, alpha oscillations can facilitate signal transmission. This mechanism makes it possible to increase transmission of relevant information and to block transmission of distractors.

      (4) In the abstract and Figure 1, the authors claim an alternative function for alpha oscillations; that alpha "orchestrates signal transmission to later stages of the processing stream." In support, the authors cite their result showing that increased alpha activity originating from early visual cortex is related to enhanced visual processing in higher visual areas and association areas. This does not constitute a strong support for the alternative hypothesis. The correlation between posterior alpha power and frequency-tagged activity was not specific in any way; Fig. 10 shows that the correlation appeared on both 1) anticipating-auditory and anticipating-visual trials, 2) the visual tagged frequency and the auditory tagged activity, and 3) was not specific to the visual processing stream. Thus, the data is more parsimonious with a correlation than a causal relationship between posterior alpha and visual processing.

      Again, the reviewer raises important points, which we want to address.

      The correlation between posterior alpha power and frequency-tagged activity was not specific, as it is present both when auditory and visual targets are expected:

      If there is a connection between posterior alpha activity and higher-order visual information transfer, then it can be expected that this relationship holds across conditions and that higher alpha activity is accompanied by higher frequency-tagged activity, both over trials and over conditions. However, it is possible that when alpha activity is lower, such as when expecting a visual target, the signal-to-noise ratio is affected, which may make it more difficult to detect a correlation in the data when using non-invasive measurements.

      The connection between alpha activity and frequency-tagged activity appears both for auditory as well as visual stimuli, and the correlation is not specific to the visual processing stream:

      While we do see differences between conditions (e.g. in the EEG-analysis, mostly 36 Hz correlated with alpha activity and only in one condition 40 Hz showed a correlation as well), it is true that in our MEG analysis, we found correlations both between alpha activity and 36 Hz as well as alpha activity and 40 Hz.  

      We acknowledge that when analysing frequency-tagged activity on a trial-by-trial basis, where removal of non-timelocked activity through averaging (which we did when we tested for condition differences in Fig. 4 and 9) is not possible, there is uncertainty in the data. Baseline correction can alleviate this issue, but it cannot rule out the possibility of non-specific effects. We therefore decided to repeat the analysis using power computed with a fast Fourier transform (FFT) instead of Hilbert power, in favour of a higher and stricter frequency resolution; because we averaged over a time period, the time domain was not relevant for this analysis. In this more conservative analysis, only 36 Hz tagged activity when expecting an auditory target correlated with early visual alpha activity.
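
      To illustrate why a Fourier-based estimate over a fixed window separates the tagging frequencies sharply, the following minimal sketch evaluates a single-bin discrete Fourier sum at 36, 40 and 44 Hz. It is not the analysis pipeline used in the manuscript: the sampling rate, window length, toy signal and the helper `power_at` are assumptions for the example, and in practice an FFT routine from a numerical library would be used instead of the explicit sum.

```python
import cmath
import math


def power_at(signal, fs, freq):
    """Power of `signal` at `freq` (Hz) from a single-bin discrete Fourier sum.

    With this normalisation, a sinusoid of amplitude A yields A**2 when
    `freq` is a multiple of the frequency resolution fs / len(signal).
    """
    n = len(signal)
    coef = sum(x * cmath.exp(-2j * math.pi * freq * k / fs)
               for k, x in enumerate(signal))
    return (2.0 * abs(coef) / n) ** 2


fs = 500          # assumed sampling rate in Hz
n_samples = 500   # assumed 1 s window, giving 1 Hz frequency resolution
t = [k / fs for k in range(n_samples)]

# Toy single-trial signal: 36 Hz at amplitude 1.0 plus 40 Hz at amplitude 0.5.
x = [math.sin(2 * math.pi * 36 * ti) + 0.5 * math.sin(2 * math.pi * 40 * ti)
     for ti in t]

p36 = power_at(x, fs, 36)   # close to 1.0
p40 = power_at(x, fs, 40)   # close to 0.25
p44 = power_at(x, fs, 44)   # close to 0, no leakage from 36 or 40 Hz
```

      Because each tagging frequency completes a whole number of cycles within the window, power at one frequency does not leak into the others, which is the property a stricter frequency resolution relies on.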

      Additionally, we added correlation analyses between alpha activity and frequency-tagged activity within early visual areas, using the sensor cluster which showed significant condition differences in alpha activity. Here, no correlations between frequency-tagged activity and alpha activity could be found (apart from a small correlation with 40 Hz which could not be confirmed by a median split; see SUPPL Fig. 14 C). The absence of a significant correlation between early visual alpha and frequency-tagged activity has previously been described by others (Zhigalov & Jensen, 2020), and a Bayes factor below 1 also indicated that the alternative hypothesis is unlikely.
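
      The logic of checking a trial-by-trial correlation with a median split can be sketched as below. This is an illustration under stated assumptions, not the procedure reported in the manuscript: the helper names, the handling of ties at the median, and the toy per-trial values are hypothetical, and no significance test is included.

```python
import math
from statistics import mean, median


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def median_split_difference(alpha, tagged):
    """Mean tagged power on high-alpha trials minus low-alpha trials.

    Trials with alpha exactly at the median are assigned to the low group
    (an arbitrary choice made for this sketch).
    """
    cut = median(alpha)
    high = [t for a, t in zip(alpha, tagged) if a > cut]
    low = [t for a, t in zip(alpha, tagged) if a <= cut]
    return mean(high) - mean(low)


# Hypothetical per-trial alpha power and frequency-tagged power.
alpha = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
tagged = [2.1, 1.9, 2.4, 2.6, 3.1, 2.9]

r = pearson_r(alpha, tagged)
diff = median_split_difference(alpha, tagged)
```

      A correlation that reflects a genuine monotonic relationship should also appear as a difference between the high-alpha and low-alpha halves; a correlation driven by a few extreme trials may not.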

      Nonetheless, a correlation with auditory signal is possible and could be explained in different ways. For example, it could be that very early auditory feedback in early visual cortex (see for example Brang et al., 2022) is transmitted alongside visual information to higher-order areas. Several studies have shown that alpha activity and visual as well as auditory processing are closely linked together (Bauer et al., 2020; Popov et al., 2023). Inference on whether or how this link could play out in the case of this manuscript expands beyond the scope of this study.

      To summarize, we believe the fact that 36 Hz activity within early visual areas does not correlate with alpha activity on a trial-by-trial basis, but that 36 Hz activity in other areas does, provides strong evidence that alpha activity affects down-stream signal processing.

      We mention this analysis now in our discussion:

      L533-536: Our data provides evidence in favour of this view, as we can show that early sensory alpha activity does not covary over trials with SSEP magnitude in early visual areas, but covaries instead over trials with SSEP magnitude in higher order sensory areas (see also SUPPL. Fig. 14).

      Reviewer #1 (Recommendations for the authors):

      The evidence for the alternative hypothesis, that alpha in early sensory areas orchestrates downstream signal transmission, is not strong enough to be described up front in the abstract and Figure 1. I would leave it in the Discussion section, but advise against mentioning it in the abstract and Figure 1.

      We appreciate the reviewer’s concern regarding the inclusion of the alternative hypothesis—that alpha activity in early sensory areas orchestrates downstream signal transmission—in the abstract and Figure 1. While we agree that this interpretation is still developing, recent studies (Keitel et al., 2025; Clausner et al., 2024; Yang et al., 2024) provide growing support for this framework.

      In response, we have revised the introduction, discussion, and Figure 1 to clarify that our intention is not to outright dismiss the alpha inhibition hypothesis, but to refine and expand it in light of new data. This revision does not invalidate the prior literature on alpha timing and inhibition; rather, it proposes an updated mechanism that may better account for observed effects.

      We have, however, retained Figure 1, as it visually contextualizes the broader theoretical landscape, while at the same time adding further analyses to strengthen our empirical support for this emerging view.

      References:

      Bastos, A. M., Litvak, V., Moran, R., Bosman, C. A., Fries, P., & Friston, K. J. (2015). A DCM study of spectral asymmetries in feedforward and feedback connections between visual areas V1 and V4 in the monkey. NeuroImage, 108, 460–475. https://doi.org/10.1016/j.neuroimage.2014.12.081

      Bauer, A. R., Debener, S., & Nobre, A. C. (2020). Synchronisation of Neural Oscillations and Cross-modal Influences. Trends in cognitive sciences, 24(6), 481–495. https://doi.org/10.1016/j.tics.2020.03.003

      Brang, D., Plass, J., Sherman, A., Stacey, W. C., Wasade, V. S., Grabowecky, M., Ahn, E., Towle, V. L., Tao, J. X., Wu, S., Issa, N. P., & Suzuki, S. (2022). Visual cortex responds to sound onset and offset during passive listening. Journal of neurophysiology, 127(6), 1547–1563. https://doi.org/10.1152/jn.00164.2021

      Clausner, T., Marques, J., Scheeringa, R., & Bonnefond, M. (2024). Feature specific neuronal oscillations in cortical layers. bioRxiv, 2024.07.31.605816. https://doi.org/10.1101/2024.07.31.605816

      Diederich, A., & Colonius, H. (2008). When a high-intensity "distractor" is better then a low-intensity one: modeling the effect of an auditory or tactile nontarget stimulus on visual saccadic reaction time. Brain research, 1242, 219–230. https://doi.org/10.1016/j.brainres.2008.05.081

      Haegens, S., Nácher, V., Luna, R., Romo, R., & Jensen, O. (2011). α-Oscillations in the monkey sensorimotor network influence discrimination performance by rhythmical inhibition of neuronal spiking. Proceedings of the National Academy of Sciences of the United States of America, 108(48), 19377–19382. https://doi.org/10.1073/pnas.1117190108

      Jacoby, O., Hall, S. E., & Mattingley, J. B. (2012). A crossmodal crossover: opposite effects of visual and auditory perceptual load on steady-state evoked potentials to irrelevant visual stimuli. NeuroImage, 61(4), 1050–1058. https://doi.org/10.1016/j.neuroimage.2012.03.040

      Keitel, A., Keitel, C., Alavash, M., Bakardjian, K., Benwell, C. S. Y., Bouton, S., Busch, N. A., Criscuolo, A., Doelling, K. B., Dugue, L., Grabot, L., Gross, J., Hanslmayr, S., Klatt, L.-I., Kluger, D. S., Learmonth, G., London, R. E., Lubinus, C., Martin, A. E., … Kotz, S. A. (2025). Brain rhythms in cognition – controversies and future directions. ArXiv. https://doi.org/10.48550/arXiv.2507.15639

      Nickerson, R. S. (1973). Intersensory facilitation of reaction time: energy summation or preparation enhancement? Psychological review, 80(6), 489–509. https://doi.org/10.1037/h0035437

      Popov, T., Gips, B., Weisz, N., & Jensen, O. (2023). Brain areas associated with visual spatial attention display topographic organization during auditory spatial attention. Cerebral cortex (New York, N.Y. : 1991), 33(7), 3478–3489. https://doi.org/10.1093/cercor/bhac285

      Salagovic, C. A., & Leonard, C. J. (2021). A nonspatial sound modulates processing of visual distractors in a flanker task. Attention, perception & psychophysics, 83(2), 800–809. https://doi.org/10.3758/s13414-020-02161-5

      Scheeringa, R., Petersson, K. M., Kleinschmidt, A., Jensen, O., & Bastiaansen, M. C. (2012). EEG α power modulation of fMRI resting-state connectivity. Brain connectivity, 2(5), 254–264. https://doi.org/10.1089/brain.2012.0088

      Spaak, E., Bonnefond, M., Maier, A., Leopold, D. A., & Jensen, O. (2012). Layer-specific entrainment of γ-band neural activity by the α rhythm in monkey visual cortex. Current biology : CB, 22(24), 2313–2318. https://doi.org/10.1016/j.cub.2012.10.020

      Yang, X., Fiebelkorn, I. C., Jensen, O., Knight, R. T., & Kastner, S. (2024). Differential neural mechanisms underlie cortical gating of visual spatial attention mediated by alpha-band oscillations. Proceedings of the National Academy of Sciences of the United States of America, 121(45), e2313304121. https://doi.org/10.1073/pnas.2313304121

      Zhigalov, A., & Jensen, O. (2020). Alpha oscillations do not implement gain control in early visual cortex but rather gating in parieto-occipital regions. Human brain mapping, 41(18), 5176–5186. https://doi.org/10.1002/hbm.25183

      Zumer, J. M., Scheeringa, R., Schoffelen, J. M., Norris, D. G., & Jensen, O. (2014). Occipital alpha activity during stimulus processing gates the information flow to object-selective cortex. PLoS biology, 12(10), e1001965. https://doi.org/10.1371/journal.pbio.1001965

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      The paper by Boch and colleagues, entitled Comparative Neuroimaging of the Carnivore Brain: Neocortical Sulcal Anatomy, compares and describes the cortical sulci of eighteen carnivore species, and sets a benchmark for future work on comparative brains. 

      Based on previous observations, electrophysiological, histological and neuroimaging studies and their own observations, the authors establish a correspondence between the cortical sulci and gyri of these species. The different folding patterns of all brain regions are detailed, put into perspective in relation to their phylogeny as well as their potential involvement in cortical area expansion and behavioral differences. 

      Strengths: 

      This is a pioneering article, very useful for comparative brain studies and conducted with great seriousness and based on many past studies. The article is well-written and very didactic. The different protocols for brain collection, perfusion, and scanning are very detailed. The images are self-explanatory and of high quality. The authors explain their choice of nomenclature and labels for sulci and gyri on all species, with many arguments. The opening on ecology and social behavior in the discussion is of great interest and helps to put into perspective the differences in folding found at the level of the different cortexes. In addition, the authors do not forget to put their results into the context of the laws of allometry. They explain, for example, that although the largest brains were the most folded and had the deepest folds in their dataset, they did not necessarily have unique sulci, unlike some of the smaller, smoother brains. 

      Weaknesses: 

      The article is aware of its limitations, not being able to take into account interindividual variability within each species, inter-hemispheric asymmetries, or differences between males and females. However, this does not detract from their aim, which is to lay the foundations for a correspondence between the brains of carnivores so that navigation within the brains of these species can be simplified for future studies. This article does not include comparisons of morphometric data such as sulci depth, sulci wall surface, or thickness of the cortical ribbon around the sulci. 

      We thank the reviewer for their overwhelmingly positive evaluation of our work. As noted by the reviewer, our primary aim was to establish a framework for navigating carnivoran brains to lay the foundation for future research. We are pleased that this objective has been successfully achieved.

      Individual differences

      As the reviewer points out, we do not quantify within-species interindividual differences, which was a conscious choice. We aimed to emphasise the breadth of species over individuals, as is standard in large-scale comparative anatomy (cf. Heuer et al., 2023, eLife; Suarez et al., 2022, eLife). Following the logic of phylogenetic relationships, the presence of a particular sulcus across related species is also a measure of reliability. We felt safe in this choice, as previous work in both primates and carnivorans has shown that differences in major sulci across individuals are a matter of degree rather than a case of presence or absence (Connolly, 1950, External morphology of the primate brain, C.C. Thomas; Hecht et al., 2019 J Neurosci; Kawamura 1971 Acta Anat.; Kawamura & Naito, 1977, Acta Anat.). 

      In our revised manuscript, we now include additional individuals for six different species, representing both carnivoran suborders (Feliformia and Caniformia), and within Caniformia, both Arctoidea and Canidae (see revised Table 1 and main changes in text below). These additions confirm that intra-species variation primarily affects sulcal shape rather than the presence or absence of major sulci. Furthermore, the inclusion of additional individuals helped validate some initial observations, for example, confirming that the brown bear's proreal sulcus is more accurately characterised as a branch of the presylvian sulcus.

      Main changes in the revised manuscript:

      Results and discussion, p. 13-14: Presylvian sulcus. Rostral to the pseudo-sylvian fissure, the presylvian sulcus originates from or close to the rostral lateral rhinal fissure (see Supplementary Note 1 and Figure S2 for ventral view). The sulcus extends dorsally, and we observed a gentle caudal curve in the majority of the species (Figures 2-3, white).

      There were no major variations across species, but we noted a shortened sulcus in the meerkat and Egyptian mongoose and the presence of a secondary branch at the dorsal end that extended rostrally in the Eurasian badger and South American coati brain. The brown bear exhibited an additional sulcus in the frontal lobe, previously labelled as the proreal sulcus (see, e.g., Sienkiewicz et al., 2019); however, its shape closely resembled the secondary branches of the presylvian sulcus seen in the South American coati and Eurasian badger. Sienkiewicz et al. (2019) also noted that this sulcus merges with the presylvian sulcus in their specimen, consistent with our findings in the left hemisphere of the brown bear and bilaterally in the Ussuri brown bear (see Supplementary Figure S3A, S5A). Given the known gyrencephaly of Ursidae brains with frequent secondary and tertiary sulci (Lyras et al., 2023), we propose that this sulcus represents a branch of the presylvian sulcus.

      General Discussion, p. 23-24: Regarding individual variability in external brain morphology, previous work in primates and carnivorans has shown that differences across individuals typically affect sulcal shape, depth, or extent, but not the presence of major sulci. This has been reported in diverse contexts, including comparisons between captive and (semi-)wild macaques (Sallet et al., 2011; Testard et al., 2022), different dog breeds (Hecht et al., 2019), domestic cats (Kawamura, 1971b), or selectively bred foxes (Hecht et al., 2021). By including additional individuals for selected species, we extend these findings to a broader range of carnivorans. Notably, we observed no major sulcal differences between closely related species, even when specimens were acquired using different extraction and scanning protocols, for example, across felid clades or among wolf-like canids, further suggesting that substantial within-species variation is unlikely. While a full analysis of interindividual variability lies beyond the scope of this study, our findings support the reliability of the major sulcal patterns described.

      Interhemispheric differences

      Regarding potential inter-hemispheric differences, we have now also created digital atlases of all identified sulci in both hemispheres, which are publicly available at https://git.fmrib.ox.ac.uk/neuroecologylab/carnivore-surfaces. While the manuscript continues to focus primarily on descriptions of the right hemisphere, we now also report observed inter-hemispheric differences where applicable. These differences remain minor and, again, a matter of degree. For example, the complementary quantitative analyses investigating covariation between sulcal length and behavioural traits conducted in the right hemisphere were replicated in the left (Supplementary Figure S6 and related Supplementary tables S1-S3).

      Main changes in the revised manuscript:  

      Materials and Methods, p. 33: We focused on the major lateral and dorsal sulci of the carnivoran brain, but the medial wall and ventral view of the sulci are also described. For consistency, we started by labelling the right hemispheres on the mid-thickness surfaces; these are the hemispheres presented in the manuscript. An exception was made for the jungle cat, for which only the left hemisphere was available and is therefore shown. We aimed to facilitate interspecies comparisons and the exploration of previously undescribed carnivoran brains. To this end, we first created standardized criteria (henceforth referred to as recipes) for identifying each sulcus, drawing from existing literature on carnivoran neuroanatomy, particularly in paleoneurology (Lyras et al., 2023), and our own observations. In addition, we created digital sulcal masks for both hemispheres, which allowed us to test whether the same patterns were observable bilaterally and to further facilitate future research building on our framework. For the Egyptian mongoose, only the right hemisphere was available, and thus, a bilateral comparison was not possible for this species. Anatomical nomenclature primarily follows the recommendations of Czeibert et al (2018); if applicable, alternative names of sulci are provided once.

      Materials and Methods, p. 34-35: We first briefly illustrated the gyri of the carnivoran brain with a focus on gyri that are not present in some species as a consequence of absent sulci to complement our observations. We then summarised the key differences and similarities in sulcal anatomy between species and related them to their ecology and behaviour. To complement this qualitative description, we conducted an initial quantitative analysis of sulcal length data from both hemispheres. 

      To test whether sulcal length covaries with behavioural traits, we fit linear models predicting the relative length of the three target sulci (cruciate, postcruciate, proreal) as a function of forepaw dexterity (low vs. high) and sociality (solitary vs. cooperative hunting). We measured the absolute length of each sulcus using the wb_command -border-length function from the Connectome Workbench toolkit (Marcus et al., 2011) applied to the manually defined sulcal masks (i.e., border files). Relative sulcal length was calculated by dividing the length of each target sulcus by that of a reference sulcus in the same hemisphere, reducing interspecies variation in brain or sulcal size. Reference sulci were required to be present in all species within a hemisphere and excluded if they were a target sulcus, part of the same functional system (e.g., somatosensory/motor), or anatomically atypical (e.g., the pseudosylvian fissure). This resulted in seven reference sulci for the proreal sulcus (ansate, coronal, marginal, presylvian, retrosplenial, splenial, suprasylvian) and four for the cruciate and postcruciate sulci (marginal, retrosplenial, splenial, suprasylvian). For each target-reference pair, we fit the following linear model: relative length ~ forepaw dexterity + sociality. Models were run separately for left and right hemispheres, with the left serving as a replication test. Associations were considered meaningful if the predictor reached statistical significance (p ≤ .05) in ≥ 75% of reference sulcus models per hemisphere. Additional individuals were not included in the analysis.
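The relative-length step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code released with the manuscript: the function name `relative_lengths`, the dictionary layout, and every length value are hypothetical, and only the reference sulcus lists are taken from the text. How species lacking a target sulcus were handled is not specified in the passage, so the sketch assumes the target is present.

```python
# Reference sulci per target sulcus, as listed in the Methods passage.
REFERENCE_SULCI = {
    "proreal": ["ansate", "coronal", "marginal", "presylvian",
                "retrosplenial", "splenial", "suprasylvian"],
    "cruciate": ["marginal", "retrosplenial", "splenial", "suprasylvian"],
    "postcruciate": ["marginal", "retrosplenial", "splenial", "suprasylvian"],
}


def relative_lengths(lengths, target):
    """Divide the target sulcus length by each of its reference sulci.

    `lengths` maps sulcus name -> absolute length for ONE hemisphere of
    ONE animal (hypothetical layout). Returns {reference: target / reference}.
    """
    return {
        ref: lengths[target] / lengths[ref]
        for ref in REFERENCE_SULCI[target]
    }


# Hypothetical absolute lengths for one right hemisphere (made-up values).
example_right_hemisphere = {
    "cruciate": 12.0,
    "marginal": 48.0,
    "retrosplenial": 24.0,
    "splenial": 60.0,
    "suprasylvian": 40.0,
}

rel = relative_lengths(example_right_hemisphere, "cruciate")
for ref, value in rel.items():
    print(f"cruciate / {ref}: {value:.2f}")
```

Each target-reference pair yields one relative length per animal, so a target with four reference sulci contributes four separate linear models per hemisphere.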

      Data and code availability statement, p. 35-36: Generated surfaces of all species and T1-like contrast images of post-mortem samples obtained by the Copenhagen Zoo and the Zoological Society of London (see Table 1) are available at the Digital Brain Zoo of the University of Oxford (Tendler et al., 2022) (https://open.win.ox.ac.uk/DigitalBrainBank/#/datasets/zoo). For all other species, except the domestic cat, the cortical surface reconstructions are available through the same resource. In-vivo data for the domestic cat is available upon request.

      We created, extracted and analysed sulcal length data using the Connectome Workbench toolkit (Marcus et al., 2011), R 4.4.0 (R Core Team, 2023) and Python 3.9.7. Sulcal masks, along with the associated midthickness cortical surface reconstructions for all 32 animals, species-specific behavioural data, and the code used to extract sulcal lengths and perform the statistical analyses are available at: https://git.fmrib.ox.ac.uk/neuroecologylab/carnivore-surfaces

      Further brain measures

      We feel that sulci depth, sulci wall surface, or thickness of the cortical ribbon are measures that vary more across individuals, and we have therefore not included them in the study. In addition, these are measures that are not generally used as between-species comparative measures, whereas sulcal patterning is (cf. Amiez et al., 2019, Nat Comms; Connolly, 1950; Miller et al., 2021, Brain Behav Evol; Radinsky 1975, J Mammal; Radinsky 1969, Ann N Y Acad Sci; Welker & Campos 1963 J. Comp Neurol).

      We, therefore, added them as suggestions for future directions, building on our work.

      Major changes in the revised manuscript:

      Limitations and future directions, p. 25-26: Our findings represent a critical first step for linking brains within and across species for interspecies insights. The present analyses are based on multiple individuals pooled into families and genera, primarily focusing on single representatives per species. Additional individuals for selected species confirmed that intra-species variation is a matter of degree rather than a case of presence or absence of major sulci, but we do not provide an extensive account of the possible range of sulcal shape or other anatomical features. Future studies will aim to systematically investigate interindividual variability in sulcal shape, depth, surface area, or thickness of the cortical ribbon surrounding the sulci, and will extend to more detailed investigations of the medial part of the cortex, as well as the subcortical structures and the cerebellum. The present framework and resulting database also provide the foundation to guide and facilitate future investigations of inter- and intra-species variation in regional brain size.

      Reviewer #2 (Public review): 

      Summary: 

      The authors have completed MRI-based descriptions of the sulcal anatomy of 18 carnivoran species that vary greatly in behaviour and ecology. In this descriptive study, different sulcal patterns are identified in relation to phylogeny and, to some extent, behaviour. The authors argue that the reported differences across families reflect behaviour and electrophysiology, but these correlations are not supported by any analyses. 

      Strengths: 

      A major strength of this paper is using very similar imaging methods across all specimens. Often papers like this rely on highly variable methods so that consistency reduces some of the variability that can arise due to methodology. 

      The descriptive anatomy was accurate and precise. I could readily follow exactly where on the cortical surface the authors were referring. This is not always the case for descriptive anatomy papers, so I appreciated the efforts the authors took to make the results understandable for a broader audience. 

      I also greatly appreciate the authors making the images open access through their website. 

      Weaknesses: 

      Although I enjoyed many aspects of this manuscript, it is lacking in any quantitative analyses that would provide more insights into what these variations in sulcal anatomy might mean. The authors do discuss inter-clade differences in relation to behaviour and older electrophysiology papers by Welker, Campos, Johnson, and others, but it would be more biologically relevant to try to calculate surface areas or volumes of cortical fields defined by some of these sulci. For example, something like the endocast surface area measurements used by Sakai and colleagues would allow the authors to test for differences among clades, in relation to brain/body size, or behaviour. Quantitative measurements would also aid significantly in supporting some of the potential correlations hinted at in the Discussion.  

      Although quantitative measurements would be helpful, there are also some significant concerns in relation to the specimens themselves. First, almost all of these are captive individuals. We know that environmental differences can alter neocortical development in humans and nonhuman animals and domestication affects neocortical volume and morphology. Whether captive breeding affects neocortical anatomy might not be known, but it can affect other brain regions and overall brain size and could affect sulcal patterns. Second, despite using similar imaging methods across specimens, fixation varied markedly across specimens. Fixation is unlikely to affect the ability to recognize deep sulci, but variations in shrinkage could nevertheless affect overall brain size and morphology, including the ability to recognize shallow sulci. Third, the sample size = 1 for every species examined. In humans and nonhuman animals, sulcal patterns can vary significantly among individuals. In domestic dogs, it can even vary greatly across breeds. It, therefore, remains unclear to what extent the pattern observed in one individual can be generalized for a species, let alone an entire genus or family. The lack of accounting for inter-individual variability makes it difficult to make any firm conclusions regarding the functional relevance of sulcal patterns. 

      We thank the reviewer for their assessment of our work. The primary aim of this study was to establish a framework for navigating carnivoran brains by providing a comprehensive overview of all major neocortical sulci across eighteen different species. Given the inconsistent nomenclature in the literature and the lack of standardized criteria (“recipes”) for identifying the major sulci, we specifically focused on homogenizing the terminology and creating recipes for their identification. In addition to generating digital cortical surfaces for all brains, we have now also added sulcal masks to further support future research building on this framework. We are pleased that our primary objective is seen as successfully achieved and are delighted to report that, following the reviewer’s recommendations, we have further expanded the dataset by including eight additional species and a second individual for six species, yielding a total of 32 carnivorans from eight carnivoran families (see revised Table 1 for a detailed list).

      The present dataset constitutes the most comprehensive collection of fissiped carnivoran brains to date, encompassing a wide range of land-dwelling species from eight families. It includes diverse representatives, such as both social and solitary mongooses, weasel-like and non-weasel mustelids, and a broad spectrum of canids including wolf-like, fox-like, and more basal forms. Further expanding this already extensive dataset has even led to novel discoveries, such as the felid-specific diagonal sulcus and the unique occipito-temporal sulcal configuration shared by herpestids and hyaenids. 

      Major changes in the revised manuscript:

      Results and discussion, p. 4-5: We labelled the neocortical sulci of twenty-six carnivoran species (see Figure 1) based on reconstructed surfaces and developed standardised criteria (“recipes”) for identifying each major sulcus. For each sulcus, we also created corresponding digital masks. Our study included eleven Feliformia and fifteen Caniformia species from eight different carnivoran families. Within the suborder Caniformia, we examined eight Canidae and seven Arctoidea species. In addition, we describe relative intra-species variation in sulcal shape based on supplementary specimens from six species (see Table 1).

      Overall, of the carnivorans studied, Canidae brains exhibited the largest number of unique major sulci, while the brown bear brain was the most gyrencephalic, with the deepest folds and many secondary sulci (see Figures 2-3; brains are arranged by descending number of major sulci). The brown bear was also the largest animal in the sample. The brains of the smaller species, such as the fennec fox, meerkat or ferret, were the most lissencephalic, with the sulci having fewer undulations or indentations compared to the other species. A similar trend has also been observed in the sulci of the prefrontal cortex in primates (Amiez et al., 2023, 2019). The meerkat and Egyptian mongoose exhibited the smallest number of major sulci but possessed, along with the striped hyena, a unique configuration of sulci in the occipito-temporal cortex. In the following, we describe each sulcus' appearance, the recipes on how to identify them, and provide an overview of the most significant differences across species.

      Results and discussion, p. 11: Diagonal sulcus. The diagonal sulcus is oriented nearly perpendicularly to the rostral portion of the suprasylvian sulcus (Figure 2, Supplementary Figure S2, red). We identified it in all Felidae and in the striped hyena, but it was absent in Herpestidae and all Caniformia species.

      In our sample, the sulcus showed moderate variation in shape and continuity. In the caracal and the second sand cat, it appeared as a detached continuation of the rostral suprasylvian sulcus (Supplementary Figure S3). In the Amur and Persian leopards, the diagonal sulcus merged with the rostral ectosylvian sulcus on the right hemisphere, forming a continuous or bifurcated groove. Similar individual variation has been described in domestic cats (Kawamura, 1971b).

      We respectfully disagree with the reviewer on two counts, where we believe the reviewer is not judging the scope of the current work.

      (1) Inter-individual differences & potential confounding factors

      The first is with respect to individual differences. To the best of our knowledge, differences between captive and wild animals, or indeed between individuals, do not affect the presence or absence of any major sulci. No differences in sulcal patterns were detected between captive and (semi-)wild macaques (cf. Sallet et al., 2011, Science; Testard et al., 2022, Sci Adv), different dog breeds (Hecht et al., 2019 J Neurosci) or foxes selectively bred to simulate domestication, compared to controls (Hecht et al., 2021 J. Neurosci). 

      By including additional individuals for selected species in the revised version of our manuscript, we confirm and extend these findings to a broader range of carnivorans. Indeed, we also did not observe major differences between closely related species, even when specimens were collected using different extraction and scanning protocols - for example, across felid clades or wolf-like canids - making substantial individual variation within a species even less likely. Thus, while a comprehensive analysis of interindividual variability is beyond the scope of this study, our observations support the robustness of the major sulcal patterns described here. Moreover, the inclusion of additional individuals also helped validate some initial observations, for example, confirming that the brown bear's proreal sulcus is more accurately characterised as a branch of the presylvian sulcus.

      We do, however, agree with the reviewer that building up a database like ours benefits from providing as much information about the samples as possible to enable these issues to be tested. We, therefore, made sure to include as detailed information as possible, including whether the animals were from captive or wild populations, in our manuscript. 

      Main changes in the revised manuscript: 

      Results and discussion, p. 13-14: Presylvian sulcus. There were no major variations across species, but we noted a shortened sulcus in the meerkat and Egyptian mongoose and the presence of a secondary branch at the dorsal end that extended rostrally in the Eurasian badger and South American coati brain. The brown bear exhibited an additional sulcus in the frontal lobe, previously labelled as the proreal sulcus (see, e.g., Sienkiewicz et al., 2019); however, its shape closely resembled the secondary branches of the presylvian sulcus seen in the South American coati and Eurasian badger. Sienkiewicz et al. (2019) also noted that this sulcus merges with the presylvian sulcus in their specimen, consistent with our findings in the left hemisphere of the brown bear and bilaterally in the Ussuri brown bear (see Supplementary Figure S3A, S5A). Given the known gyrencephaly of Ursidae brains with frequent secondary and tertiary sulci (Lyras et al., 2023), we propose that this sulcus represents a branch of the presylvian sulcus.

      Results and discussion, p. 23-24: Regarding individual variability in external brain morphology, previous work in primates and carnivorans has shown that differences across individuals typically affect sulcal shape, depth, or extent, but not the presence of major sulci. This has been reported in diverse contexts, including comparisons between captive and (semi-)wild macaques (Sallet et al., 2011; Testard et al., 2022), different dog breeds (Hecht et al., 2019), domestic cats (Kawamura, 1971b), or selectively bred foxes (Hecht et al., 2021). By including additional individuals for selected species, we extend these findings to a broader range of carnivorans. Notably, we observed no major sulcal differences between closely related species, even when specimens were acquired using different extraction and scanning protocols, for example, across felid clades or among wolf-like canids, further suggesting that substantial within-species variation is unlikely. While a full analysis of interindividual variability lies beyond the scope of this study, our findings support the reliability of the major sulcal patterns described.

      Limitations and future directions, p. 25-26: Our findings represent a critical first step for linking brains within and across species for interspecies insights. The present analyses are based on multiple individuals pooled into families and genera, primarily focusing on single representatives per species. Additional individuals for selected species confirmed that intra-species variation is a matter of degree rather than a case of presence or absence of major sulci, but we do not provide an extensive account of the possible range of sulcal shape or other anatomical features.

      Future studies will aim to systematically investigate interindividual variability in sulcal shape, depth, surface area, or thickness of the cortical ribbon surrounding the sulci, and will extend to more detailed investigations of the medial part of the cortex, as well as the subcortical structures and the cerebellum. The present framework and resulting database also provide the foundation to guide and facilitate future investigations of inter- and intra-species variation in regional brain size.

      (2) Quantification of structure/function relationships

      The second is in the quantification of structure/function relationships. We believe the cortical surfaces, detailed sulci descriptions, and atlases themselves are the main deliverables of this project. We felt it prudent to include some qualitative descriptions of the relationship between sulci as we observed them and behaviours as known from the literature, as a way to illustrate the possibilities that this foundational work opens up. This approach also allowed us to confirm and extend previous findings based on observations from a less diverse range of carnivoran species and families (Radinsky 1968 J Comp Neurol; Radinsky 1969, Ann N Y Acad Sci; Welker & Campos 1963 J Comp Neurol; Welker & Seidenstein, 1959 J Comp Neurol).

      However, a full statistical framework for analysis is beyond the scope of this paper. Our group has previously worked on methods to quantitatively compare brain organization across species - indeed, we have developed a full framework for doing so (Mars et al., 2021, Annu Rev Neurosci), based on the idea that brains that differ in size and morphology should be compared based on anatomical features in a common feature space. Previously, we have used white matter anatomy (Mars et al., 2018, eLife) and spatial transcriptomics (Beauchamp et al., 2021, eLife). The present work presents the foundation for this approach to be expanded to sulcal anatomy, but the full development of it will be the topic of future communications.

      Nevertheless, we now include a preliminary quantitative analysis of the relationship between the relative length of specific sulci and the two behavioural traits of interest. These analyses, which complement the qualitative observations in Figure 5, show that the relative length of the proreal sulcus was consistently greater in highly social, cooperatively hunting species, while no effect of forepaw dexterity was found (Supplementary Table S1). In contrast, both the cruciate and postcruciate sulci were significantly longer in species with high forepaw dexterity, but not related to sociality (Supplementary Tables S2–S3). These findings were consistent across reference sulci used to compute relative sulcal length and replicated in the left hemisphere (see Supplementary Figure S6).

      We also would like to emphasize that we strongly believe that looking at measures of brain organization at a more detailed level than brain size or relative brain size is informative. Although studies correlating brain size with behavioural variables are prominent in the literature, they often struggle to distinguish between competing behavioural hypotheses (Healy, 2021, Adaptation and the Brain, OUP). In contrast, connectivity has a much more direct relationship to behavioural differences across species (Bryant et al., 2024, JoN), as does sulcal anatomy (Amiez et al., 2019, Nat Comms; Miller et al., 2021, Brain Behav Evol). Using our sulcal framework, we observed lineage-specific variations that would be overlooked by analyses focused solely on brain size. Moreover, such measures are less sensitive to the effects of fixation since that will affect brain size but not the presence or absence of a sulcus.

      Main changes in the revised manuscript:

      Results and discussion, p. 16-17: In the raccoon, red panda, coati, and ferret, considerably larger portions of the postcruciate gyrus S1 area appeared to be allocated to representing the forepaw and forelimbs (McLaughlin et al., 1998; Welker and Campos, 1963; Welker and Seidenstein, 1959) when compared to the domestic cat or dog (Dykes et al., 1980; Pinto Hamuy et al., 1956). This aligns with the observation that all species in the present sample with more complex or elongated postcruciate and cruciate sulci configurations display a preference for using their forepaws when manipulating their environment (see e.g., Iwaniuk et al., 1999; Iwaniuk and Whishaw, 1999; Radinsky, 1968; and Figure 5A). Complementary quantitative analyses further support this link, revealing a positive relationship between the relative length of the cruciate and postcruciate sulci and high forepaw dexterity (see Supplementary Figure S6, Tables S2-S3). This is suggestive of a potential link between sulcal morphology and a behavioural specialization in Arctoidea, consistent with earlier observations in otter species (Radinsky, 1968). 

      Results and discussion, p. 21: A distinct proreal sulcus was observed in the frontal lobe of the domestic dog, the African wild dog, wolf, dingo, and bush dog. This may indicate an expansion of frontal cortex in these animals compared to the other species in our sample (Figure 5-6). This aligns with findings from a comprehensive study comparing canid endocasts revealing an expanded proreal gyrus in these animals compared to the fennec fox, red fox and other species of the genus Vulpes (Lyras and Van Der Geer, 2003). The canids with a proreal sulcus also exhibit complex social structures compared to the primarily solitary living foxes (Nowak, 2005; Wilson and Mittermeier, 2009; Wilson, 2000, and see Figure 5). Despite living in social groups, the bat-eared fox, an insectivorous canid, does not possess a proreal sulcus. Its foraging behaviour is best described as spatially or communally coordinated rather than truly cooperative (Macdonald and Sillero-Zubiri, 2004), suggesting that the relationship between sulcal morphology and sociality may be specific to species engaging in active cooperative hunting. Supplementary quantitative analyses also confirm an increase in the relative length of the proreal sulcus in cooperatively hunting species. Moreover, a previous investigation of Canidae and Felidae brain evolution, using endocasts of extant and extinct species, also suggested a link between the emergence of pack structures and the proreal sulcus in Canidae (Radinsky, 1969). Despite being highly social and living in large social groups (i.e., mobs), meerkats appear to have a relatively small frontal lobe and no proreal sulcus compared to the social Canids (Figure 5), which would suggest that if the presence of a proreal sulcus correlates with complex social behaviour, this is canid-specific.

      General discussion, p. 22-23: Our results revealed several interesting patterns of local variation in sulcal morphology between and within different lineages, and successfully replicate and expand upon prior observations based on more limited sets of species (Radinsky, 1969, 1968; Welker and Campos, 1963; Welker and Seidenstein, 1959). For example, Arctoidea showed relatively complex sulcal anatomy in the somatosensory cortex but low complexity in the occipito-temporal regions. In Canidae and Felidae, we found more complex occipito-temporal sulcal patterns indicative of changes in the amount of cortex devoted to visual and auditory processing in these regions. These observations may be linked to social or ecological factors, such as how the animals interact with objects or each other and their varied foraging strategies. Another example was the differential relative expansion of the neocortex surrounding the cruciate sulcus, which was particularly complex in Arctoidea species that are known to use their paws to manipulate their environment. Consistent with this observation, complementary quantitative analyses of both hemispheres revealed that species with high forepaw dexterity tended to have longer cruciate and postcruciate sulci. Although it has been argued that the cruciate sulcus appeared independently in different lineages and its exact relationship to the location of primary motor areas varies (Radinsky, 1971), our results provide a detailed exploration of the relationship between brain morphology and behavioural preferences across such a range of species.  

      Materials and Methods, p. 33: We focused on the major lateral and dorsal sulci of the carnivoran brain, but the medial wall and ventral view of the sulci are also described. For consistency, we started by labelling the right hemispheres on the mid-thickness surfaces; these are the hemispheres presented in the manuscript. An exception was made for the jungle cat, for which only the left hemisphere was available and is therefore shown. We aimed to facilitate interspecies comparisons and the exploration of previously undescribed carnivoran brains. To this end, we first created standardized criteria (henceforth referred to as recipes) for identifying each sulcus, drawing from existing literature on carnivoran neuroanatomy, particularly in paleoneurology (Lyras et al., 2023), and our own observations. In addition, we created digital sulcal masks for both hemispheres, which allowed us to test whether the same patterns were observable bilaterally and to further facilitate future research building on our framework. For the Egyptian mongoose, only the right hemisphere was available, and thus, a bilateral comparison was not possible for this species. Anatomical nomenclature primarily follows the recommendations of Czeibert et al (2018); if applicable, alternative names of sulci are provided once.

      Materials and Methods, p. 34-35: We first briefly illustrated the gyri of the carnivoran brain with a focus on gyri that are not present in some species as a consequence of absent sulci to complement our observations. We then summarised the key differences and similarities in sulcal anatomy between species and related them to their ecology and behaviour. To complement this qualitative description, we conducted an initial quantitative analysis of sulcal length data from both hemispheres. To test whether sulcal length covaries with behavioural traits, we fit linear models predicting the relative length of the three target sulci (cruciate, postcruciate, proreal) as a function of forepaw dexterity (low vs. high) and sociality (solitary vs. cooperative hunting). We measured the absolute length of each sulcus using the wb_command -border-length function from the Connectome Workbench toolkit (Marcus et al., 2011) applied to the manually defined sulcal masks (i.e., border files). Relative sulcal length was calculated by dividing the length of each target sulcus by that of a reference sulcus in the same hemisphere, reducing interspecies variation in brain or sulcal size. Reference sulci were required to be present in all species within a hemisphere and excluded if they were a target sulcus, part of the same functional system (e.g., somatosensory/motor), or anatomically atypical (e.g., the pseudosylvian fissure). This resulted in seven reference sulci for the proreal sulcus (ansate, coronal, marginal, presylvian, retrosplenial, splenial, suprasylvian) and four for the cruciate and postcruciate sulci (marginal, retrosplenial, splenial, suprasylvian). For each target-reference pair, we fit the following linear model: relative length ~ forepaw dexterity + sociality. Models were run separately for left and right hemispheres, with the left serving as a replication test. Associations were considered meaningful if the predictor reached statistical significance (p ≤ .05) in ≥ 75% of reference sulcus models per hemisphere. Additional individuals were not included in the analysis.
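
      The two arithmetic steps in this procedure, the relative-length normalisation and the ≥ 75% decision rule, can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' analysis code: the function names, the example lengths and the example p-values are all hypothetical, and the linear models that would produce the p-values are assumed to have been fitted elsewhere.

```python
from typing import Dict, List


def relative_lengths(target_length: float,
                     reference_lengths: Dict[str, float]) -> Dict[str, float]:
    """Divide one target sulcus length by each reference sulcus length.

    All lengths are assumed to come from the same hemisphere and to be
    in the same unit, so the result is unitless.
    """
    return {name: target_length / ref
            for name, ref in reference_lengths.items()}


def is_meaningful(p_values: List[float],
                  alpha: float = 0.05,
                  min_fraction: float = 0.75) -> bool:
    """Apply the decision rule described in the text.

    A predictor counts as meaningful if it reaches p <= alpha in at
    least min_fraction of the reference-sulcus models of a hemisphere.
    """
    if not p_values:
        return False
    n_significant = sum(p <= alpha for p in p_values)
    return n_significant / len(p_values) >= min_fraction


# Hypothetical example: one target sulcus, two reference sulci.
example_ratios = relative_lengths(30.0, {"marginal": 60.0, "splenial": 40.0})

# Hypothetical p-values of one predictor across four reference-sulcus models.
example_decision = is_meaningful([0.01, 0.03, 0.04, 0.20])
```

      With four reference sulci, as used for the cruciate and postcruciate sulci, the rule requires significance in at least three of the four models; with seven, as used for the proreal sulcus, in at least six.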

      Data and code availability statement, p. 35-36: Generated surfaces of all species and T1-like contrast images of post-mortem samples obtained by the Copenhagen Zoo and the Zoological Society of London (see Table 1) are available at the Digital Brain Zoo of the University of Oxford (Tendler et al., 2022) (https://open.win.ox.ac.uk/DigitalBrainBank/#/datasets/zoo). For all other species, except the domestic cat, the cortical surface reconstructions are available through the same resource. In-vivo data for the domestic cat is available upon request.

      We created, extracted and analysed sulcal length data using the Connectome Workbench toolkit (Marcus et al., 2011), R 4.4.0 (R Core Team, 2023) and Python 3.9.7. Sulcal masks, along with the associated midthickness cortical surface reconstructions for all 32 animals, species-specific behavioural data, and the code used to extract sulcal lengths and perform the statistical analyses are available at:

      https://git.fmrib.ox.ac.uk/neuroecologylab/carnivore-surfaces

      Reviewer #1 (Recommendations for the authors): 

      I was convinced by your model of labels in the temporal region and the nomenclature used, thanks to your argument concerning the primary auditory area in ferrets located in the gyrus called ectosylvian even though they have no ectosylvian sulcus. While this region raises questions, it seems to me that you make a good case for your labelling. 

      However, I don't understand your arguments in the occipital region regarding the ectomarginal sulcus. In the bear, for example, I don't understand why the caudal part of the marginal sulcus is not referred to as ectomarginal? You say that this sulcus is specific to canids.

      Whether in the paragraph describing the ectomarginal sulcus, the marginal sulcus, in the paragraphs on the gyri, or in the paragraph concerning the potential relationship to function, I don't see any argument to support your hypothesis. Especially as there is no information in the literature on the functions in this area of the bear brain as in that of the dog or other related species. 

      You just mention that in Canidae, the ectomarginal "runs between the suprasylvian and marginal sulcus", and I don't see why this is an argument. 

      Could you explain in more detail your choice of label and the specificity you claim to have in the canids of this region? 

      We have now expanded our rationale in the revised manuscript, particularly in the section describing the marginal sulcus, which directly follows the description of the ectomarginal sulcus. In brief, across our sample, including Ursidae and Canidae, we observed variation in whether the caudal marginal sulcus was detached or continuous, or extended further caudally vs ventrally, but no separate additional sulcus resembling the ectomarginal sulcus was seen in any species outside the canid family. We therefore reserve the label ectomarginal sulcus for the distinct structure consistently observed in Canidae and avoid applying it to the detached caudal marginal sulcus observed in Ursidae.

      Main changes in the revised manuscript:

      Results and discussion, p. 10-11: In several species, including the dingo, domestic cat, brown bear and South American coati, as well as further supplementary individuals (Supplementary figure S3B), the caudal portion of the marginal sulcus was detached in one or both hemispheres, which is a frequently reported occurrence (England, 1973; Kawamura, 1971a; Kawamura and Naito, 1978). Potentially due to the similar caudal bend, some authors have labelled the (detached) caudal portion of the marginal sulcus in Ursidae as the ectomarginal sulcus (Lyras et al., 2023, but see e.g., Sienkiewicz et al., 2019).

      The (detached) caudal marginal sulcus in Ursidae continues the course of the marginal sulcus caudally and/or ventrally and is topologically continuous with it. In contrast, the ectomarginal sulcus in Canidae is an entirely separate sulcus that runs between the suprasylvian and marginal sulci, forming a small, additional arch that is rarely connected to the marginal sulcus (Kawamura and Naito, 1978). This distinction is illustrated, for example, in the dingo and grey wolf. In the dingo, we observed both a detached caudal extension of the marginal sulcus and a distinct ectomarginal sulcus. In both grey wolf specimens, the marginal sulcus extended ventrally in a way that resembled the brown bear, but they also exhibited a clearly separate ectomarginal sulcus, confirming that the two features are not equivalent. In contrast, in the brown bear and Ussuri brown bear (Supplementary Figure S3B), we observed variation in whether the marginal sulcus was detached or continuous, but no separate sulcus resembling the ectomarginal sulcus seen in Canidae.

      Reviewer #2 (Recommendations for the authors): 

      Although I indicated this already, I stress that the lack of quantification is problematic. In its current format, this is a classic descriptive study suitable for an anatomy journal, but even then, the conclusions are highly speculative. I would advise including some quantification of sulcal lengths or depths and surface areas or volumes of individual regions and relate all of those to overall brain size and potential clade differences. Figure 5 hints at some of these putative correlations, but is not an analysis. Some of these correlations are discussed in the manuscript, but without quantification, it is simply more descriptions and some speculative associations that largely parallel and corroborate findings from Radinsky's papers.  In addition to quantification, the authors should consider a more fulsome explanation of the potential confounds and limitations of their data. As alluded to above, there are many sources of variation that were not sufficiently discussed but are critically important for interpreting any putative differences among and within clades.  

      We would like to reiterate that the primary aim of our study was to establish a comprehensive sulcal framework for carnivoran brains. The behavioural and ecological associations were secondary and exploratory, arising from a first application of this framework, and will require further investigation in future studies. 

      We already acknowledged in the initial version of the manuscript that many of our observations were consistent with those previously reported by Radinsky in more limited sets of species. However, we recognise that this point may not have come across clearly. We carefully revised our manuscript to further emphasise that our findings replicate and extend Radinsky’s work in a larger cross-species comparison.

      As detailed in the public reviews, we did not measure overall or relative brain sizes. However, in the revised version of the manuscript, we have now quantified the relationship between sulcal length and its association with forepaw dexterity and sociality to complement the qualitative observations in Figure 5. Although preliminary, we believe that these analyses further showcase the strength of our sulcal framework and its potential for future investigations. 

      We also revised our discussion section to highlight the potential for future studies to build on our framework to systematically investigate interindividual variability in sulcal shape, depth, surface area, or thickness of the cortical ribbon surrounding the sulci. We also added that our framework and accompanying dataset can facilitate and guide future investigations into both inter- and intra-species variation in regional brain size.

      Main changes in the revised manuscript:

      General discussion, p. 22-23: Our results revealed several interesting patterns of local variation in sulcal morphology between and within different lineages, and successfully replicate and expand upon prior observations based on more limited sets of species (Radinsky, 1969, 1968; Welker and Campos, 1963; Welker and Seidenstein, 1959). For example, Arctoidea showed relatively complex sulcal anatomy in the somatosensory cortex but low complexity in the occipito-temporal regions. In Canidae and Felidae, we found more complex occipito-temporal sulcal patterns indicative of changes in the amount of cortex devoted to visual and auditory processing in these regions. These observations may be linked to social or ecological factors, such as how the animals interact with objects or each other and their varied foraging strategies. Another example was the differential relative expansion of the neocortex surrounding the cruciate sulcus, which was particularly complex in Arctoidea species that are known to use their paws to manipulate their environment. Consistent with this observation, complementary quantitative analyses of both hemispheres revealed that species with high forepaw dexterity tended to have longer cruciate and postcruciate sulci. Although it has been argued that the cruciate sulcus appeared independently in different lineages and its exact relationship to the location of primary motor areas varies (Radinsky, 1971), our results provide a detailed exploration of the relationship between brain morphology and behavioural preferences across such a range of species.

      Limitations and future directions, p. 25-26: Our findings represent a critical first step for linking brains within and across species for interspecies insights. The present analyses are based on multiple individuals pooled into families and genera, primarily focusing on single representatives per species. Additional individuals for selected species confirmed that intra-species variation is a matter of degree rather than a case of presence or absence of major sulci, but we do not provide an extensive account of the possible range of sulcal shape or other anatomical features. Future studies will aim to systematically investigate interindividual variability in sulcal shape, depth, surface area, or thickness of the cortical ribbon surrounding the sulci, and will extend to more detailed investigations of the medial part of the cortex, as well as the subcortical structures and the cerebellum. The present framework and resulting database also provide the foundation to guide and facilitate future investigations of inter- and intra-species variation in regional brain size.

      Another point that I did not see raised in the Discussion, but would be important and useful to include is that the authors are lacking specimens for several clades that could show additional differences in neocortical anatomy. For example, no hyaenids or viverrids were represented and an otter and badger are not necessarily representative of all mustelids, the majority of which are weasel-like. One could even argue that the meerkat is not necessarily representative of all herpestids given its behaviour and ecology. Of course, there are also pinnipeds, but they are divergent in many ways, and restricting the analyses to fissiped carnivorans is completely reasonable. Please note that I am not suggesting that the authors go back and try to procure even more species; rather they should emphasize that this is an incomplete survey of fissiped carnivorans. 

      The reviewer’s comments prompted us to further expand our carnivoran brain collection to include a broader range of species, representatives, and individual specimens. Notably, the collection now includes a hyaenid representative, the striped hyena. In addition to the otter and badger, we have added a weasel-like mustelid, the ferret, as well as the solitary Egyptian mongoose to complement the highly social meerkat within Herpestidae. Our felid dataset has also been expanded to include additional small and large wild cats, such as the sand cat and the Bengal tiger. As described above, these additions have led to the discovery of novel sulcal patterns, including the felid-specific diagonal sulcus.

      We now also specify the fissiped families currently missing from the collection, which can be readily incorporated using our existing sulcal framework. The same applies to pinniped species, which we are currently investigating to support broader macro-level comparisons across the order. 

      Main changes in the revised manuscript:

      General discussion, p. 23: Comparative neuroimaging requires balancing the level of anatomical detail with the breadth of species. The present sample represents the most comprehensive collection of fissiped carnivoran brains to date, encompassing a wide range of land-dwelling species from eight families. It includes diverse representatives, such as both social and solitary mongooses, weasel-like and non-weasel mustelids, and a broad array of canids, including wolf-like, fox-like, and more basal forms of canids. The framework and detailed protocols developed in this study are designed to facilitate navigation of additional fissiped species, such as Viverridae, Eupleridae, Mephitidae, Nandiniidae, and Prionodontidae. Moreover, the approach can be readily extended to aquatic carnivorans, enabling broader macro-level comparisons across the order.

      Apart from these broader issues, I also found some of the figures difficult to interpret in many instances. For example, the colour scheme used to highlight sulci is not colourblind friendly for Figures 2 and 3. It was also difficult for me to glean much information from Figure 6. I understand that functional regions of the cortex are shown for those species that were subject to electrophysiological studies in the past, but I could not work out how to transfer that data to the other brains. One suggestion for improving this would be to highlight putative cortical regions on the other brains in a lighter shade of the same colours. 

      We have carefully revised our figures to improve clarity and accessibility, particularly for individuals with colour vision deficiencies. Specifically, we have added numerical labels alongside the coloured sulci labels in Figures 2 and 3, as well as in all related supplementary figures (see examples on the following pages). For sulci that merge, such as the marginal, ansate, and coronal sulci, we have used colour combinations that are distinguishable across all major types of colour-blindness. Figure 4 has also been updated with a colour-blind-friendly palette and additional numerical labels for the gyri to further enhance interpretability.

      Regarding Figure 6, we have updated the colour palette to ensure accessibility and have labelled all landmark sulci discussed in the main text using acronyms (e.g., the postcruciate sulcus as the boundary between S1 and M1). This is intended to facilitate the transfer of information between brains and guide orientation for readers less familiar with these structures. While we appreciate the suggestion to highlight putative cortical regions on other brains, we have opted not to do so. Our concern is that such visual cues, even when rendered in lighter shades, may be misinterpreted as established rather than hypothetical regional boundaries. We believe this more conservative approach appropriately reflects the current evidence base and avoids unintentionally overstating the certainty of functional homologies.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Recruitment of neutrophils to the lungs is known to drive susceptibility to infection with M. tuberculosis. In this study, the authors present data in support of the hypothesis that neutrophil production of the cytokine IL-17 underlies the detrimental effect of neutrophils on disease. They claim that neutrophils harbor a large fraction of Mtb during infection, and are a major source of IL-17. To explore the effects of blocking IL-17 signaling during primary infection, they use IL-17 blocking antibodies, SR2211 (an inverse agonist of Th17 differentiation), and celecoxib, which they claim blocks Th17 differentiation, and observe modest improvements in bacterial burdens in both WT and IFN-γ deficient mice using the combination of IL-17 blockade with celecoxib during primary infection. Celecoxib enhances control of infection after BCG vaccination.

      Thank you for the summary.

      Strengths:

      The most novel finding in the paper is that treatment with celecoxib significantly enhances control of infection in BCG-vaccinated mice that have been challenged with Mtb. It was already known that NSAID treatments can improve primary infection with Mtb.

      Thank you.

      Weaknesses:

      The major claim of the manuscript - that neutrophils produce IL-17 that is detrimental to the host - is not strongly supported by the data. Data demonstrating neutrophil production of IL17 lacks rigor. 

      Our response: Neutrophil production of IL-17 is supported by two independent techniques in the current version: 

      (1) Flow cytometry: a large fraction of Ly6G<sup>+</sup>CD11b<sup>+</sup> cells from the lungs of Mtb-infected mice were also positive for IL-17 (Fig. 3C).

      (2) IFA co-staining of Ly6G<sup>+</sup> cells with IL-17 in the lung sections from Mtb-infected mice (Fig. 3E-G, Fig. 4H and Fig. 5I). For most of these IFA data, we provide quantified plots to show IL17<sup>+</sup>Ly6G<sup>+</sup> cells.

      (3) Most importantly, conditions that inhibited IL-17 levels and controlled infection also showed a decline in IL-17 staining in Ly6G<SUP>+</SUP> cells.

      Our attempts at an IL-17 ELISPOT assay were not very successful, and the assay needs further standardization. 

      Several independent publications support the production of IL-17 by neutrophils (Li et al. 2010; Katayama et al. 2013; Lin et al. 2011). For example, neutrophils have been identified as a source of IL-17 in human psoriatic lesions (Lin et al. 2011), in neuroinflammation induced by traumatic brain injury (Xu et al. 2023) and in several mouse models of infectious and autoimmune inflammation (Ferretti et al. 2003; Hoshino et al. 2008; Li et al. 2010).

      The experiments examining the effects of inhibitors of IL-17 on the outcome of infection are very difficult to interpret. First, treatment with IL-17 inhibitors alone has no impact on bacterial burdens in the lung, either in WT or IFN-γ KO mice. This suggests that IL-17 does not play a detrimental role during infection. Modest effects are observed using the combination of IL-17 blocking drugs and celecoxib, however, the interpretation of these results mechanistically is complicated. Celecoxib is not a specific inhibitor of Th17. Indeed, it affects levels of PGE2, which is known to have numerous impacts on Mtb infection separate from any effect on IL-17 production, as well as other eicosanoids. 

      The reviewer correctly says that Celecoxib is not a specific inhibitor of Th17. However, COX2 inhibition does have an effect on IL-17 levels, and numerous reports support this observation (Paulissen et al. 2013; Napolitani et al. 2009; Lemos et al. 2009).

      (1) The detrimental role of IL-17 is obvious in the IFNγ KO experiment, where IL-17 neutralization led to a significant improvement in the lung pathology.

      (2) In the highly susceptible IFNγ KO mice, IL-17 neutralization alone extended the survival of mice by ~10 days.

      (3) IL-17 production independent of IL-23 is known to require PGE2 (Paulissen et al. 2013; Polese et al. 2021). In either WT or IFNγ KO mice, in contrast to IL-17 levels, we observed a decline in IL-23 levels. The PGE2 dependence of IL-17 production is obvious in the WT mice, where celecoxib abrogated IL-17 production.

      (4) When the impact of celecoxib or IL-17 inhibition is judged by the cumulative readout of lung CFU, spleen CFU, Ly6G<sup>+</sup> cell recruitment, Ly6G<sup>+</sup> cell-resident Mtb pool and overall pathology, the effects are quite significant.

      (5) Finally, in the revised manuscript, we provide additional results on the effect of SR2211 in BCG-vaccinated animals. It shows the direct impact of IL-17 inhibition on the BCG vaccine efficacy in WT mice.

      Finally, the human data simply demonstrates that neutrophils and IL-17 both are higher in patients who experience relapse after treatment for TB, which is expected and does not support their specific hypothesis. 

      We disagree with the above statement. It also contradicts the reviewer’s own assessment in one of the comments below, where a protective role of IL-17 is referred to. The literature lacks consensus on whether IL-17 plays a protective or pathological role in TB. Therefore, higher IL-17 in patients who experienced relapse, death, or failed treatment outcomes was not an expected finding. We do not have evidence from human subjects on whether neutrophil-derived IL-17 has a pathological role similar to that observed in mice. However, higher IL-17 in failed-outcome cases confirms the central theme that IL-17 is pathological in both human and mouse models.

      The use of genetic ablation of IL-17 production specifically in neutrophils and/or IL-17R in mice would greatly enhance the rigor of this study. 

      The reviewer’s point is well-taken. Having a genetic ablation of IL-17 production, specifically in the neutrophils, would be excellent. At present, however, we lack this resource. For the revised manuscript, we include the data with SR2211, a direct inhibitor of RORγt and, therefore, IL-17, in BCG-vaccinated mice.

      The authors do not address the fact that numerous studies have shown that IL-17 has a protective effect in the mouse model of TB in the context of vaccination. 

      Yes, there are a few articles that report a protective effect of IL-17 in the mouse model of TB in the context of vaccination (Khader et al. 2007; Desel et al. 2011; Choi et al. 2020). This part was discussed in the original manuscript (in the Introduction section). For the revised manuscript, we also provide results from the experiment where we blocked IL-17 production by inhibiting RORγt using SR2211 in BCG-vaccinated mice. The results clearly show IL-17 as a negative regulator of BCG-mediated protective immunity. We believe the observed differences could arise because 1) in our study, we analysed IL-17 levels in the lung homogenates at late phases of infection, and 2) most published studies rely on ex vivo stimulation of immune cells to measure cytokine production, whereas we measured the cytokine levels directly in the lung homogenates. We will elaborate on these points in the revised version.

      Finally, whether and how many times each animal experiment was repeated is unclear.

      We provide the details of the number of experiments in the revised version. Briefly, the BCG vaccination experiment (Figure 1) and BCG vaccination with Celecoxib treatment experiment (Figure 6) were performed twice and thrice, respectively. The IL-17 neutralization experiment (Figure 4) and the SR2211 treatment experiment (Figure 5) were done once. We will add data from another SR2211 experiment in the revised version. 

      Reviewer #2 (Public review):

      Summary:

      In this study, Sharma et al. demonstrated that Ly6G+ granulocytes (Gra cells) serve as the primary reservoirs for intracellular Mtb in infected wild-type mice and that excessive infiltration of these cells is associated with severe bacteremia in genetically susceptible IFNγ-/- mice. Notably, neutralizing IL-17 or inhibiting COX2 reversed the excessive infiltration of Ly6G+Gra cells, mitigated the associated pathology, and improved survival in these susceptible mice. Additionally, Ly6G+Gra cells were identified as a major source of IL-17 in both wild-type and IFNγ-/- mice. Inhibition of RORγt or COX2 further reduced the intracellular bacterial burden in Ly6G+Gra cells and improved lung pathology.

      Of particular interest, COX2 inhibition in wild-type mice also enhanced the efficacy of the BCG vaccine by targeting the Ly6G+Gra-resident Mtb population.

      Thank you for the summary.

      Strengths:

      The experimental results showing improved BCG-mediated protective immunity through targeting IL-17-producing Ly6G+ cells and COX2 are compelling and will likely generate significant interest in the field. Overall, this study presents important findings, suggesting that the IL-17-COX2 axis could be a critical target for designing innovative vaccination strategies for TB.

      Thank you for highlighting the overall strengths of the study. 

      Weaknesses:

      However, I have the following concerns regarding some of the conclusions drawn from the experiments, which require additional experimental evidence to support and strengthen the overall study.

      Major Concerns:

      (1) Ly6G+ Granulocytes as a Source of IL-17: The authors assert that Ly6G+ granulocytes are the major source of IL17 in wild-type and IFN-γ KO mice based on colocalization studies of Ly6G and IL-17. In Figure 3D, they report approximately 500 Ly6G+ cells expressing IL-17 in the Mtb-infected WT lung. Are these low numbers sufficient to drive inflammatory pathology? Additionally, have the authors evaluated these numbers in IFN-γ KO mice? 

      Thank you for pointing out the numbers in Fig. 3D. It was our oversight to label the axis as No. of IL17<sup>+</sup> Ly6G<sup>+</sup> Gra/lung. For the observation that Ly6G<sup>+</sup> Gra are the major source of IL-17 in TB, we have used two separate strategies: a) IFA and b) FACS. For this data, only a part of the lung was used. For the revised manuscript, we provide the number of these cells at the whole lung level from Mtb-infected WT mice. Unfortunately, we did not evaluate these numbers in IFN-γ KO mice through FACS. 

      Our efforts to perform the IL-17 ELISPOT assay on the sorted Ly6G<sup>+</sup>Gra from the lungs of Mtb-infected WT mice were unsuccessful. However, we provide a quantified representation of IFA of the tissue sections to emphasize the role of Ly6G<sup>+</sup> cells in IL-17 production in TB pathogenesis. 

      (2) Role of IL-17-Producing Ly6G Granulocytes in Pathology: The authors suggest that IL-17-producing Ly6G granulocytes drive pathology in WT and IFN-γ KO mice. However, the data presented only demonstrate an association between IL-17<sup>+</sup> Ly6G cells and disease pathology. To strengthen their conclusion, the authors should deplete neutrophils in these mice to show that IL-17 expression, and consequently the pathology, is reduced.

      Thank you for this suggestion. Neutrophil depletion studies in TB remain inconclusive. In some studies, neutrophil depletion helps the pathogen (Rankin et al. 2022; Pedrosa et al. 2000; Appelberg et al. 1995), and in others, it helps the host (Lovewell et al. 2021; Mishra et al. 2017). One reason for this variability is the stage of infection when neutrophil depletion was done. However, another crucial factor is the heterogeneity in the neutrophil population. There are reports that suggest neutrophil subtypes with protective versus pathological trajectories (Nwongbouwoh Muefong et al. 2022; Lyadova 2017; Hellebrekers, Vrisekoop, and Koenderman 2018; Leliefeld et al. 2018). Depleting the entire population using anti-Ly6G could impact this heterogeneity and may impact the inferences drawn. 

      A better approach would be to characterise this heterogeneous population, efforts towards which could be part of a separate study. Another direct approach could be Ly6G<SUP>+</SUP>-specific deletion of IL-17 function as part of a separate study.

      For the revised manuscript, we provide results from the SR2211 experiment in BCG-vaccinated mice and other results to show the role of IL-17-producing Ly6G<SUP>+</SUP> Gra in TB pathology.   

      (3) IL-17 Secretion by Mtb-Infected Neutrophils: Do Mtb-infected neutrophils secrete IL-17 into the supernatants? This would serve as confirmation of neutrophil-derived IL-17. Additionally, are Ly6G<SUP>+</SUP> cells producing IL-17 and serving as pathogenic agents exclusively in vivo? The authors should provide comments on this.

      Secretion of IL-17 by Mtb-infected neutrophils in vitro has been reported earlier (Hu et al. 2017). Our efforts to do a neutrophil IL-17 ELISPOT assay were not successful, and we are still standardising it. Whether there are a few neutrophil roles exclusively seen under in vivo conditions is an interesting proposition.

      (4) Characterization of IL-17-Producing Ly6G+ Granulocytes: Are the IL-17-producing Ly6G+ granulocytes a mixed population of neutrophils and eosinophils, or are they exclusively neutrophils? Sorting these cells followed by Giemsa or eosin staining could clarify this.

      This is a very important point. While usually eosinophils do not express Ly6G markers in laboratory mice, under specific contexts, including infections, eosinophils can express Ly6G. Since we have not characterized these potential Ly6G<SUP>+</SUP> sub-populations, that is one of the reasons we refer to the cell types as Ly6G<SUP>+</SUP> granulocytes, which do not exclude Ly6G<SUP>+</SUP> eosinophils. A detailed characterization of these subsets could be taken up as a separate study.

      Reviewer #3 (Public review):

      Summary:

      The authors examine how distinct cellular environments differentially control Mtb following BCG vaccination. The key findings are that IL17-producing PMNs harbor a significant Mtb load in both wild-type and IFNg<sup>-/-</sup> mice. Targeting IL17 and Cox2 improved disease and enhanced BCG efficacy over 12 weeks and neutrophils/IL17 are associated with treatment failure in humans. The authors suggest that targeting these pathways, especially in MSMD patients may improve disease outcomes.

      Thank you.

      Strengths:

      The experimental approach is generally sound and consists of low-dose aerosol infections with distinct readouts including cell sorting followed by CFU, histopathology, and RNA sequencing analysis. By combining genetic approaches and chemical/antibody treatments, the authors can probe these pathways effectively.

      Understanding how distinct inflammatory pathways contribute to control or worsen Mtb disease is important and thus, the results will be of great interest to the Mtb field

      Thank you.

      Weaknesses:

      A major limitation of the current study is overlooking the role of non-hematopoietic cells in the IFNg/IL17/neutrophil response. Chimera studies from Ernst and colleagues (Desvignes and Ernst 2009) previously described this IDO-dependent pathway following the loss of IFNg through an increased IL17 response. This study is not cited nor discussed even though it may alter the interpretation of several experiments.

      Thank you for pointing out this earlier study, which, we concede, we missed discussing. We disagree that results from that study alter the interpretation of several experiments in our study. On the contrary, the main observation, that loss of IFNγ leads to markedly elevated IL-17 levels, is consistent across both studies.

      IDO1 is known to alter T-helper cell differentiation towards Tregs and away from Th17 (Baban et al. 2009). It is absolutely feasible for the non-hematopoietic cells to regulate these events. However, that does not rule out the neutrophil production of IL-17 and the downstream pathological effect shown in this study. We have discussed and cited this study in the revised manuscript.

      Several of the key findings in mice have previously been shown (albeit with less sophisticated experimentation) and human disease and neutrophils are well described - thus the real new finding is how intracellular Mtb in neutrophils are more refractory to BCG-mediated control. However, given there are already high levels of Mtb in PMNs compared to other cell types, and there is a decrease in intracellular Mtb in PMNs following BCG immunization the strength of this finding is a bit limited.

      The reviewer’s interpretation of the BCG-refractory Mtb population in neutrophils is interesting. The reviewer is right that neutrophils had a higher intracellular Mtb burden, which decreased in the BCG-vaccinated animals. On that account, the reviewer rightly notes that BCG is able to control Mtb even in neutrophils. However, BCG almost clears the intracellular burden from the other cell types analysed, and therefore the remnant pool of intracellular Mtb in the lungs of BCG-vaccinated animals could be mostly that present in the neutrophils. This is a substantial novel development in the field and draws attention to innate immune cells as determinants of vaccine efficacy.

      References:

      Appelberg, R., A. G. Castro, S. Gomes, J. Pedrosa, and M. T. Silva. 1995. 'Susceptibility of beige mice to Mycobacterium avium: role of neutrophils', Infect Immun, 63: 3381-7.

      Baban, B., P. R. Chandler, M. D. Sharma, J. Pihkala, P. A. Koni, D. H. Munn, and A. L. Mellor. 2009. 'IDO activates regulatory T cells and blocks their conversion into Th17-like T cells', J Immunol, 183: 2475-83.

      Choi, H. G., K. W. Kwon, S. Choi, Y. W. Back, H. S. Park, S. M. Kang, E. Choi, S. J. Shin, and H. J. Kim. 2020. 'Antigen-Specific IFN-gamma/IL-17-Co-Producing CD4(+) T-Cells Are the Determinants for Protective Efficacy of Tuberculosis Subunit Vaccine', Vaccines (Basel), 8.

      Cruz, A., A. G. Fraga, J. J. Fountain, J. Rangel-Moreno, E. Torrado, M. Saraiva, D. R. Pereira, T. D. Randall, J. Pedrosa, A. M. Cooper, and A. G. Castro. 2010. 'Pathological role of interleukin 17 in mice subjected to repeated BCG vaccination after infection with Mycobacterium tuberculosis', J Exp Med, 207: 1609-16.

      Desel, C., A. Dorhoi, S. Bandermann, L. Grode, B. Eisele, and S. H. Kaufmann. 2011. 'Recombinant BCG DeltaureC hly+ induces superior protection over parental BCG by stimulating a balanced combination of type 1 and type 17 cytokine responses', J Infect Dis, 204: 1573-84.

      Desvignes, L., and J. D. Ernst. 2009. 'Interferon-gamma-responsive nonhematopoietic cells regulate the immune response to Mycobacterium tuberculosis', Immunity, 31: 974-85.

      Ferretti, S., O. Bonneau, G. R. Dubois, C. E. Jones, and A. Trifilieff. 2003. 'IL-17, produced by lymphocytes and neutrophils, is necessary for lipopolysaccharide-induced airway neutrophilia: IL-15 as a possible trigger', J Immunol, 170: 2106-12.

      Hellebrekers, P., N. Vrisekoop, and L. Koenderman. 2018. 'Neutrophil phenotypes in health and disease', Eur J Clin Invest, 48 Suppl 2: e12943.

      Hoshino, A., T. Nagao, N. Nagi-Miura, N. Ohno, M. Yasuhara, K. Yamamoto, T. Nakayama, and K. Suzuki. 2008. 'MPO-ANCA induces IL-17 production by activated neutrophils in vitro via classical complement pathway-dependent manner', J Autoimmun, 31: 79-89.

      Hu, S., W. He, X. Du, J. Yang, Q. Wen, X. P. Zhong, and L. Ma. 2017. 'IL-17 Production of Neutrophils Enhances Antibacteria Ability but Promotes Arthritis Development During Mycobacterium tuberculosis Infection', EBioMedicine, 23: 88-99.

      Hult, C., J. T. Mattila, H. P. Gideon, J. J. Linderman, and D. E. Kirschner. 2021. 'Neutrophil Dynamics Affect Mycobacterium tuberculosis Granuloma Outcomes and Dissemination', Front Immunol, 12: 712457.

      Katayama, M., K. Ohmura, N. Yukawa, C. Terao, M. Hashimoto, H. Yoshifuji, D. Kawabata, T. Fujii, Y. Iwakura, and T. Mimori. 2013. 'Neutrophils are essential as a source of IL-17 in the effector phase of arthritis', PLoS One, 8: e62231.

      Khader, S. A., G. K. Bell, J. E. Pearl, J. J. Fountain, J. Rangel-Moreno, G. E. Cilley, F. Shen, S. M. Eaton, S. L. Gaffen, S. L. Swain, R. M. Locksley, L. Haynes, T. D. Randall, and A. M. Cooper. 2007. 'IL-23 and IL-17 in the establishment of protective pulmonary CD4+ T cell responses after vaccination and during Mycobacterium tuberculosis challenge', Nat Immunol, 8: 369-77.

      Leliefeld, P. H. C., J. Pillay, N. Vrisekoop, M. Heeres, T. Tak, M. Kox, S. H. M. Rooijakkers, T. W. Kuijpers, P. Pickkers, L. P. H. Leenen, and L. Koenderman. 2018. 'Differential antibacterial control by neutrophil subsets', Blood Adv, 2: 1344-55.

      Lemos, H. P., R. Grespan, S. M. Vieira, T. M. Cunha, W. A. Verri, Jr., K. S. Fernandes, F. O. Souto, I. B. McInnes, S. H. Ferreira, F. Y. Liew, and F. Q. Cunha. 2009. 'Prostaglandin mediates IL-23/IL-17-induced neutrophil migration in inflammation by inhibiting IL-12 and IFNgamma production', Proc Natl Acad Sci U S A, 106: 5954-9.

      Li, L., L. Huang, A. L. Vergis, H. Ye, A. Bajwa, V. Narayan, R. M. Strieter, D. L. Rosin, and M. D. Okusa. 2010. 'IL-17 produced by neutrophils regulates IFN-gamma-mediated neutrophil migration in mouse kidney ischemia-reperfusion injury', J Clin Invest, 120: 331-42.

      Lin, A. M., C. J. Rubin, R. Khandpur, J. Y. Wang, M. Riblett, S. Yalavarthi, E. C. Villanueva, P. Shah, M. J. Kaplan, and A. T. Bruce. 2011. 'Mast cells and neutrophils release IL-17 through extracellular trap formation in psoriasis', J Immunol, 187: 490-500.

      Lovewell, R. R., C. E. Baer, B. B. Mishra, C. M. Smith, and C. M. Sassetti. 2021. 'Granulocytes act as a niche for Mycobacterium tuberculosis growth', Mucosal Immunol, 14: 229-41.

      Lyadova, I. V. 2017. 'Neutrophils in Tuberculosis: Heterogeneity Shapes the Way?', Mediators Inflamm, 2017: 8619307.

      Mishra, B. B., R. R. Lovewell, A. J. Olive, G. Zhang, W. Wang, E. Eugenin, C. M. Smith, J. Y. Phuah, J. E. Long, M. L. Dubuke, S. G. Palace, J. D. Goguen, R. E. Baker, S. Nambi, R. Mishra, M. G. Booty, C. E. Baer, S. A. Shaffer, V. Dartois, B. A. McCormick, X. Chen, and C. M. Sassetti. 2017. 'Nitric oxide prevents a pathogen-permissive granulocytic inflammation during tuberculosis', Nat Microbiol, 2: 17072.

      Napolitani, G., E. V. Acosta-Rodriguez, A. Lanzavecchia, and F. Sallusto. 2009. 'Prostaglandin E2 enhances Th17 responses via modulation of IL-17 and IFN-gamma production by memory CD4+ T cells', Eur J Immunol, 39: 1301-12.

      Nwongbouwoh Muefong, C., O. Owolabi, S. Donkor, S. Charalambous, A. Bakuli, A. Rachow, C. Geldmacher, and J. S. Sutherland. 2022. 'Neutrophils Contribute to Severity of Tuberculosis Pathology and Recovery From Lung Damage Pre- and Posttreatment', Clin Infect Dis, 74: 1757-66.

      Paulissen, S. M., J. P. van Hamburg, N. Davelaar, P. S. Asmawidjaja, J. M. Hazes, and E. Lubberts. 2013. 'Synovial fibroblasts directly induce Th17 pathogenicity via the cyclooxygenase/prostaglandin E2 pathway, independent of IL-23', J Immunol, 191: 1364-72.

      Pedrosa, J., B. M. Saunders, R. Appelberg, I. M. Orme, M. T. Silva, and A. M. Cooper. 2000. 'Neutrophils play a protective nonphagocytic role in systemic Mycobacterium tuberculosis infection of mice', Infect Immun, 68: 577-83.

      Polese, B., B. Thurairajah, H. Zhang, C. L. Soo, C. A. McMahon, G. Fontes, S. N. A. Hussain, V. Abadie, and I. L. King. 2021. 'Prostaglandin E(2) amplifies IL-17 production by gammadelta T cells during barrier inflammation', Cell Rep, 36: 109456.

      Rankin, A. N., S. V. Hendrix, S. K. Naik, and C. L. Stallings. 2022. 'Exploring the Role of Low-Density Neutrophils During Mycobacterium tuberculosis Infection', Front Cell Infect Microbiol, 12: 901590.

      Xu, X. J., Q. Q. Ge, M. S. Yang, Y. Zhuang, B. Zhang, J. Q. Dong, F. Niu, H. Li, and B. Y. Liu. 2023. 'Neutrophil-derived interleukin-17A participates in neuroinflammation induced by traumatic brain injury', Neural Regen Res, 18: 1046-51.

      Reviewer #1 (Recommendations for the authors):

      All figures: Clear information about the number of repeat experiments for each figure must be included.

      We have provided the details of the number of repeat experiments in the revised version.

      Figure 1: The claim that neutrophils are a dominant cell type infected during Mtb infection of the lungs is undermined by the limited number of markers used to identify cell types. The gating strategy used to initially identify what cells are infected with Mtb divided cells into three categories; granulocytes (Ly6G<SUP>+</SUP> Cd11b<SUP>+</SUP>), CD64+MerTK+ macrophages, or Sca1+CD90.1+CD73+ (mesenchymal stem cells). This strategy leaves out monocyte populations that have been shown to be the dominant infected cells in other strategies (most recently, PMID: 36711606).

      Thank you for this important point. We agree that we did not assess the infected monocyte population, specifically the CD11c<SUP>+</SUP> population. Both CD11c<SUP>Hi</SUP> and CD11c<SUP>Lo</SUP> monocyte-derived cells appear to be important for Mtb infection in different studies (Lee et al., 2020; Zheng et al., 2024). Leaving out the CD11c<SUP>+</SUP> population in our assays was therefore a conscious decision to ensure clarity about the cell types being studied.

      In addition, substantial evidence from multiple studies indicates that Ly6G<SUP>+</SUP> granulocytes constitute the predominant infected population in the Mtb-infected lungs of both mice and humans (Lovewell et al., 2021; Eum et al., 2010). While monocytes may contribute to Mtb infection dynamics, our findings align with a growing body of research emphasizing the significant role of neutrophils as a dominant infected cell type in the lungs during TB pathology.

      Figure 1: Putting the data from separate panels together, it appears that very few bacteria are isolated from the three cell types in the lung, suggesting there may be some loss in the preparation steps. Why is the total sorted CFU from neutrophils, macrophages, and MSCs so low, <400 bacteria total, when the absolute CFU is so high? Is it because only a fraction of the lung is being sorted/plated?

      Yes, only a fraction of the lung was used for cell sorting and subsequent plating. The CFU plating from sorted cells also does not account for any bacteria growing extracellularly.

      Figure 3C: It is difficult to ascertain whether the gating on IL-17<SUP>+</SUP> cells is accurately identifying IL-17 producing cells. It is surprising, based on other published work, that the authors claim that almost half of CD45+CD11b-Ly6G- cells produce IL-17 in WT mice. It would be informative to show cell type-specific production of IL-17 in both WT and IFN-γ KO mice for comparison with the literature. Unstained/isotype controls for IL-17 staining should be shown. With this in mind, it is difficult to interpret the authors' claim that 80% of neutrophils produce IL-17.

      Thank you for the points above. We agree that it was surprising to see ~50% of CD45<SUP>+</SUP> CD11b<SUP>-</SUP>Ly6G<SUP>-</SUP> cells producing IL-17. We have now done multiple experiments to confirm that this number is actually less than 1% (~90 cells) in the uninfected mice and less than 4% (~4000 cells) in the Mtb-infected mice.

      Neutrophil-derived IL-17 production in Mtb-infected lungs is supported by two independent techniques in our current study: flow cytometry and immunofluorescence assay. While neutrophil production of IL-17 is rarely studied in the context of TB, it has been widely reported in several other settings (Gonzalez-Orozco et al., 2019; Li et al., 2010; Ramirez-Velazquez et al., 2013). We consistently get >60% IL-17-positive cells in the CD11b<SUP>+</SUP> Ly6G<SUP>+</SUP> population, specifically in the infected samples.

      To specifically address the reviewer’s concerns, we have now used an isotype control for IL-17 staining and show the specificity of IL-17A antibody binding. Author response image 1 is from uninfected mice, 8 weeks of age.

      Unfortunately, our efforts to establish an IL-17 ELISPOT assay from neutrophils were not successful, and the assay needs further standardisation. The new results are included in Fig. 3C-D and Fig. S2F-G in the revised manuscript.

      Author response image 1.

      Figure 3 D-H. Quantification of immunofluorescence microscopy should be provided.

      In the revised manuscript, we provide the quantification of IFA results.

      Figure 4: Effects on neutrophil numbers in IFN-γ KOs do not correlate with CFU reductions, suggesting there may be a neutrophil-independent mechanism.

      In the IFN-γ KO, we agree that the effect was less than dramatic. The immune dysfunction in the IFN-γ KO mice is too severe to see a strong reversal in the phenotype through interventions. 

      While we do not rule out a neutrophil-independent mechanism, neutrophil-dependent mechanisms appear to play an important role in light of the following observations:

      (a) Improved pathology and survival upon IL-17 neutralization, which further improves with the inclusion of celecoxib.

      (b) Loss of IL-17<sup>+</sup> Ly6G<sup>+</sup> cells upon IL-17 neutralization, which is more pronounced when combined with celecoxib.

      (c) Significant reduction in PMN number (shown by FACS) without any major impact on Th17 cell population upon IL-17 neutralization.

      Finally, we believe some of the observations may become stronger once we characterize the specific sub-population among the Ly6G<sup>+</sup> cells that correlates with pathology. For example, as shown in Figure 4I, FACS analysis of the Ly6G<sup>+</sup> cell population in Mtb-infected IFNγ<sup>-/-</sup> mice revealed a substantial subset of CD11b<sup>mid</sup> Ly6G<sup>hi</sup> cells, indicative of an immature neutrophil population (Scapini et al., 2016). Efforts are currently underway to identify these important subpopulations.

      Figure 4: Differences observed in the spleen cannot be connected to dissemination per se but instead could be a result of enhanced immune control in the spleen.

      Thank you for this important point. We have revised this section. The role of neutrophils in Mtb dissemination is an emerging area of research, with growing evidence suggesting that these cells contribute to the spread of Mtb beyond the lungs (Hult et al., 2021). We now note that the link between the observed spleen burden and dissemination is speculative at this juncture.

      Figure 4, 5: IL-17 neutralization alone has no effect on CFU in the lungs of Mtb-infected mice. While the combination of IL-17 neutralization and celecoxib has a very modest effect on CFU, the mechanism behind this observation is unclear. Further, the experiment shown has only 3 mice per group and it is unclear whether this (or any other) mouse experiment was repeated.

      For Fig. 4, the experiment was done with 3 mice/group. The IFN-γ KO mice were used to help identify the mechanism. IL-17 neutralisation or celecoxib treatment alone did not have any significant effect on the bacterial burden (in lungs or isolated PMNs). However, these treatments did show a significant effect on the number of PMNs recruited. The combination of IL-17 neutralisation and celecoxib led to about a one-log decrease in CFU, which is significant.

      For Fig. 5, we used SR2211 instead of anti-IL-17 Ab. This experiment used WT mice with 5 animals/group. Here, celecoxib and SR2211, each alone, produced a significant decline in the PMN-resident Mtb pool as well as in the spleen burden. Only in the lungs was the impact of SR2211 alone not significant.

      Figure 6: The decreases in CFU correlate with a decrease in neutrophils; nothing connects this to neutrophil production of IL-17.

      We now show quantification of observation in Fig. 5I, where in the WT mice, treatment with Celecoxib reduces the frequency of IL-17-producing Ly6G+ cells. In the revised manuscript, we also show direct evidence of SR2211 activity on BCG vaccine efficacy, which causes a significant decline in the Mtb burden in whole lung or in the isolated PMNs.

      Figure 7. The Human data shows that elevated neutrophil levels and elevated IL-17 levels are associated with treatment failure in TB patients. This is expected, and does not

      The literature lacks consensus on whether IL-17 plays a protective or pathological role in TB. Therefore, higher IL-17 in patients who experienced relapse, death, or failed treatment outcomes was not a foregone expectation. We do not have evidence from human subjects on whether neutrophil-derived IL-17 has a pathological role similar to that observed in mice. However, higher IL-17 in failed-outcome cases confirms the central theme that IL-17 is pathological in both humans and mouse models.

      Reviewer #2 (Recommendations for the authors):

      (1) Survival of IFN-γ-/- Mice: The survival of IFN-γ-/- mice up to 100 days following a challenge with ~100 CFU of H37Rv is quite unusual. Have the authors checked PDIM expression in their Mtb strain, given that several studies report earlier mortality in these mice?

      As shown in Fig. 4F, H37Rv-infected IFN-γ<sup>-/-</sup> mice survived for a little over 80 days. This is not unusual in light of the following:

      (1) In one study, IFN-γ<sup>-/-</sup> mice survived for about 40 days when a hypervirulent Mtb strain was used to infect them at 100-200 CFU by nose-only aerosol exposure (Nandi and Behar, 2011).

      (2) In another study, IFN-γ<sup>-/-</sup> mice survived for ~50 days; however, they were infected intravenously with H37Rv at 1-3x10<sup>5</sup> CFU (Kawakami et al., 2004).

      Thus, compared with the above observations, where IFN-γ<sup>-/-</sup> mice survived for a maximum of 50 days following hypervirulent or very high-dose infection, survival for ~80 days after aerosol infection with H37Rv at ~100 CFU is not unusual. The H37Rv cultures used in our study are always animal-passaged to ensure PDIM integrity.

      (2) Granuloma Scoring: The granuloma scores appear to represent the percentage of lesion area. Please clarify and, if necessary, amend this in the manuscript.

      The granuloma score is based on the calculation of the number of granulomatous infiltration and their severity. These are not % lesion area. We have added this detail in the revised manuscript.

      (3) Pathology Comparison in Figures 4F and 4G: Does the pathology shown in Figure 4G correspond to the same groups as in Figure 4F? The celecoxib group in Figure 4F and the WT group in Figure 4G seem to be missing. Please clarify.

      Figures 4F and 4G depict two independent experiments. For the time-to-death experiment, the animals had to be left undisturbed until death. The rest of the panels in Fig. 4 represent animals from the same experiment.

      (4) Effect of Celecoxib on Ly6G+ Cells: The authors demonstrated that celecoxib treatment reduces Ly6G+ cells and IL-17-producing Ly6G+ cells. Do Ly6G+ cells express EP2/EP4 receptors? Alternatively, could the reduction in IL-17-producing Ly6G+ cells be due to an improved bactericidal response in other innate cells? The authors should discuss this possibility.

      Yes, Ly6G<sup>+</sup> granulocytes express EP2/EP4 receptors (Lavoie et al., 2024), which mediate PGE<sub>2</sub> signaling. Prostaglandin E<sub>2</sub> (PGE<sub>2</sub>) is known to regulate neutrophil function and can enhance IL-17 production in various immune cells (Napolitani et al., 2009). However, the expression and functional role of EP2/EP4 receptors specifically on Ly6G<sup>+</sup> granulocytes in the context of Mtb infection require further investigation.

      The reviewer’s alternative suggestion, that the reduction in IL-17-producing Ly6G<sup>+</sup> cells following celecoxib treatment could be attributed to an improved bactericidal response in other innate immune cells, is attractive. While we did not experimentally rule out this possibility, reduced IL-17 was invariably associated with a reduced neutrophil-resident Mtb population, so a cell-autonomous mechanism operating in Ly6G<sup>+</sup> granulocytes appears highly likely.

      (5) Culture Conditions: The methods section indicates that bacteria were cultured in 7H9+ADC. Is there a specific reason why the Oleic acid supplement was not added, given that standard Mtb culture conditions typically use 7H9+OADC supplements? Please comment on this choice.

      It is standard microbiological practice to use 7H9+ADC for broth culture and 7H11+OADC for solid culture. Compared to broth culture, solid media are usually more stressful for bacteria because of hypoxia inside the growing colonies. Therefore, the solid media used are enriched in casein hydrolysate (as in 7H11) and oleic acid (OADC).

      Reviewer #3 (Recommendations for the authors):

      Major suggestion: To really determine the role of neutrophil IL17 will require depletion studies and chimera experiments. These are clearly a major undertaking. I believe making significant re-writes to alter the conclusions or reanalyze any data to determine the role of nonhematopoietic and hematopoietic cells in IL17 is needed. If the conclusions are left as is, further experimentation is needed to fully support those conclusions.

      Thank you for the suggestion. We have embarked on the specific deletion studies; however, as mentioned, this is a major undertaking and will take time. As suggested, we have discussed the results in accordance with the strength of evidence currently provided.

      References

      Eum, S.Y., J.H. Kong, M.S. Hong, Y.J. Lee, J.H. Kim, S.H. Hwang, S.N. Cho, L.E. Via, and C.E. Barry, 3rd. 2010. Neutrophils are the predominant infected phagocytic cells in the airways of patients with active pulmonary TB. Chest 137:122-128.

      Gonzalez-Orozco, M., R.E. Barbosa-Cobos, P. Santana-Sanchez, L. Becerril-Mendoza, L. Limon-Camacho, A.I. Juarez-Estrada, G.E. Lugo-Zamudio, J. Moreno-Rodriguez, and V. Ortiz-Navarrete. 2019. Endogenous stimulation is responsible for the high frequency of IL-17A-producing neutrophils in patients with rheumatoid arthritis. Allergy Asthma Clin Immunol 15:44.

      Hult, C., J.T. Mattila, H.P. Gideon, J.J. Linderman, and D.E. Kirschner. 2021. Neutrophil Dynamics Affect Mycobacterium tuberculosis Granuloma Outcomes and Dissemination. Front Immunol 12:712457.

      Kawakami, K., Y. Kinjo, K. Uezu, K. Miyagi, T. Kinjo, S. Yara, Y. Koguchi, A. Miyazato, K. Shibuya, Y. Iwakura, K. Takeda, S. Akira, and A. Saito. 2004. Interferon-gamma production and host protective response against Mycobacterium tuberculosis in mice lacking both IL-12p40 and IL-18. Microbes Infect 6:339-349.

      Lavoie, J.C., M. Simard, H. Kalkan, V. Rakotoarivelo, S. Huot, V. Di Marzo, A. Cote, M. Pouliot, and N. Flamand. 2024. Pharmacological evidence that the inhibitory effects of prostaglandin E2 are mediated by the EP2 and EP4 receptors in human neutrophils. J Leukoc Biol 115:1183-1189.

      Lee, J., S. Boyce, J. Powers, C. Baer, C.M. Sassetti, and S.M. Behar. 2020. CD11cHi monocyte-derived macrophages are a major cellular compartment infected by Mycobacterium tuberculosis. PLoS Pathog 16:e1008621.

      Li, L., L. Huang, A.L. Vergis, H. Ye, A. Bajwa, V. Narayan, R.M. Strieter, D.L. Rosin, and M.D. Okusa. 2010. IL-17 produced by neutrophils regulates IFN-gamma-mediated neutrophil migration in mouse kidney ischemia-reperfusion injury. J Clin Invest 120:331-342.

      Lovewell, R.R., C.E. Baer, B.B. Mishra, C.M. Smith, and C.M. Sassetti. 2021. Granulocytes act as a niche for Mycobacterium tuberculosis growth. Mucosal Immunol 14:229-241.

      Nandi, B., and S.M. Behar. 2011. Regulation of neutrophils by interferon-gamma limits lung inflammation during tuberculosis infection. The Journal of experimental medicine 208:2251-2262.

      Napolitani, G., E.V. Acosta-Rodriguez, A. Lanzavecchia, and F. Sallusto. 2009. Prostaglandin E2 enhances Th17 responses via modulation of IL-17 and IFN-gamma production by memory CD4+ T cells. Eur J Immunol 39:1301-1312.

      Ramirez-Velazquez, C., E.C. Castillo, L. Guido-Bayardo, and V. Ortiz-Navarrete. 2013. IL-17-producing peripheral blood CD177+ neutrophils increase in allergic asthmatic subjects. Allergy Asthma Clin Immunol 9:23.

      Sadikot, R.T., H. Zeng, A.C. Azim, M. Joo, S.K. Dey, R.M. Breyer, R.S. Peebles, T.S. Blackwell, and J.W. Christman. 2007. Bacterial clearance of Pseudomonas aeruginosa is enhanced by the inhibition of COX-2. Eur J Immunol 37:1001-1009.

      Zheng, W., I.C. Chang, J. Limberis, J.M. Budzik, B.S. Zha, Z. Howard, L. Chen, and J.D. Ernst. 2023. Mycobacterium tuberculosis resides in lysosome-poor monocyte-derived lung cells during chronic infection. bioRxiv 

      Zheng, W., I.C. Chang, J. Limberis, J.M. Budzik, B.S. Zha, Z. Howard, L. Chen, and J.D. Ernst. 2024. Mycobacterium tuberculosis resides in lysosome-poor monocyte-derived lung cells during chronic infection. PLoS Pathog 20:e1012205.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      The manuscript by Yang et al. investigates the relationship between multi-unit activity in the locus coeruleus, putatively noradrenergic locus coeruleus, hippocampus (HP), sharp-wave ripples (SWR), and spindles using multi-site electrophysiology in freely behaving male rats. The study focuses on SWR during quiet wake and non-REM sleep, and their relation to cortical states (identified using EEG recordings in frontal areas) and LC units.

      The manuscript highlights differential modulation of LC units as a function of HP-cortical communication during wake and sleep. They establish that ripples and LC units are inversely correlated to levels of arousal: wake, i.e., higher arousal correlates with higher LC unit activity and lower ripple rates. The authors show that LC neuron activity is strongly inhibited just before SWR is detected during wake. During non-REM sleep, they distinguish "isolated" ripples from SWR coupled to spindles and show that inhibition of LC neuron activity is absent before spindle-coupled ripples but not before isolated ripples, suggesting a mechanism where noradrenaline (NA) tone is modulated by HP-cortical coupling. This result has interesting implications for the roles of noradrenaline in the modulation of sleep-dependent memory consolidation, as ripple-spindle coupling is a mechanism favoring consolidation. The authors further show that NA neuronal activity is downregulated before spindles.

      Strengths:

      In continuity with previous work from the laboratory, this work expands our understanding of the activity of neuromodulatory systems in relation to vigilance states and brain oscillations, an area of research that is timely and impactful. The manuscript presents strong results suggesting that NA tone varies differentially depending on the coupling of HP SWR with cortical spindles. The authors place their findings back in the context of identified roles of HP ripples and coupling to cortical oscillations for memory formation in a very interesting discussion. The distinction of LC neuron activity between awake, ripple-spindle coupled events and isolated ripples is an exciting result, and its relation to arousal and memory opens fascinating lines of research.

      Weaknesses:

      I regretted that the paper fell short of trying to push this line of idea a bit further, for example, by contrasting in the same rats the LC unit-HP ripple coupling during exploration of a highly familiar context (as seemingly was the case in their study) versus a novel context, which would increase arousal and trigger memory-related mechanisms. Any kind of manipulation of arousal levels and investigation of the impact on awake vs non-REM sleep LC-HP ripple coordination would considerably strengthen the scope of the study.

      We agree that conducting specific behavioral tests before electrophysiological recordings, as well as manipulating arousal during the recording session, would strengthen the study. These experiments are planned for future work, and we will acknowledge this point in the discussion.

      The main result shows that LC units are not modulated during non-REM sleep around spindle-coupled ripples (named spRipples, 17.2% of detected ripples); they also show that LC units are modulated around ripple-coupled spindles (ripSpindles, proportion of detected spindles not specified, please add). These results seem in contradiction; this point should be addressed by the authors.

      We found that LC suppression was generally weak around both types of coupled events (spRipples and ripSpindles). Specifically, session-averaged spRipple-associated LC suppression reached significance (exceeding the 95% CI) in 4 (n = 3 rats) out of 20 sessions (Line 177). Significant ripSpindle-associated LC suppression was observed in 3 (n = 2 rats) out of 20 sessions (Line 213). When comparing the modulation index (MI) around spRipples and ripSpindles, we found a significant correlation (Pearson r = 0.72, p = 0.0003). As shown in Author response image 1 below, the three sessions (blue squares, MI < 95% CI) with significant ripSpindle-associated LC suppression coincide with sessions showing LC modulation around spRipples. Although the detection of coupled events was performed independently, some overlap cannot be excluded. We will be happy to provide this additional information in the Results section.

      Author response image 1.

      Results are displayed per recording session, with 20 sessions total recorded from 7 rats (2 to 8 sessions per rat), which implies that one of the rats accounts for 40% of the dataset. Authors should provide controls and/or data displayed as average per rat to ensure that results are not skewed by the weight of that single rat in the results.

      Since high-quality recordings from the LC in behaving rats are challenging and rare, we used all valid sessions for this study. In Author response image 2 below, we plotted the average MIs for each animal (A) and each session (B). The dashed lines indicate the mean ± 2 standard deviations across all sessions. The rat ID and number of sessions are indicated in parentheses in A. All animal-averaged MIs fall within this range, indicating that the MI distribution is not driven by a single animal (rat 1101, 8 sessions). The MIs of the eight sessions from rat 1101 are shown as grey-filled triangles (B). Comparison of the MI distribution for these eight sessions versus the remaining 12 sessions from the six other animals revealed no significant difference (Kolmogorov-Smirnov test, p = 0.969). We will be happy to provide this additional information in the Results section.

      Author response image 2.

      In its current form, the manuscript presents a lack of methodological detail that needs to be addressed, as it clouds the understanding of the analysis and conclusions. For example, the method to account for the influence of cortical state on LC MUA is unclear, both for the exact methods (shuffling of the ripple or spindle onset times) and how this minimizes the influence of cortical states; this should be better described. If the authors wish to analyze unit modulation as a function of cortical state, could they also identify/sort based on cortical states and then look at unit modulation around ripple onset? For the first part of the paper, was an analysis performed on quiet wake, non-REM sleep, or both?

      As shown in Figure 3A and described in the main text (Lines 113–116), LC firing rate was negatively correlated with the Synchronisation Index (SI), our measure of cortical state, whereas ripple rate was positively correlated with SI. When computing LC activity (0.05 sec bins) aligned to the ripple onset over a longer time window ([–12, 12] sec), we observed a slow decrease in the LC firing rate beginning as early as 10 s before the ripple onset. In Author response image 3 below, a blue trace shows this slower temporal dynamic in a representative session. In addition to LC activity modulation at this relatively slow temporal scale, we also observed a much sharper drop in the LC firing rate ~2 s before the ripple onset. Given these two temporal scales, we hypothesized that the slow modulation of LC activity might be related to fluctuations of the global brain state. Specifically, a higher SI (more synchronized cortical population activity) corresponded to a lower arousal state and reduced LC tonic firing; this brain state was associated with higher ripple activity. Thus, slow LC modulation was likely driven by cortical state transitions. To correct for the influence of the global brain state on the LC/ripple temporal dynamics, we generated surrogate events by jittering the times of detected ripples (Lines 415–421). First, we confirmed that the cortical state did not differ around ripples and surrogate events (Figure 3C), while the hippocampal LFP triggered on the surrogate events lacked the ripple-specific frequency component (Figure 3B). Thus, LC activity around surrogate events captured its cortical-state-dependent dynamics (see orange trace in Author response image 3 below). Finally, to characterize state-independent ripple-related LC activity, we subtracted the state-related LC activity (orange trace in Author response image 3 below) from the ripple-triggered LC activity (blue trace). This yielded a corrected estimate of ripple-associated LC activity that was largely free from the confounding influence of cortical state transitions.
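
      The surrogate-based correction described above can be sketched as follows, under stated assumptions. The jitter range, the number of surrogate sets, and all function names are hypothetical, since the response does not give them (the manuscript's Lines 415–421 presumably do). The sketch only illustrates the logic: build a ripple-triggered rate, build the same rate around jittered events, then subtract.

```python
import random


def peri_event_rate(spike_times, event_times, window=(-2.0, 2.0), bin_size=0.05):
    """Mean firing rate (spikes/s) in time bins aligned to event onsets."""
    start, stop = window
    n_bins = int(round((stop - start) / bin_size))
    counts = [0] * n_bins
    for event in event_times:
        for spike in spike_times:
            offset = spike - event
            if start <= offset < stop:
                idx = min(int((offset - start) / bin_size), n_bins - 1)
                counts[idx] += 1
    n_events = max(len(event_times), 1)
    return [c / (n_events * bin_size) for c in counts]


def jitter_events(event_times, max_jitter, rng):
    """Surrogate events: each onset shifted by a uniform offset in [-max_jitter, max_jitter]."""
    return [t + rng.uniform(-max_jitter, max_jitter) for t in event_times]


def state_corrected_rate(spike_times, ripple_times, max_jitter=5.0,
                         n_surrogates=100, seed=0, **peth_kwargs):
    """Ripple-triggered rate minus the average surrogate-triggered rate.

    max_jitter and n_surrogates are assumed values, not taken from the study.
    """
    rng = random.Random(seed)
    ripple_rate = peri_event_rate(spike_times, ripple_times, **peth_kwargs)
    surrogate_sum = [0.0] * len(ripple_rate)
    for _ in range(n_surrogates):
        surrogate_events = jitter_events(ripple_times, max_jitter, rng)
        surrogate = peri_event_rate(spike_times, surrogate_events, **peth_kwargs)
        surrogate_sum = [a + b for a, b in zip(surrogate_sum, surrogate)]
    surrogate_rate = [s / n_surrogates for s in surrogate_sum]
    return [r - s for r, s in zip(ripple_rate, surrogate_rate)]
```

      Because the jittered events stay close to the real ripples, they sample a similar cortical state but lose the precise ripple timing, so the subtraction removes the slow state-related component and keeps the sharp ripple-locked one.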

      Author response image 3.

      In the Results subsection “LC-NE neuron spiking is suppressed around hippocampal ripples”, we reported LC modulation without accounting for the cortical state. The state-dependent effects were instead examined in the subsequent subsection, “Peri-ripple LC modulation depends on the cortical–hippocampal interaction,” where we characterized LC activity around ripples across different cortical states (quiet wake and NREM sleep). We will provide more methodological details and a rationale for each analysis, as requested.

      Reviewer #2 (Public review):

      Summary:

      In this study, the authors studied the synchrony between ripple events in the Hippocampus, cortical spindles, and Locus Coeruleus spiking. The results in this study, together with the established literature on the relationship of hippocampal ripples with widespread thalamic and cortical waves, guided the authors to propose a role for Locus Coeruleus spiking patterns in memory consolidation. The findings provided here, i.e., correlations between LC spiking activity and Hippocampal ripples, could provide a basis for future studies probing the directional flow or the necessity of these correlations in the memory consolidation process. Hence, the paper provides enough scientific advances to highlight the elusive yet important role of Norepinephrine circuitry in the memory processes.

      Strengths:

      The authors were able to demonstrate correlations of Locus Coeruleus spikes with hippocampal ripples as well as with cortical spindles. A specific strength of the paper is in the demonstration that the spindles that activate with the ripples are comparatively different in their correlations with Locus Coeruleus than those that do not.

      Weaknesses:

      The claims regarding the roles of these specific interactions were mostly derived from the literature that these processes individually contribute to the memory process, without any evidence of these specific interactions being necessary for memory processes. There are also issues with the description of methods, validation of shuffling procedures, and unclear presentation and the interpretation of the findings, which are described in the points that follow. I believe addressing these weaknesses might improve and add to the strength of the findings.

      We believe that our responses to Reviewer 1 and the planned revisions described above will adequately address the issues raised by Reviewer 2.

      Reviewer #3 (Public review):

      Summary:

      This manuscript examines how locus coeruleus (LC) activity relates to hippocampal ripple events across behavioral states in freely moving rats. Using multi-site electrophysiological recordings, the authors report that LC activity is suppressed prior to ripple events, with the magnitude of suppression depending on the ripple subtype. Suppression is stronger during wakefulness than during NREM sleep and is least pronounced for ripples coupled to spindles.

      Strengths:

      The study is technically competent and addresses an important question regarding how LC activity interacts with hippocampal and thalamocortical network events across vigilance states.

      Weaknesses:

      The results are interesting, but entirely observational. Also, the study in its current form would benefit from optimization of figure labeling and presentation, and more detailed result descriptions to make the findings fully interpretable. Also, it would be beneficial if the authors could formulate the narrative and central hypothesis more clearly to ease the line of reasoning across sections.

      We will do our best to optimize the presentation and to revise the main text and figure labelling. Where appropriate, we will add specific hypotheses and a rationale for each analysis.

      Comments:

      (1) Stronger evidence that recorded units represent noradrenergic LC neurons would reinforce the conclusions. While direct validation may not be possible, showing absolute firing rates (Hz) across quiet wake, active wake, NREM, and REM, and comparing them to published LC values, would help.

      We will provide the requested data in the revised manuscript.

      (2) The analyses rely almost exclusively on z-scored LC firing and short baselines (~4-6 s), which limits biological interpretation. The authors should include absolute firing rates alongside normalized values for peri-ripple and peri-spindle analyses and extend pre-event windows to at least 20-30 s to assess tonic firing evolution. This would clarify whether differences across ripple subtypes arise from ceiling or floor effects in LC activity; if ripples require LC silence, the relative drop will appear larger during high-firing wake states. This limitation should be discussed and, if possible, results should be shown based on unnormalized firing rates.

      We can provide absolute firing rates alongside normalized values for peri-ripple and peri-spindle analyses for isolated single LC units. However, we are reluctant to average absolute firing rates for multiunit activity, as it is unknown how many neurons contributed to each MUA recording. We can add the plots with extended pre-event windows ([–12, 12] sec). Please see our response to Reviewer 1 about the two temporal scales of LC modulation.

      (3) Because spindles often occur in clusters, the timing of ripple occurrence within these clusters could influence LC suppression. Indicate whether this structure was considered or discuss how it might affect interpretation (e.g., first vs. subsequent ripples within a spindle cluster).

      We did not consider spindle clusters and classified an event as a ripple-coupled spindle if the ripple occurred between the spindle onset and offset. We will clarify this point in the Methods section.

      (4) While the observational approach is appropriate here, causal tests (e.g., optogenetic or chemogenetic manipulation of LC around ripple events and in memory tasks) would considerably strengthen the mechanistic conclusions. At a minimum, a discussion of how such approaches could address current open questions would improve the manuscript.

      We agree that conducting causal tests would strengthen the study. We will acknowledge this in the Discussion and note that our results should motivate future studies addressing the many open questions.

      (5) Please show how "Synchronization Index" (SI) differs quantitatively across behavioral states (wake, NREM, REM) and discuss whether it could serve as a state classifier. This would strengthen interpretations of the correlations between SI, ripple occurrence, and LC activity.

      We will add the plot showing the average SI values across behavioral states. Although SI could potentially serve as a classifier, we have chosen not to discuss this in detail to maintain focus in the discussion.

      (6) The current use of SI to denote a delta/gamma power ratio is unconventional, as "SI" typically refers to phase-locking metrics. Consider adopting a more standard term, such as delta/gamma power ratio. Similarly, it would be easier to follow if you use common terminology (AUC) to describe the drop in LC-MUA rather than using "MI" and "sub-MI".

      The ranges of delta and gamma bands might vary across studies; therefore, we prefer using SI, as defined here and in our previous publications (Yang, 2019; Novitskaya, 2012). We calculated the modulation index (MI) as the area under the curve of the peri-event time histogram within the 1 second preceding ripple onset. To avoid potential confusion with the AUC calculated over the entire signal window, we opted to use MI. 
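
      The MI definition above (area under the peri-event time histogram within the 1 second preceding ripple onset) can be written out as a short sketch. The function name, the rectangle-rule integration, and the choice of input (z-scored or raw rate) are assumptions for illustration; the response does not specify them.

```python
def modulation_index(rate, bin_centres, bin_size, pre_window=1.0):
    """Area under a peri-event rate curve in the `pre_window` seconds before onset.

    rate:        peri-event rate per bin (z-scored or raw; an assumption either way)
    bin_centres: bin centre times in seconds, with event onset at 0
    bin_size:    bin width in seconds
    Uses a simple rectangle rule: sum of rate * bin width over the pre-onset bins.
    """
    return sum(r * bin_size
               for r, t in zip(rate, bin_centres)
               if -pre_window <= t < 0.0)
```

      With this convention a negative MI indicates suppression before ripple onset, and restricting the sum to the pre-onset second is what distinguishes it from an AUC taken over the whole signal window.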

      (7) The logic in Figure 3 is difficult to follow. The brain state (delta/gamma ratio) appears unchanged relative to surrogate events (3C), while LC activity that is supposedly negatively correlated to delta/gamma changes markedly (3D-E). Could this discrepancy reflect the low temporal resolution (4-s windows) used to calculate delta/gamma when the changes occur on a shorter time scale?

      Figures 3D and 3E show the 'state-corrected' ripple-related LC activity. Specifically, the cortical-state-related LC modulation was subtracted from the non-corrected ripple-associated LC activity. Please see our detailed response to Reviewer 1. We will revise the results and the Figure 3 legend to clarify this point.

      (8) There are apparent inconsistencies between Figures 4B and 4C-D. In B, it seems that the difference between the 10th and 90th percentile is mostly in higher frequencies, but in C and D, the only significant difference is in the delta band.

      We will re-do this analysis and clarify this inconsistency.

      (9) Because standard sleep scoring is based on EEG and EMG signals, please include an example of sleep scoring alongside the data used for state classification. It would also be relevant to include the delta/gamma power ratio in such an example plot.

      We removed ‘standard’ and will add a supplementary Figure illustrating sleep scoring.

      (10) Can variability in modulation index (subMI) across ripple subsets reflect differences in recording quality? Please report and compare mean LC firing rates across subsets to confirm this is not a confounding factor.

      We will plot this result averaged per rat.

      (11) Figure 6B: If the brown trace represents LC-MUA activity around random time points, why would there be a coinciding negative peak as relative to real sleep spindles? Or is it the subtracted trace?

      We will clarify this point in the figure legend.

      (12) On page 8, lines 207-209, the authors write "Importantly, neither the LC-MUA rate nor SIs differed during a 2-sec time window preceding either group of spindles". It is unclear which data they refer to, but the statement seems to contradict Figure 6E as well as the following sentence: "Across sessions, MI values exceeded 95% CI in 17/20 datasets for isoSpindles and only 3/20 for ripSpindles". This should be clarified.

      We will clarify the description of this result.

      (13) The results in Figures 5C and 6F do not align. It seems surprising that ripple-coupled spindles show a considerably higher LC modulation than spindle-coupled ripples, as these events should overlap. Could the discrepancy be due to Z-score normalization as mentioned above? Please include a discussion of this to help the interpretation of the results.

      We will clarify this point in the revised manuscript. Please also see our response to Reviewer 1.

      (14) The text implies that 8 recordings came from one rat and two each from six others. This should be confirmed, and it should be explained how the recordings were balanced and analyzed across animals.

      Since high-quality recordings from the LC in behaving animals are challenging and rare, we used all valid sessions. We will also present the main results averaged per rat, as also requested by Reviewer 1.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Syed et al. investigate the circuit underpinnings for leg grooming in the fruit fly. They identify two populations of local interneurons in the right front leg neuromere of the ventral nerve cord, i.e., 62 13A neurons and 64 13B neurons. Hierarchical clustering analysis identifies 10 morphological classes for both populations. Connectome analysis reveals their circuit interactions: these GABAergic interneurons provide synaptic inhibition either between the two subpopulations, i.e., 13B onto 13A, or among each other, i.e., 13As onto other 13As, and/or onto leg motoneurons, i.e., 13As and 13Bs onto leg motoneurons. Interestingly, 13A interneurons fall into two categories, with one providing inhibition onto a broad group of motoneurons, being called "generalists", while others project to a few motoneurons only, being called "specialists". Optogenetic activation and silencing of both subsets strongly affect leg grooming. As well, activating or silencing subpopulations, i.e., 3 to 6 elements of the 13A and 13B groups, has marked effects on leg grooming, including frequency and joint positions, and even interrupting leg grooming. The authors present a computational model with the four circuit motifs found, i.e., feed-forward inhibition, disinhibition, reciprocal inhibition, and redundant inhibition. This model can reproduce relevant aspects of the grooming behavior.

      Strengths:

      The authors succeeded in providing evidence for neural circuits interacting by means of synaptic inhibition to play an important role in the generation of a fast rhythmic insect motor behavior, i.e., grooming. Two populations of local interneurons in the fruit fly VNC comprise four inhibitory circuit motifs of neural action and interaction: feed-forward inhibition, disinhibition, reciprocal inhibition, and redundant inhibition. Connectome analysis identifies the similarities and differences between individual members of the two interneuron populations. Modulating the activity of small subsets of these interneuron populations markedly affects the generation of the motor behavior, thereby exemplifying their important role in generating grooming.

      We thank the reviewer for their thoughtful and constructive evaluation of our work. 

      Weaknesses:

      Effects of modulating activity in the interneuron populations by means of optogenetics were conducted in the so-called closed-loop condition. This does not allow for differentiation between direct and secondary effects of the experimental modification in neural activity, as feedforward and feedback effects cannot be disentangled. To do so, open loop experiments, e.g., in deafferented conditions, would be important. Given that many members of the two populations of interneurons do not show one, but two or more circuit motifs, it remains to be disentangled which role the individual circuit motif plays in the generation of the motor behavior in intact animals.

      Our optogenetic experiments show a role for 13A/B neurons in grooming leg movements in an intact sensorimotor system, but we cannot yet differentiate between central and reafferent contributions. Activation of 13As or 13Bs disinhibits motor neurons, and that is sufficient to induce walking/grooming. Therefore, we can show a role for the disinhibition motif.

      Proprioceptive feedback from leg movements could certainly affect the function of these reciprocal inhibition circuits. Given the synapses we observe between leg proprioceptors and 13A neurons, we think this is likely.

      Our previous work (Ravbar et al 2021) showed that grooming rhythms in dusted flies persist when sensory feedback is reduced, indicating that central control is possible. In those experiments, we used dust to stimulate grooming and optogenetic manipulation to broadly silence sensory feedback. We cannot do the same here because we do not yet have reagents to separately activate sparse subsets of inhibitory neurons while silencing specific proprioceptive neurons. More importantly, globally silencing proprioceptors would produce pleiotropic effects and severely impair baseline coordination, making it difficult to distinguish whether observed changes reflect disrupted rhythm generation or secondary consequences of impaired sensory input. Therefore, the reviewer is correct: we do not know whether the effects we observe are feedforward (central), sensory feedback, or both. We have revised the Results and Discussion sections to describe these possibilities and the limits of our current findings.

      Additionally, we have used a computational model to test the role of each motif separately and we show that in the results.

      Reviewer #2 (Public review):

      Summary:

      This manuscript by Syed et al. presents a detailed investigation of inhibitory interneurons, specifically from the 13A and 13B hemilineages, which contribute to the generation of rhythmic leg movements underlying grooming behavior in Drosophila. After performing a detailed connectomic analysis, which offers novel insights into the organization of premotor inhibitory circuits, the authors build on this anatomical framework by performing optogenetic perturbation experiments to functionally test predictions derived from the connectome. Finally, they integrate these findings into a computational model that links anatomical connectivity with behavior, offering a systems-level view of how inhibitory circuits may contribute to grooming pattern generation.

      Strengths:

      (1) Performing an extensive and detailed connectomic analysis, which offers novel insights into the organization of premotor inhibitory circuits.

      (2) Making sense of the largely uncharacterized 13A/13B nerve cord circuitry by combining connectomics and optogenetics is very impressive and will lay the foundation for future experiments in this field.

      (3) Testing the predictions from experiments using a simplified and elegant model.

      We thank the reviewer for their thoughtful and encouraging evaluation of our work. 

      Weaknesses:

      (1) In Figure 4, while the authors report statistically significant shifts in both proximal inter-leg distance and movement frequency across conditions, the distributions largely overlap, and only in Panel K (13B silencing) is there a noticeable deviation from the expected 7-8 Hz grooming frequency. Could the authors clarify whether these changes truly reflect disruption of the grooming rhythm? 

      We reanalyzed the dataset with Linear Mixed Models. We find significant differences in mean frequencies upon silencing these neurons but not upon activation. The experimental groups are also significantly more variable. We revised these panels with updated analysis. We think these data do support our interpretation that the grooming rhythms are disrupted. 

      More importantly, all this data would make the most sense if it were performed in undusted flies (with controls) as is done in the next figure.

      In our assay conditions, undusted flies groom infrequently. We used undusted flies for some optogenetic activation experiments, where neuron activation triggers behavior initiation. We chose to analyze the effect of silencing inhibitory neurons in dusted flies because dust reliably activates mechanosensory neurons and elicits robust grooming behavior, enabling us to assess how manipulation of 13A/B neurons alters grooming rhythmicity and leg coordination.

      (2) In Figure 4-Figure Supplement 1, the inclusion of walking assays in dusted flies is problematic, as these flies are already strongly biased toward grooming behavior and rarely walk. To assess how 13A neuron activation influences walking, such experiments should be conducted in undusted flies under baseline locomotor conditions.

      We agree that there are better ways to assay potential contributions of 13A/13B neurons to walking. We intended to focus on how normal activity in these inhibitory neurons affects coordination during grooming, and we included walking because we observed it in our optogenetic experiments and because it also involves rhythmic leg movements. The walking data is reported in a supplementary figure because we think this merits further study with assays designed to quantify walking specifically. We will make these goals clearer in the revised manuscript and we are happy to share our reagents with other research groups more equipped to analyze walking differences.

      (3) For broader lines targeting six or more 13A neurons, the authors provide specific predictions about expected behavioral effects-e.g., that activation should bias the limb toward flexion and silencing should bias toward extension based on connectivity to motor neurons. Yet, when using the more restricted line labeling only two 13A neurons (Figure 4 - Figure Supplement 2), no such prediction is made. The authors report disrupted grooming but do not specify whether the disruption is expected to bias the movement toward flexion or extension, nor do they discuss the muscle target. This is a missed opportunity to apply the same level of mechanistic reasoning that was used for broader manipulations.

      Because we cannot unambiguously identify one of the neurons from our sparsest 13A splitGAL4 lines in FANC, we cannot say with certainty which motor neurons they target. That limits the accuracy of any functional predictions.  

      (4) Regarding Figure 5: The 70ms on/off stimulation with a slow opsin seems problematic. CsChrimson off kinetics are slow and unlikely to cause actual activity changes in the desired neurons with the temporal precision the authors are suggesting they get. Regardless, it is amazing that the authors get the behavior! It would still be important for the authors to mention the optogenetics caveat, and potentially supplement the data with stimulation at different frequencies, or using faster opsins like ChrimsonR.

      We were also intrigued by the behavioral consequences of activating these inhibitory neurons with CsChrimson. We appreciate the reviewer’s point that CsChrimson’s slow off-kinetics limit precise temporal control. To address this, we repeated our frequency analysis using a range of pulse durations (10/10, 50/50, 70/70, 110/110, and 120/120 ms on/off) and compared the mean frequency of proximal joint extension/flexion cycles across conditions. We found no significant difference in frequency (LMMs, p > 0.05), suggesting that the observed grooming rhythm is not dictated by pulse period but instead reflects an intrinsic property of the premotor circuit once activated. We now include these results in ‘Figure 5—figure supplement 1’ and clarify in the text that we interpret pulsed activation as triggering, rather than precisely pacing, the endogenous grooming rhythm. We continue to note in the manuscript that CsChrimson’s slow off-kinetics may limit temporal precision. We will try ChrimsonR in future experiments.

      Overall, I think the strengths outweigh the weaknesses, and I consider this a timely and comprehensive addition to the field.

      Reviewer #3 (Public review):

      Summary:

      The authors set out to determine how GABAergic inhibitory premotor circuits contribute to the rhythmic alternation of leg flexion and extension during Drosophila grooming. To do this, they first mapped the ~120 13A and 13B hemilineage inhibitory neurons in the prothoracic segment of the VNC and clustered them by morphology and synaptic partners. They then tested the contribution of these cells to flexion and extension using optogenetic activation and inhibition and kinematic analyses of limb joints. Finally, they produced a computational model representing an abstract version of the circuit to determine how the connectivity identified in EM might relate to functional output. The study, in its current form, makes an important but overclaimed contribution to the literature due to a mismatch between the claims in the paper and the data presented.

      Strengths:

      The authors have identified an interesting question and use a strong set of complementary tools to address it:

      (1) They analysed serial‐section TEM data to obtain reconstructions of every 13A and 13B neuron in the prothoracic segment. They manually proofread over 60 13A neurons and 64 13B neurons, then used automated synapse detection to build detailed connectivity maps and cluster neurons into functional motifs.

      (2) They used optogenetic tools with a range of genetic driver lines in freely behaving flies to test the contribution of subsets of 13A and 13B neurons.

      (3) They used a connectome-constrained computational model to determine how the mapped connectivity relates to the rhythmic output of the behavior.

      Weaknesses:

      The manuscript aims to reveal an instructive, rhythm-generating role for premotor inhibition in coordinating the multi-joint leg synergies underlying grooming. It makes a valuable contribution, but currently, the main claims in the paper are not well-supported by the presented evidence.

      Major points

      (1) Starting with the title of this manuscript, "Inhibitory circuits generate rhythms for leg movements during Drosophila grooming", the authors raise the expectation that they will show that the 13A and 13B hemilineages produce rhythmic output that underlies grooming. This manuscript does not show that. For instance, to test how they drive the rhythmic leg movements that underlie grooming requires the authors to test whether these neurons produce the rhythmic output underlying behavior in the absence of rhythmic input. Because the optogenetic pulses used for stimulation were rhythmic, the authors cannot make this point, and the modelling uses a "black box" excitatory network, the output of which might be rhythmic (this is not shown). Therefore, the evidence (behavioral entrainment; perturbation effects; computational model) is all indirect, meaning that the paper's claim that "inhibitory circuits generate rhythms" rests on inferred sufficiency. A direct recording (e.g., calcium imaging or patch-clamp) from 13A/13B during grooming - outside the scope of the study - would be needed to show intrinsic rhythmogenesis. The conclusions drawn from the data should therefore be tempered. Moreover, the "black box" needs to be opened. What output does it produce? How exactly is it connected to the 13A-13B circuit? 

      We modified the title to better reflect our strongest conclusions: “Inhibitory circuits control leg movements during Drosophila grooming”.

      Our optogenetic activation was delivered in a patterned (70 ms on/off) fashion that entrains rhythmic movements, but this does not rule out the possibility that the rhythm is imposed externally. In the manuscript, we state that we used pulsed light to mimic a flexion-extension cycle and note that this approach tests whether inhibition is sufficient to drive rhythmic leg movements when temporally patterned. While this does not prove that 13A/13B neurons are intrinsic rhythm generators, it does demonstrate that activating subsets of inhibitory neurons is sufficient to elicit alternating leg movements resembling natural grooming and walking.

      Our goal with the model was to demonstrate that it is possible to produce rhythmic outputs with this 13A/B circuit, based on the connectome. The “black box” is a small recurrent neural network (RNN) consisting of 40 neurons in its hidden layer. The inputs are the “dust” levels from the environment (the green pixels in Figure 6I), the “proprioceptive” inputs (“efference copy” from motor neurons), and the amount of dust accumulated on both legs. The outputs (all positive) connect to the 13A neurons, the 13B neurons, and to the motor neurons. We refer to it as the “black box” because we make no claims about the actual excitatory inputs to these circuits. Its function is to provide input, needed to run the network, that reflects the distribution of “dust” in the environment as well as the information about the position of the legs.  
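
      To make the description of the “black box” concrete, here is a toy sketch of a small recurrent network with a 40-unit hidden layer and strictly positive outputs. It is not the authors' model. The input and output sizes, the weight initialisation, the tanh hidden units, and the softplus output are all assumptions chosen for illustration; only the hidden-layer size and the positivity of the outputs come from the response.

```python
import math
import random


class BlackBoxRNN:
    """Toy recurrent network: h_t = tanh(W_in x_t + W_rec h_{t-1}), positive outputs."""

    def __init__(self, n_inputs, n_outputs, n_hidden=40, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            # Hypothetical initialisation: Gaussian weights scaled by 1/sqrt(fan-in).
            scale = 1.0 / math.sqrt(cols)
            return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

        self.w_in = matrix(n_hidden, n_inputs)
        self.w_rec = matrix(n_hidden, n_hidden)
        self.w_out = matrix(n_outputs, n_hidden)
        self.hidden = [0.0] * n_hidden

    @staticmethod
    def _matvec(matrix, vector):
        return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

    def step(self, inputs):
        """Advance one time step and return the outputs sent to downstream units."""
        drive = [a + b for a, b in zip(self._matvec(self.w_in, inputs),
                                       self._matvec(self.w_rec, self.hidden))]
        self.hidden = [math.tanh(d) for d in drive]
        # Softplus is an assumed way to keep every output strictly positive.
        return [math.log1p(math.exp(o)) for o in self._matvec(self.w_out, self.hidden)]
```

      In such a sketch, the inputs would stand in for the dust levels and proprioceptive signals, and the outputs for the drive onto 13A, 13B, and motor neurons. Replacing `step` with a constant vector corresponds to the constant-input control mentioned in the following paragraph of the response.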

      The output of the “black box” component of the model might be rhythmic. In fact, in most instances of the model implementation this is indeed the case. However, as mentioned in the current version of the manuscript: “But the 13A circuitry can still produce rhythmic behavior even without those external inputs (or when set to a constant value), although the legs become less coordinated.” Indeed, when we refine the model (with the evolutionary training) without the “black box” (using a constant input of 0.1) the behavior is still rhythmic and sustained. Therefore, the rhythmic activity and behavior can emerge from the premotor circuitry itself without a rhythmic input.

      The context in which the 13A and 13B hemilineages sit also needs to be explained. What do we know about the other inputs to the motorneurons studied? What excitatory circuits are there? 

      We agree that there are many more excitatory and inhibitory, direct and indirect, connections to motor neurons that will also affect leg movements for grooming and walking. 13A neurons provide a substantial fraction of premotor input. For example, 13As account for ~17.1% of upstream synapses for one tibia extensor (femur seti) motor neuron and ~14.6% for another tibia extensor (femur feti) motor neuron. Our goal was to demonstrate what is possible from a constrained circuit of inhibitory neurons that we mapped in detail, and we hope to add additional components to better replicate the biological circuit as behavioral and biomechanical data is obtained by us and others.  

      Furthermore, the introduction ignores many decades of work in other species on the role of inhibitory cell types in motor systems. There is some mention of this in the discussion, but even previous work in Drosophila larvae is not mentioned, nor crustacean STG, nor any other cell types previously studied. This manuscript makes a valuable contribution, but it is not the first to study inhibition in motor systems, and this should be made clear to the reader.

      We thank the reviewer for this important reminder.  Previous work on the contribution of inhibitory neurons to invertebrate motor control certainly influenced our research. We have expanded coverage of the relevant history and context in our revised discussion.

      (2) The experimental evidence is not always presented convincingly, at times lacking data, quantification, explanation, appropriate rationales, or sufficient interpretation.

      We are committed to improving the clarity, rationale, and completeness of our experimental descriptions.  We have revisited the statistical tests applied throughout the manuscript and expanded the Methods.

      (3) The statistics used are unlike any I remember having seen, essentially one big t-test followed by correction for multiple comparisons. I wonder whether this approach is optimal for these nested, high‐dimensional behavioral data. For instance, the authors do not report any formal test of normality. This might be an issue given the often skewed distributions of kinematic variables that are reported. Moreover, each fly contributes many video segments, and each segment results in multiple measurements. By treating every segment as an independent observation, the non‐independence of measurements within the same animal is ignored. I think a linear mixed‐effects model (LMM) or generalized linear mixed model (GLMM) might be more appropriate.

      We thank the reviewer for raising this important point regarding the statistical treatment of our segmented behavioral data. Our initial analysis used independent t-tests with Bonferroni correction across behavioral classes and features, which allowed us to identify broad effects. However, we acknowledge that this approach does not account for the nested structure of the data. To address this, we re-analyzed key comparisons using linear mixed-effects models (LMMs) as suggested by the reviewer. This approach allowed us to more appropriately model within-fly variability and test the robustness of our conclusions. We have updated the manuscript based on the outcomes of these analyses.
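
      For readers unfamiliar with the approach, one common form of such a model is a random-intercept LMM. The response does not state the exact model specification, so the formula below is only an illustrative sketch, and all symbols are assumptions introduced here: $y_{ij}$ is a kinematic measurement from segment $j$ of fly $i$, $x_{ij}$ indicates the experimental condition, and $u_i$ is a per-fly random intercept that absorbs the non-independence of segments from the same animal.

$$
y_{ij} = \beta_0 + \beta_1 x_{ij} + u_i + \varepsilon_{ij}, \qquad u_i \sim \mathcal{N}(0, \sigma_u^2), \qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
$$

      Under this formulation, the condition effect $\beta_1$ is tested against between-fly variability rather than against the much larger number of individual segments.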

      (4) The manuscript mentions that legs are used for walking as well as grooming. While this is welcome, the authors then do not discuss the implications of this in sufficient detail. For instance, how should we interpret that pulsed stimulation of a subset of 13A neurons produces grooming and walking behaviours? How does neural control of grooming interact with that of walking?

      We do not know how the inhibitory neurons we investigated affect walking, or how circuits for the control of grooming and walking might compete. We speculate that overlapping pre-motor circuits may participate because both behaviors have similar extension-flexion cycles at similar frequencies, but we do not have experimental data to support this. This would be an interesting area for future research. Here, we focused on the consequences of activating specific 13A/B neurons during grooming because they were identified through a behavioral screen for grooming disruptions, and we had developed high-resolution assays and familiarity with the normal movements in this behavior.

      (5) The manuscript needs to be proofread and edited as there are inconsistencies in labelling in figures, phrasing errors, missing citations of figures in the text, or citations that are not in the correct order, and referencing errors (examples: 81 and 83 are identical; 94 is missing in text).

      We have proofread the manuscript to fix figure labeling, citation order, and referencing errors.

      Reviewing Editor Comments:

      In addition to the recommendations listed below, a common suggestion, given the lack of evidence to support that 13A and 13B are rhythm-generating, is to tone down the title to something like, for example, "Inhibitory circuits control leg movements during grooming in Drosophila" (or similar).

      We changed the title to "Inhibitory circuits control leg movements during Drosophila grooming".

      Reviewer #1 (Recommendations for the authors):

      (1) Naming of movements of leg segments:

      The authors refer to movements of leg segments across the leg, i.e., of all joints, as "flexion" and "extension". For example, in Figure 4A and at many other places. This naming is functionally misleading for two reasons: (i) the anatomical organization of an insect leg differs in principle from the organization of the mammalian leg, which the manuscript often refers to. While the organization of a mammalian limb is planar the organization of the insect limb shows a different plane as compared to the body length axis (for detailed accounts see Ritzmann et al. 2004; Büschges & Ache, 2024); (ii) the reader cannot differentiate between places in the text, where "flexion" and "extension" refer to movements of the tibia of the femur-tibia joint, e.g. in the graphical abstract, in Figure 3 and its supplements, and other places, e.g. Figure 4 and its supplements, where these two words refer to movements of leg segments of other joints, e.g. thorax-coxa, coxa-trochanter and tarsal joints. The reviewer strongly suggests naming the movements of the leg segments according to the individual joint and its muscles.

      We accept this helpful suggestion. We now include a description of the leg segments and joints in the revised Introduction and specify which leg segments we mean:

      “The adult Drosophila leg consists of serially arranged joints—bodywall/thoraco-coxal (Th-C), coxa–trochanter (C-Tr), trochanter–femur (Tr-F), femur–tibia (F-Ti), tibia–tarsus (Ti-Ta)—each powered by opposing flexor and extensor muscles that transmit force through tendons (Soler et al., 2004). The proximal joints, Th-C and C-Tr, mediate leg protraction–retraction and elevation–depression, respectively (Ritzmann et al., 2004; Büschges & Ache, 2025). The medial joint, F-Ti, acts as the principal flexion–extension hinge and is controlled by large tibia extensor motor neurons and flexor motor neurons (Soler et al., 2004; Baek and Mann 2009; Brierley et al., 2012; Azevedo et al., 2024; Lesser et al., 2024). By contrast, distal joints such as Ti-Ta and the tarsomeres contribute to fine adjustments, grasping, and substrate attachment (Azevedo et al., 2024).”

      We also clarified femur-tibia joints in the graphical abstract, modified Figure 3 legend and added joints at relevant places.

      (2)  Figures 3, 4, and 5 with supplements:

      The authors optogenetically silence and activate (sub)populations of 13A and 13B interneurons. Changes in frequency of movements and distance between legs or leg movements are interpreted as the effect of these experimental paradigms. No physiological recordings from leg motoneurons or leg muscles are shown. While I understand the notion of the authors to interpret a movement as the outcome of activity in a muscle, it needs to be remembered that it is well known that fast cyclic leg movements, including those for grooming, cannot be used to conclude on the underlying neural activity. Zakotnik et al. (2006) and others provided evidence that such fast cyclic movements can result from the interaction of the rhythmic activity of one leg muscle only, together with the resting tension of its silent antagonist. Given that no physiological recordings are presented, this needs to be mentioned in the discussion, e.g., in the section "Inhibitory Innervation Imbalance.......".

      We have added the studies by Heitler (1974), Bennet-Clark (1975), Zakotnik et al. (2006), and Page et al. (2008) to the Discussion.

      (3) Introduction and Discussion:

      The authors refer extensively to work on the mammalian spinal cord and compare their own work with circuit elements found in the spinal cord. From the perspective of the reviewer this notion is in conflict with acknowledging prior research work on the role of inhibitory network interactions for other invertebrates and lower vertebrates: such are locust flight system (for feedforward inhibition, disinhibition), crustacean stomatogastric nervous system (reciprocal inhibition), clione swimming system (reciprocal inhibition, feedforward inhibition, disinhibition), leech swimming system (reciprocal inhibition, disinhibition, feedforward inhibition), xenopus swimming system (reciprocal inhibition). The next paragraph illustrates this criticism/suggestion for stick insect neural circuits for leg stepping.

      (4) Discussion:

      "Feedforward inhibition" and "Disinhibition": it is already been described that rhythmic activity of antagonistic insect leg motoneuron pools arises from alternating synaptic inhibition and disinhibition of the motoneurons from premotor central pattern generating networks, e.g., Büschges (1998); Büschges et al. (2004); Ruthe et al. (2024).

      We have added these references to the revised Discussion.

      (5) Circuit motifs of the simulation, i.e., mutual inhibition between interneurons and onto motoneurons and sensory feedback influences and pathways share similarities to those formerly used by studies simulating rhythmic insect leg movements, for example, Schilling & Cruse 2020, 2023 or Toth et al. 2012. For the reader, it appears relevant that the progress of the new simulation is explained in the light of similarities and differences to these former approaches with respect to the common circuit motifs used.

      We now put our work in the context of other models in the Discussion section: “Similar circuit motifs, namely reciprocal inhibitions between pre-motor neurons and the sensory feedback have been modeled before, in particular neuroWalknet, and such simple motifs do not require a separate CPG component to generate rhythmic behavior in these models (Schilling & Cruse 2020, 2023). However, our model is much simpler than the neuroWalknet - it controls a 2D agent operating on an abstract environment (the dust distribution), without physics. In real animals or complex mechanical models such as NeuroMechFly (Lobato-Rios et al), a more explicit central rhythm generation may be advantageous for the coordination across many more degrees of freedom.”

      Reviewer #2 (Recommendations for the authors):

      I might have missed this, but I couldn't find any mention of how the grooming command pathways, described by previous work from the authors' lab, recruit these predicted grooming pattern-generating neurons. This should be mentioned in the connectome analysis and also discussed later in the discussion.

      13A neurons are direct downstream targets of previously described grooming command neurons. Specifically, the antennal grooming command neuron aDN (Hampel et al., 2015) synapses onto two primary 13As (γ and α; 13As-i) that connect to proximal extensor and medial flexor motor neurons, as well as four other 13As (9a, 9c, 9i, 6e) projecting to body wall extensor motor neurons. The 13As-i also form reciprocal connections with 13As-ii, providing a potential substrate for oscillatory leg movements. aDN connects to homologous 13As on both sides, consistent with the bilateral coordination needed for antennal sweeping. 

      The head grooming/leg rubbing command neuron DNg12 (Guo et al., 2022)  synapses directly onto ~50 13As, predominantly those connected to proximal motor neurons. 

      While the structural connectivity sometimes suggests pathways for generating rhythmic movements, the extensive interconnections among command neurons and premotor circuits indicate that multiple motifs could contribute to the observed behaviors. Further work will be needed to determine how these inputs are dynamically engaged during normal grooming sequences. We have now added this to the Discussion.

      I encourage the authors to be explicit about caveats wherever possible: e.g., ectopic expression in genetic tools, potential for other unexplored neurons as rhythm generators (rather than 13A/B), given that the authors never get complete silencing phenotypes, CsChrimson kinetics, neurotransmitter predictions, etc.

      We now explain these caveats as follows: Ectopic expression is noted in Figure 1—figure supplement 1, and we added the following to the Discussion: “While our experiments with multiple genetic lines labeling 13A/B neurons consistently implicate these cells in leg coordination, ectopic expression in some lines raises the possibility that other neurons may also contribute to this phenotype. In addition, other excitatory and inhibitory neural circuits, not yet identified, may also contribute to the generation of rhythmic leg movements. Future studies should identify such neurons that regulate rhythmic timing and their interactions with inhibitory circuits.”

      We also added a caveat regarding CsChrimson kinetics in the Results. Finally, our identification of these neurons as inhibitory is based on genetic access to the GABAergic population (we use GAD-spGAL4 as part of the intersection which targets them), rather than on predictions of neurotransmitter identity.

      Reviewer #3 (Recommendations for the authors):

      Detailed list of figure alterations:

      (1) Figure 1:

      (a) Figure 1B and Figure 1 - Figure Supplement 1 lack information on individual cells - how can we tell that the cells targeted are indeed 13A and 13B, and which ones they are? Since off-target expression in neighboring hemilineages isn't ruled out, the interpretation of results is not straightforward.

      The neurons labeled by R35G04-DBD and GAD1-AD are identified as 13A and 13B based on their stereotyped cell body positions and characteristic neurite projections into the neuropil, which match those of 13A and 13B neurons reconstructed in the FANC and MANC connectomes. While we have not generated flip-out clones in this genotype, we do isolate 13A neurons more specifically later in the manuscript using R35G04-DBD intersected with Dbx-AD, and show single-cell morphology consistent with identified 13A neurons. The purpose of including this early figure was to motivate the study by showing that silencing this population, which includes 13A/13B neurons, strongly reduces grooming in dusted flies.

      Regarding Figure 1—Figure Supplement 1:

      This figure shows the expression patterns of all lines used throughout the manuscript. Panels C and D illustrate lines with minimal to no ectopic expression. Panels A and B show neurons with posterior cell bodies that may correspond to 13A neurons not reconstructed in our dataset but described in Soffers et al., 2025 and Marin et al., 2025. We have provided detailed information about all VNC expression in the figure legend.

      (b) Figure 1D lacks explanation of boxplots, asterisks, genotypes/experimental design.

      Added.

      (c) Figures 1E-F and video 1 lack quantification, scale bars.

      Added quantification.

      (2) Figure 2:

      (a) Figure 2A, Figure 2 - Supplement 3: What are the details of the hierarchical clustering? What metric was used to decide on the number of clusters? 

      We have used FANC packages to perform NBLAST clustering (Azevedo et al., 2024, Nature). We now include the full protocol in Methods.  The details are as follows:

      We performed hierarchical clustering on pairwise NBLAST similarity scores computed using navis.nblast_allbyall(). The resulting similarity matrix was symmetrized by averaging it with its transpose, and converted into a distance matrix using the transformation:

      $$\text{distance} = 1 - \text{similarity}$$

      This ensures that a perfect NBLAST match (similarity = 1) corresponds to a distance of 0.

      Clustering was performed using Ward’s linkage method (method='ward' in scipy.cluster.hierarchy.linkage), which minimizes the total within-cluster variance and is well-suited for identifying compact, morphologically coherent clusters.

      We did not predefine the number of clusters. Instead, clusters were visualized using a dendrogram, where branch coloring is based on the default behavior of scipy.cluster.hierarchy.dendrogram(). By default, this function applies a visual color threshold at 70% of the maximum linkage distance to highlight groups of similar elements. In our dataset, this corresponded to a linkage distance of approximately 1–1.5, which visually separated morphologically distinct neuron types (Figures 2A and Figure 2—figure supplement 3A). This threshold was used only as a visual aid and not as a hard cutoff for quantitative grouping.
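
      The steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the similarity matrix is synthetic (standing in for the output of navis.nblast_allbyall()), the neuron count and score ranges are invented, and fcluster is used only to expose which leaves would fall below the 70% visual threshold, whereas the response states that the threshold was not used as a hard cutoff. Note also that Ward's method formally assumes Euclidean distances, which 1 − similarity does not guarantee.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Synthetic, asymmetric "NBLAST-like" scores for 6 hypothetical neurons
# forming two morphological groups (indices 0-2 and 3-5).
rng = np.random.default_rng(0)
n = 6
scores = rng.uniform(0.0, 0.2, size=(n, n))  # low background similarity
scores[:3, :3] += 0.7                        # group A members resemble each other
scores[3:, 3:] += 0.7                        # group B members resemble each other
np.fill_diagonal(scores, 1.0)                # a neuron matches itself perfectly

# Symmetrize by averaging the matrix with its transpose.
sym = (scores + scores.T) / 2.0

# Convert similarity to distance: a perfect match maps to distance 0.
dist = 1.0 - sym

# Ward linkage expects a condensed (upper-triangle) distance vector.
Z = linkage(squareform(dist), method="ward")

# Default dendrogram coloring uses 70% of the maximum linkage distance.
threshold = 0.7 * Z[:, 2].max()

# Group leaves that merge below that height (illustration only).
labels = fcluster(Z, t=threshold, criterion="distance")
print("threshold:", threshold)
print("labels:", labels)
```

      With real NBLAST scores, the same threshold value could be passed to scipy.cluster.hierarchy.dendrogram as color_threshold to reproduce the default branch coloring explicitly.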

      The Methods section says that the classification "included left-right comparisons". What does that mean? What are the implications of the authors only having proofread a subset of neurons in T1L (see below)? 

      All adult leg motor neurons and 13A neurons (except one, 13A-ε) have neurite arbors restricted to the local, ipsilateral neuropil associated with the nearest leg.  Although 13B neurons have contralateral cell bodies, their projections are also entirely ipsilateral. The Tuthill Lab, with contributions from our group, focused proofreading efforts on the left front neuropil (T1L) in FANC. This is also where the motor neuron to muscle mapping has been most extensively done. We reconstructed/proofread the 13A and 13B neurons from the right side as well (T1R). We see similar clustering based on morphology and connectivity here as well.  

      Reconstructions lack scale bars and information on orientation (also in other figures), and the figures for the 13B analysis are not consistent with the main figure (e.g., labelling of clusters in panel B along x,y axes).

      Added.  

      (b) Figure 2B: Since the cosine similarity matrix's values should go from -1 to 1, why was a color map used ranging from 0 to 1? 

      While cosine similarity values can theoretically range from -1 to 1, in our case, all vector entries (i.e., synaptic weights) are non-negative, as they reflect the number of synapses from each 13A neuron to its downstream targets. This means all pairwise cosine similarities fall within the 0 to 1 range. 
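
      The argument can be checked with a short sketch. The synapse counts and neuron names below are hypothetical and chosen only for illustration; the point is that when every entry is non-negative, the dot product cannot be negative, so the cosine similarity is bounded by 0 and 1.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical synapse counts from three premotor neurons onto four motor neurons.
synapse_counts = {
    "neuron_a": [30, 12, 0, 0],
    "neuron_b": [25, 10, 2, 0],
    "neuron_c": [0, 0, 18, 40],
}

names = sorted(synapse_counts)
similarity = {
    (p, q): cosine_similarity(synapse_counts[p], synapse_counts[q])
    for p in names
    for q in names
}

for pair, value in similarity.items():
    print(pair, round(value, 3))
```

      Neurons with disjoint targets (neuron_a and neuron_c here) score exactly 0, and neurons with proportional target profiles approach 1, which is why a 0 to 1 color map covers the full range of such data.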

      Why are some neurons not included in this figure, like 1g, 2b, 3c-f (also in Supplement 3)?

      The few 13A neurons that don’t connect to motor neurons are not shown in the figure.

      (c) Figures 2C and D: the overlaid neurites are difficult to distinguish from one another. If the point here is to show that each 13A neuron class innervates specific motor neurons, then this is not the clearest way of doing that. For instance, the legend indicates that extensors are labelled in red, and that MNs with the highest number of synapses are highlighted in red - does that work? I could not figure out what was going on. On a more general point: if two cells are connected, does that not automatically mean that they should overlap in their projection patterns?

      We intended these panels to illustrate that 13A neurons synapse onto overlapping regions of motor neurons, thereby creating a spatial representation of muscle targets. However, we agree that overlapping multiple neurons in a single flat projection makes the figure difficult to interpret. We have therefore removed Figures 2C and 2D.

      While neurons must overlap at least somewhere if they form a synaptic connection, the amount of their neurites that overlap can vary, and more extensive overlap suggests more possible connections. Because the synapses are computationally predicted, examining the overlap helps to confirm that these predictions are consistent.

      While connected neurons must overlap locally at their synaptic sites, they do not necessarily show extensive or spatially structured overlap of their projections. For example, descending neurons or 13B interneurons may form synapses onto motor neurons without exhibiting a topographically organized projection pattern. In contrast, 13A→MN connectivity is organized in a structured manner: specialist 13A neurons align with the myotopic map of MN dendrites, whereas generalist 13As project more broadly and target MN groups across multiple leg segments, reflecting premotor synergies. This spatial organization—combining both joint-specific and multi-joint representations—was a key finding we wished to highlight, and we have revised the Results text to make this clearer.

      (d) Figure 2 - Figure Supplement 1: Why are these results presented in a way that goes against the morphological clustering results, but without explanation? Clusters 1-3 seem to overlap in their connectivity, and are presented in a mixed order. Why is this ignored? Are there similar data for 13B?

      The morphological clusters 1–3 do exhibit overlapping connectivity, but this is consistent with both their anatomical similarity and premotor connectivity. Specifically, Cluster 1 neurons connect to SE and TrE motor neurons, Cluster 2 connects only to TrE motor neurons, and Cluster 3 targets multiple motor pools, including SE and TrE (Figure 2—Figure Supplement 1B). This overlap is also reflected in the high pairwise cosine similarity among Clusters 1–3 shown in Figure 2B. Thus, their similar connectivity profiles align with their proximity in the NBLAST dendrogram.

      Regarding 13B neurons: there is no clear correlation between morphological clusters and downstream motor targets, as shown in the cosine similarity matrix (Figure 2—figure supplement 3). Moreover, even premotor 13B neurons that fall within the same morphological cluster do not connect to the same set of motor neurons (Figure 3—figure supplement 1F). For example, 13B-2a connects to LTrM and tergo-trochanteral MNs, 13B-2b connects to TiF MNs, and 13B-2g connects to Tr-F, TiE, and tergo-T MNs. Together, these results demonstrate that 13A neurons are spatially organized in a manner that correlates with their motor neuron targets, whereas 13B neurons lack such spatially structured organization, suggesting distinct principles of connectivity for these two inhibitory premotor populations.

      (e) Figure 2 - Figure Supplement 2: A comparison is made here between T1R (proofread) and T1L (largely not proofread). A general point is made here that there are "similar numbers of neurons and cluster divisions". First, no quantitative comparison is provided, making it difficult to judge whether this point is accurate. Second, glancing at the connectivity diagram, I can identify a large number of discrepancies. How should we interpret those? Can T1L be proofread? If this is too much of a burden, results should be presented with that as a clear caveat.

      The 13A and 13B neurons in the T1L hemisegment are fully proofread (Lesser et al, 2024, current publication); the T1R has been extensively analyzed as well. To compare the clustering and match identities of 13A and 13B neurons on the left and the right, we mirrored the 13A neurons from the left side and used NBLAST to match them with their counterparts on the right.

      While individual synaptic counts differ between sides in the FANC dataset (T1L generally showing higher counts), the number of 13A neurons, their clustering, and the overall patterns of connectivity are largely conserved between T1L and T1R.

      Importantly, each 13A cluster targets the same subset of motor neurons on both sides, preserving the overall pattern of connectivity. The largest divergence is seen in cluster 9, which shows more variable connectivity.  

      (f) Figure 2 - Figure Supplements 4 & 5: Why did the authors choose to present the particular cell type in Supplement 4?  Why are the cell types in Supplement 5 presented differently? Labels in Supplement 5 are illegible, but I imagine this is due to the format of the file presented to reviewers. Why are there no data for 13B?

      We chose to present the particular cell type in Supplement 4 because it corresponds to cell types targeted in the genetic lines used in our behavioral experiments. The 13A neuron shown is also one of the primary neurons in this lineage. This example illustrates its broader connectivity beyond the inhibitory and motor connections emphasized in the main figures.

      In Supplement 5, we initially aimed to highlight that the major downstream targets of 13A neurons are motor neurons. We have now removed this figure and instead state in the text that the major downstream targets are MNs.

      We did not present 13B neurons in the same format because their major downstream targets are not motor neurons. Instead, we emphasize their role in disinhibition and their connections to 13A neurons, as shown in a specific example in Figure 3—figure supplement 2. This 13B neuron also corresponds to a cell type targeted in the genetic line used in our behavioral experiments.

      (3) Figure 3:

      (a) Figure 3A: the collection of diagrams is not clear. I'd suggest one diagram with all connections included repeated for each subpanel, with each subpanel highlighting relevant connections and greying out irrelevant ones to the type of connection discussed. The nomenclature should be consistent between the figure and the legend (e.g., feedforward inhibition vs direct MN inhibition in A1).

      The intent of Figure 3A is to highlight individual circuit motifs by isolating them in separate panels. Including all connections in every sub panel would likely reduce clarity and make it harder to follow each motif. For completeness, we show the full set of connections together in Panel D. We updated the nomenclature as suggested. 

      (b) Figure 3B: Why was the medial joint discussed in detail? Do the thicknesses of the lines represent the number of synapses? There should be a legend, in that case. Why are the green edges all the same thickness? Are they indeed all connected with a similarly low number of synapses?

      We focused on the medial joint (femur-tibia joint) because it produces alternating flexion and extension of the tibia during both head sweeps and leg rubbing, which are the main grooming actions we analyzed. During head grooming, the tarsus is typically suspended in the air, so the cleaning action is primarily driven by tibial movements generated at the medial joint. 

      The thickness of the edges represents the number of synapses, and we have now clarified this in the legend. The green edges represent connections from 13B neurons, which were manually added to the graph, as described in the Methods section. 13B neurons are smaller than 13A neurons and form significantly fewer total downstream synapses. For example, the 13B neuron shown in Figure 3—figure supplement 2 makes a total of 155 synapses to all downstream neurons, with only 22 synapses to its most strongly connected partner, a 13A neuron. The relatively sparse connectivity of 13B neurons is shown in thinner or uniform edge weights in this graph.

      (c) Figure 3C: This is a potentially important panel, but the connections are difficult to interpret. Moreover, the text says, "This organizational motif applies to multiple joints within a leg as reciprocal connections between generalist 13A neurons suggest a role in coordinating multi-joint movements in synergy". To what extent is this a representative result? The figure also has an error in the legend (it is not labelled as 3C).

      This statement follows from the connectivity of these neurons. We have now added

      “Data for 13A-MN connections shown in Figure 2—figure supplement 1 I9, I6, I7, H9, H4, and H5; 13A-13A connections shown in Figure 3—figure supplement 1C.” to the figure legend.

      Thanks, we fixed the labelling error.

      (d) Figure 3 - Figure Supplement 1: Panel A is very difficult to interpret. Could a hierarchical diagram be used, or some other representation that is easier to digest?

      Panel A provides a consolidated view of all upstream and downstream interconnections among individual 13A and 13B neurons, allowing readers to quickly assess which neurons connect to which others without having to examine all subpanels. For a hierarchical representation, we have provided individual neuron-level diagrams in Panels C–F. 

      (e) Figure 3 - Figure Supplement 2: Why was this cell type selected?

      We selected this 13B because it is involved in the disinhibition of 13A neurons and is also present in the genetic line used for our behavioral experiments. 

      (f) Figure 3 - Figure Supplement 3: The diagram is confusing, with text aligned randomly, and colors lacking some explanations. Legend has odd formatting.

      The diagram layout and text alignment are designed to reflect the logical grouping of proprioceptors, 13A neurons, and motor neurons. To improve clarity, we have added node colors, included a written explanation for edge colors, and corrected the formatting of the figure legend.

      (4) Figure 4:

      (a) Figure 4A: This has no quantification, poor labelling, and odd units (centiseconds?). The colours between the left and right panels also don't align.

      We have fixed these issues.

      (b) Figure 4D-K: The ranges on the different axes are not the same (e.g., y axis on box plots, x axis on histograms). This obscures the fact that the differences between experimental and control, which in many cases are not big, are not consistent between the various controls. Moreover, the data that are plotted are, as far as I can tell (which is also to say: this should be explained), one value per frame. With imaging at 100 Hz, this means that an enormous number of values are used in each analysis. Very small differences can therefore be significant in a statistical sense. However, how different something is between conditions is important (effect size), and this is not taken into account in this manuscript. For instance, in 4D-J, the differences in the mean seem to be minimal. Should that not be taken into consideration? A case in point is panel D in Figure 4 - Figure Supplement 1: even with near identical distributions, a statistically significant difference is detected. The same applies to Figure 4 - Figure Supplements 1-3. Also, what do the boxes and whiskers in the box plots show, exactly?

      We have re-analyzed the data using linear mixed-effects models (LMMs), as suggested, and re-plotted all summary panels accordingly. In the updated plots, each dot represents the mean value for a single animal, and bar height represents the group mean. Whiskers indicate the 95% confidence interval around the group mean. This approach avoids inflating the sample size with per-frame values and provides a more accurate view of both variability and effect size.
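
      The per-animal aggregation behind such plots can be sketched as follows. This is an illustration only: the fly identifiers and values are invented, and the confidence interval uses a simple normal approximation on the per-animal means rather than the model-based interval an LMM would provide (with few animals, a t-based or model-based interval would be wider).

```python
import math
from collections import defaultdict
from statistics import NormalDist, mean, stdev

# Hypothetical per-frame measurements as (fly_id, value) pairs.
frames = [
    ("fly1", 1.0), ("fly1", 1.2), ("fly1", 1.1),
    ("fly2", 0.8), ("fly2", 0.9),
    ("fly3", 1.3), ("fly3", 1.5), ("fly3", 1.4), ("fly3", 1.4),
]


def per_animal_means(rows):
    """Collapse per-frame values to one mean per animal."""
    by_fly = defaultdict(list)
    for fly, value in rows:
        by_fly[fly].append(value)
    return {fly: mean(values) for fly, values in by_fly.items()}


def group_summary(animal_means, level=0.95):
    """Group mean and a normal-approximation CI across animals."""
    values = list(animal_means.values())
    n = len(values)
    centre = mean(values)
    sem = stdev(values) / math.sqrt(n)        # standard error across animals
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    return centre, centre - z * sem, centre + z * sem, n


animal_means = per_animal_means(frames)
centre, low, high, n = group_summary(animal_means)
print(animal_means)
print(f"group mean={centre:.3f}, CI=({low:.3f}, {high:.3f}), n={n}")
```

      The effective sample size here is the number of animals (3), not the number of frames (9), which is the distinction the reviewer raised.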

      (e) Figure 4 - Figure Supplement 1: There are 6 cells labelled in the split line; only 4 are shown in A3. Is cluster 6 a convincing match between EM and MCFO?

      We indeed report four neurons targeted by the split-GAL4 line in flip out clones. Generating these clones was technically challenging. In our sample (n=23), we may not have labeled all of the neurons.  Alternatively, two neurons may share very similar morphology and connectivity, making it difficult to tell them apart. We have added this clarification to the revised figure legend.

      It is interesting to see data on walking in panel K, but why were these analyses not done on any of the other manipulations? What defect produced the reduction in velocity, exactly? How should this be interpreted?

      Our primary focus was on grooming, but we did observe changes in walking, so we report illustrative examples. We initially included a panel showing increased walking velocity upon 13A activation, but this effect did not survive FDR correction and was removed in the revised version. We instead included data for 13A silencing which did not affect the frequency of joint movements during walking. However, spatial aspects of walking were affected: the distance between front leg tips during stance was reduced, indicating that although flies continued to walk rhythmically, the positioning of the legs was altered. This suggests that these specific 13A neurons may influence coordination and limb placement during walking without disrupting basic rhythmicity. As reviewer #2 also noted, dust may itself affect walking, so we have chosen not to further pursue this aspect in the current study.

      (f) Figure 4 - Figure Supplement 2: panel A is identical to Figure 1 - Figure Supplement 1C. This figure needs particular attention, both in content and style. Why present data on silencing these neurons in C-D, but not in E-F?

      We removed panel C from Figure 1 - Figure Supplement 1 and kept it as panel A of Figure 4 - Figure Supplement 2. Panels E-F also show silencing data, as does C’.

      (g) Figure 4 - Figure Supplement 3: In panel B, the authors should more clearly demonstrate the identity of 4b and 4a. Why present such a limited number of parameters in F and G?

      The cells shown in panel B represent the best matches we could identify between the light-level expression pattern and EM reconstructions. In panels F and G, we focused on bout duration, as leg position/inter-leg distance and frequency were already presented (in Figure 4). Together, these parameters demonstrate the role of 13B neurons in coordinating leg movements. Maximum angular velocity of proximal joints was not significantly affected and is therefore not included.

      (5) Figure 5:

      (a) Figure 5B: Lacks a quantification of the periodic nature of the behavior, which is required to compare to experimental conditions, e.g., in panel C.

      Added

      (b) Figure 5C: Requires a quantification; stimulus dynamics need to be incorporated.

      Added

      (c) Figure 5D: More information is needed. Does "Front leg" mean "leg rub", and "Head" "head sweep"? How do the dynamics in these behaviors compare to normal grooming behavior?

      Yes, head grooming refers to head sweeps and front leg grooming refers to leg rubs. We added the comparison, shown in Figure 5E-F.

      (d) Figure 5E: How should we interpret these plots? Do these look like normal grooming/walking?

      We have now included the comparison.

      (e) Figure 5F: Needs stats to compare it to 5B'.

      Done

      (6) Figure 6:

      (a) Figure 6A: I think the circuit used for the model is lacking the claw/hook extension - 13Bs connection. Any other changes? What is the rationale?

      The 13Bs upstream of these particular 13As do not receive significant connections from claw/hook neurons (there is only a single connection of ~5 synapses from one hook extension neuron to one 13B neuron, which we neglected for modeling purposes).

      (b) Figure 6B and C: Needs labels, legend; where is 13B?

      In the figure legend we now added: “The 13B neurons in this model do not connect to each other, receive excitatory input from the black box, and only project to the 13As (inhibitory). Their weight matrix, with only two values, is not shown.” We added the colorbar and corrected the color scheme.

      (c) Figure 6D-H: plots are very difficult to interpret. Units are also missing (is "Time" correct?).

      The units are indeed Time in frames (of simulation). We added this to the figure and the legend. We clarified the units of all variables in these panels. Corrected the color scheme and added their meaning to the legend text.

      (d) Figure 6I: I think the authors should consider presenting this in a different format.

      (e)  Figure 6 J and K (also Figure Supplement): lacks labels.

      We added labels for the three joints, increased the size of fonts for clarity, and added panel titles on the top.

      More specific suggestions:

      (1) It would be helpful if the titles of all figures reflected the take-away message, like in Figure 2.

      (2) "Their dendrites occupy a limited region of VNC, suggesting common pre-synaptic inputs" - all dendrites do, so I'd suggest rephrasing to be more precise.

      (3) "We propose that the broadly projecting primary neurons are generalists, likely born earlier, while specialists are mostly later-born secondary neurons" - this needs to be explained.

      We added the explanation.

      We propose that the broadly projecting primary neurons are generalists, likely born earlier, while specialists are mostly later-born secondary neurons. This is consistent with the known developmental sequence of hemilineages, where early-born primary neurons typically acquire larger arbors and integrate across broader premotor and motor targets, whereas later-born secondary neurons often have more spatially restricted projections and specialized roles[18,19,81,82,85]. Our morphological clustering supports this idea: generalist 13As have extensive axonal arbors spanning multiple leg segments, whereas specialist neurons are more narrowly tuned, connecting to a few MN targets within a segment. Thus, both their morphology and connectivity patterns align with the expectation from birth-order–dependent diversification within hemilineages.

      (4) "We did not find any correlation between the morphology of premotor 13B and motor connections" - this needs to be explained, as morphology constrains connectivity.

      We agree that morphology often constrains connectivity. However, in contrast to 13A neurons—where morphological clusters strongly predict MN connectivity—we did not observe such a correlation for 13B neurons. As we noted in our response to comment 2d, 13B neurons can form synapses onto MNs without exhibiting extensive or spatially structured overlap of their axonal projections with MN dendrites. This suggests that 13B→MN connectivity may be governed by more local, synapse-specific rules rather than by large-scale morphological positioning, in contrast to the spatially organized premotor map we observe for 13As.

      (5) "Based on their connectivity, we hypothesized that continuously activating them might reduce extension and increase flexion. Conversely, silencing them might increase extension and reduce flexion." - these clear predictions are then not directly addressed in the results that follow.

      We have now expanded this section.

      (6) "Thus, 13A neurons regulate both spatial and temporal aspects of leg coordination" "Together, 13A and 13B neurons contribute to both spatial and temporal coordination during grooming" - are these not intrinsically linked? This needs to be explained/justified.

      The spatial (leg positioning, joint angles) and temporal (frequency, rhythm) aspects are often linked, but they can be at least partially dissociated. This has been shown in other systems: for example, Argentine ants reduce walking speed on uneven terrain primarily by decreasing stride frequency while maintaining stride length (Clifton et al., 2020), and Drosophila larvae adjust crawling speed mainly by modulating cycle period rather than the amplitude of segmental contractions (Heckscher et al., 2012). Consistent with these findings, we observe that 13A neuron manipulation in dusted flies significantly alters leg positioning without changing the frequency of walking cycles. Thus, leg positioning can be perturbed while the number of extension–flexion cycles per second remains constant, supporting the view that spatial and temporal features are at least partially dissociable.

      (7) "Connectome data revealed that 13B neurons disinhibit motor pools (...) One of these 13B neurons is premotor, inhibiting both proximal and tibia extensor MN" - these are not possible at the same time.

      We show that the 13B population contains neurons with distinct connectivity motifs: some inhibit premotor 13A neurons (leading to disinhibition of motor pools), while others directly inhibit motor neurons. The split-GAL4 line we use labels three 13B neurons—two that inhibit the primary 13A neuron 13A-9d-γ (which targets proximal extensor and medial flexor MNs) and one that is premotor, directly inhibiting both proximal and tibia extensor MNs. Although these functions may appear mutually exclusive, their combined action could converge to a similar outcome: disinhibition of proximal extensor and medial flexor MNs while simultaneously inhibiting medial extensor MNs. This suggests that the labeled 13B neurons act in concert to bias the network toward a specific motor state rather than producing contradictory effects.

      (8) "we often observed that one leg became locked in flexion while the other leg remained extended, (indicating contribution from additional unmapped left right coordination circuits)." - Are these results not informative? I'd suggest the authors explain the implications of this more, rather than mentioning it within brackets like this.

      We agree with the reviewer that these results are highly informative. The observation that one leg can remain locked in flexion while the other stays extended suggests that additional left–right coordination circuits are engaged during grooming. This cross-talk is likely mediated by commissural interneurons downstream of inhibitory premotor neurons, which have not yet been systematically studied. Dissecting these circuits will require a dedicated project combining bilateral connectomic reconstruction, studying downstream targets of these commissural neurons, and functional interrogation, which is beyond the scope of the current study.

      (9) "Indeed, we observe that optogenetic activation of specific 13A and 13B neurons triggers grooming movements. We also discover that" - this phrasing suggests that this has already been shown.

      We replaced ‘indeed’ with “Consistent with this connectivity,”

      (10) "But the 13A circuitry can still produce rhythmic behavior even without those  sensory inputs (or when set to a constant value), although the legs become less coordinated." - what does this mean?

      We can train (fine-tune) the model without the descending inputs from the “black box” and the behavior will still be rhythmic, meaning that our modeled 13A circuit alone can produce rhythmic behavior, i.e. the rhythm is not generated externally (by the “black box”). We added Figure 7 to the MS and re-wrote this paragraph. In the revised manuscript we now state: “But the 13A circuitry can still produce rhythmic behavior even without those excitatory inputs from the “black box” (or when set to a constant value), although the legs become less coordinated (because they are “unaware” of each other’s position at any time). Indeed, when we refine the model (with the evolutionary training) without the “black box” (using instead a constant input of 0.1) the behavior is still rhythmic although somewhat less sustained (Figure 7). This confirms that the rhythmic activity and behavior can emerge from the modeled pre-motor circuitry itself, without a rhythmic input.”

      (11) "However, to explore the possibility of de novo emergent periodic behavior (without the direct periodic descending input) we instead varied the model's parameters around their empirically obtained values." - why do the authors not show how the model performs without tuning it first? What are the changes exactly that are happening as a result of the tuning? Are there specific connections that are lost? Do I interpret Figure 6B and C correctly when I think that some connections are lost (e.g., an SN-MN connection)? How does that compare to the text, which states that "their magnitudes must be at least 80% of the empirical weights"?

      Without the fine-tuning we do not get any behavior (the activation levels saturate). We therefore tolerate a 20% divergence from the empirically established weights and keep the signs the same. However, in the previous version we allowed the weights to decrease below 20% of the empirical weight (as long as the sign didn’t change) but not above (the signs were maintained and synapses were not added or removed). We thank the reviewer for observing this important discrepancy. In the current version we ensured that the model’s weights are bounded in both directions (tolerance = 0.2), but we also partially relaxed the constraint on adjacency matrix re-scaling (see the Methods section “The fine-tuning of the synaptic weights”, where we now clarify more precisely how the evolving model is fitted to the connectome constraints). We then re-ran the fine-tuning process. Figures 6B and C, as well as the other panels in the figure, are now corrected using the properly constrained model. We also applied a better color scheme (blue is now inhibitory and red is excitatory) for Figures 6B and C.
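
      One way to read the two-sided bound described above is sketched below. This is an assumption about the form of the constraint, not the implementation described in the Methods: the function name `clamp_to_connectome` is hypothetical, and the sketch ignores the partially relaxed adjacency-matrix re-scaling mentioned in the response. It keeps the empirical sign, holds the magnitude within the tolerance of the empirical value, and leaves absent synapses absent.

```python
def clamp_to_connectome(candidate, empirical, tolerance=0.2):
    """Bound a candidate weight around its empirical value (illustrative only).

    - The sign of the empirical weight is always kept.
    - The magnitude is held within (1 +/- tolerance) of the empirical magnitude.
    - A zero empirical weight stays zero, so no synapse is added.
    """
    if empirical == 0:
        return 0.0
    sign = 1.0 if empirical > 0 else -1.0
    low = (1 - tolerance) * abs(empirical)
    high = (1 + tolerance) * abs(empirical)
    # Project the candidate onto the empirical sign before clipping, so a
    # candidate with the wrong sign is pulled back to the lower bound.
    projected = candidate * sign
    return sign * min(max(projected, low), high)


# Hypothetical weights: an inhibitory synapse, an excitatory one, and no synapse.
inhibitory = clamp_to_connectome(candidate=-0.5, empirical=-1.0)
excitatory = clamp_to_connectome(candidate=1.5, empirical=1.0)
absent = clamp_to_connectome(candidate=0.3, empirical=0.0)
```

      In an evolutionary fine-tuning loop, a projection of this kind would typically be applied to every mutated weight before the candidate model is evaluated.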

      (12) "Interestingly, removing 13As-ii-MN connections to the three MNs (second row of the 13A → MN matrices in Figures 6B and C) does not have much effect on the leg movement (data not shown). It seems sufficient for this model to contract only one of the two antagonistic muscles per joint, while keeping the other at a steady state." - this is not clear.

      We repeated this test with the newly fine-tuned model and re-wrote the result as follows:  “...when we remove just the 13A-i-MN connections (which control the flexors of the right leg) we likewise get a complete paralysis of the leg. However, removing the 13A-ii-MN (which control the extensors of the right leg) has only a modest effect on the leg movement. So, we need the 13A-i neurons to inhibit the flexors (via motor neurons), but not extensors, in order to obtain rhythmic movements.”

      (13) The Discussion needs to reference the specific Results in all relevant sections.

      We have revised the discussion to explicitly reference the specific results.

      (14) "Flexors and extensors should alternate" - there are circumstances in which flexors and extensors should co-contract. For instance, co-contraction modulates joint stiffness for postural stability and helps generate forces required for fast movements.

      Thanks for pointing this out. We added “However, flexor–extensor co-contraction can also be functionally relevant, such as for modulating joint stiffness during postural stabilization or for generating large forces required for fast movements (Zakotnik et al., 2006; Günzel et al., 2022; Ogawa and Yamawaki 2025). Some generalist 13A neurons could facilitate co-contraction across different leg segments, but none target antagonistic motor neurons controlling the same joint. Therefore, co-contraction within a single joint would require the simultaneous activation of multiple 13A neurons.”

      (15) "While legs alternate between extension and flexion, they remain elevated during grooming. To maintain this posture, some MNs must be continuously activated while their antagonists are inactivated." - this is not necessarily correct. Small limbs, like those of Drosophila, can assume gravity-independent rest angles (10.1523/JNEUROSCI.5510-08.2009).

      We have added this point to the Discussion.

      (16) The discussion "Spatial Mapping of premotor neurons in the nerve cord" seems to me to be making obvious points, and does not need to be included.

      We have now revised this section to highlight the significance of 13A spatial organization, emphasizing premotor topographic mapping, multi-joint movement modules, and parallels to myotopic, proprioceptive, and vertebrate spinal maps.

      (17) Key point, albeit a small one: "Normal activity of these inhibitory neurons is critical for grooming" - the use of the word critical is problematic, and perhaps typical of the tone of the manuscript. These animals still groom when many of these neurons are manipulated, so what does "critical" really mean?

      In this instance, we have changed “critical” to “important”. We observed that silencing or activating a large number (>8) of 13A neurons, or a few 13A and 13B neurons together, completely abolishes grooming in dusted flies, as the flies become paralyzed or their limbs lock in extreme poses. Therefore, we think there is justification for the statement that these neurons are critical for grooming. These neurons may contribute to additional behaviors, and there may be partially redundant circuits that can also support grooming. We have revised the manuscript with the intention of clarifying both what we have observed and its limits.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, the authors endeavor to capture the dynamics of emotion-related brain networks. They employ slice-based fMRI combined with ICA on fMRI time series recorded while participants viewed a short movie clip. This approach allowed them to track the time course of four non-noise independent components at an effective 2s temporal resolution at the BOLD level. Notably, the authors report a temporal sequence from input to meaning, followed by response, and finally default mode networks, with significant overlap between stages. The use of ICA offers a data-driven method to identify large-scale networks involved in dynamic emotion processing. Overall, this paradigm and analytical strategy mark an important step forward in shifting affective neuroscience toward investigating temporal dynamics rather than relying solely on static network assessments.

      Strengths:

      (1) One of the main advantages highlighted is the improved temporal resolution offered by slice-based fMRI. However, the manuscript does not clearly explain how this method achieves a higher effective resolution, especially since the results still show a 2s temporal resolution, comparable to conventional methods. Clarification on this point would help readers understand the true benefit of the approach.

      (2) While combining ICA with task fMRI is an innovative approach to study the spatiotemporal dynamics of emotion processing, task fMRI typically relies on modeling the hemodynamic response (e.g., using FIR or IR models) to mitigate noise and collinearity across adjacent trials. The current analysis uses unmodeled BOLD time series, which might risk suffering from these issues.

      (3) The study's claims about emotion dynamics are derived from fMRI data, which are inherently affected by the hemodynamic delay. This delay means that the observed time courses may differ substantially from those obtained through electrophysiology or MEG studies. A discussion of how these fMRI-derived dynamics relate to, or complement, those findings is critical for the field to understand emotion dynamics.

      (4) Although using ICA to differentiate emotion elements is a convenient approach to tell a story, it may also be misleading. For instance, the observed delayed onset and peak latency of the 'response network' might imply that emotional responses occur much later than other stages, which contradicts many established emotion theories. Given the involvement of large-scale brain regions in this network, the underlying reasons for this delay could be very complex.

      Concerns and suggestions:

      However, I have several concerns regarding the specific presentation of temporal dynamics in the current manuscript and offer the following suggestions.

      (1) One selling point of this work regarding the advantages of testing temporal dynamics is the application of slice-based fMRI, which, in theory, should improve the temporal resolution of the fMRI time course. Improving fMRI temporal resolution is critical for a research project on this topic. The authors present a detailed schematic figure (Figure 2) to help readers understand it. However, I have difficulty understanding the benefits of this method in terms of temporal resolution.

      (a) In Figure 2A, if we examine a specific voxel in slice 2, the slice acquisitions occur at 0.7s, 2.7s, and 4.7s, which implies a temporal resolution of 2s rather than 0.7s. I am unclear on how the temporal resolution could be 0.7s for this specific voxel. I would prefer that the authors clarify this point further, as it would benefit readers who are not familiar with this technology.

      We very much appreciate these concerns as they highlight shortcomings in our explanation of the method. Please note that the main explanation of the method (and the comparison with expected-HRF and FIR-based methods) is given in Janssen et al. (2018, NeuroImage; see further explanations in Janssen et al., 2020). However, to make the current paper more self-contained, we provided further explanation of the Slice-Based method in Figure 2. With respect to the specific concern of the reviewer, in the hypothetical example used in Figure 2, the temporal resolution of the voxel on slice 2 is 0.7s because it combines the acquisitions from stimulus presentations across all trials. Specifically, given the study parameters outlined in Figures 2A and B, slice 2 samples the state of the brain exactly 0s after stimulus presentation on trial 1 (red color), 0.7s after stimulus presentation on trial 3 (green color), and 1.3s after stimulus presentation on trial 2 (yellow color). Thus, after combining data acquisitions across these three stimulus presentations, slice 2 has sampled the state of the brain at timepoints that are multiples of 0.7s starting from stimulus onset. This is why we say that the theoretical maximum temporal resolution is equal to the TR divided by the number of slices (in the example 2/3 ≈ 0.7s, in the actual experiment 3/39 ≈ 0.08s). In the current study we used temporal binning across timepoints to reduce the temporal resolution (to 2 seconds) and improve the tSNR.

      We have updated the legend of Figure 3 to more clearly explain this issue.
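
      The sampling argument for the hypothetical three-slice example can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the stimulus onset times are assumptions chosen to reproduce the delays quoted in the response (0s, 0.7s, 1.3s), and it assumes slices are acquired sequentially and evenly spaced within each TR, with a zero-based slice index.

```python
from fractions import Fraction


def max_temporal_resolution(tr, n_slices):
    """Theoretical maximum temporal resolution: TR divided by the number of slices."""
    return Fraction(tr) / n_slices


def delays_after_onset(tr, n_slices, slice_index, onsets):
    """Delay from each stimulus onset to the next acquisition of one slice.

    Assumes the slice is acquired at slice_index * TR / n_slices within
    every TR, starting from time zero.
    """
    tr = Fraction(tr)
    slice_offset = slice_index * tr / n_slices
    return sorted((slice_offset - Fraction(onset)) % tr for onset in onsets)


# Hypothetical onsets (in seconds) for three trials, TR = 2 s, 3 slices.
onsets = [Fraction(2, 3), Fraction(16, 3), Fraction(12)]
delays = delays_after_onset(tr=2, n_slices=3, slice_index=1, onsets=onsets)

example_resolution = max_temporal_resolution(2, 3)      # 2/3 s, about 0.7 s
experiment_resolution = max_temporal_resolution(3, 39)  # 1/13 s, about 0.08 s
```

      With these assumed onsets, the second slice samples the response at delays of 0, 2/3 and 4/3 s, that is, at multiples of TR divided by the number of slices, even though any single trial is sampled only once per TR.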

      (b) Even with the claim of an increased temporal resolution (0.7s), the actual data (Figure 3) still appears to have a 2s resolution. I wonder what specific benefit slice-based fMRI brings in terms of testing temporal dynamics, aside from correcting the temporal distortions that conventional fMRI exhibits.

      This is a good point. In the current experiment, the TR was 3s, but we extracted the fMRI signal at 2s temporal resolution, which means an increment of 33%. In this study we did not directly compare the impact of different temporal resolutions on the efficacy of detection of network dynamics. Indeed, we agree with the reviewer that there remain many unanswered questions about the issue of temporal resolution of the extracted fMRI signal and the impact on the ability to detect fMRI network dynamics. We think that questions such as those posed by the reviewer should be addressed in future studies that are directly focused on this issue. We have updated our discussion section (page 21-22) to more clearly reflect this point of view.

      (2) In task-fMRI, the hemodynamic response is usually estimated using a specific model (e.g., FIR, IR model; see Lindquist et al., 2009). These models are effective at reducing noise and collinearity across adjacent trials. The current method appears to be conducted on unmodeled BOLD time series.

      (a) I am wondering how the authors avoid the issues that are typically addressed by these HRF modeling approaches. For example, if we examine the baseline period (say, -4 to 0s relative to stimulus onset), the activation of most networks does not remain around zero, which could be due to delayed influences from the previous trial. This suggests that the current time course may not be completely accurate.

      We thank the reviewer for highlighting this issue. Let us start by reiterating what we stated above: there are many issues related to BOLD signal extraction and fMRI network discovery in task-based fMRI that remain poorly understood and should be addressed in future work. Such work should explore, for example, the impact of using a FIR vs Slice-based method on the discovery of networks in task-fMRI. These studies should also investigate the impact of different types of baselines and baseline durations on the extraction of the BOLD signal and network discovery. For the present purposes, our goal was not to introduce a new technique of fMRI signal extraction, but to show that the slice-based technique, in combination with ICA, can be used to study the brain’s network dynamics in an emotional task. In other words, while we clearly appreciate the reviewer’s concerns and have several other studies underway that directly address these concerns, we believe that such concerns are better addressed in independent research. See our discussion on page 21-22 that addresses this issue.

      (b) A related question: if the authors take the spatial map of a certain network and apply a modeling approach to estimate a time series within that network, would the results be similar to the current ICA time series?

      Interesting point. Typically in a modeling approach the expected HRF (e.g., the double gamma function) is fitted to the fMRI data. Importantly, this approach produces static maps of the fit between the expected HRF and the data. By contrast, model-free approaches such as FIR or slice-based methods extract the fMRI signal directly from the data without making a priori assumptions about the expected shape of the signal. These approaches do not produce static maps but instead are capable of extracting the whole-brain dynamics during the execution of a task (event-related dynamics). These data-driven approaches (FIR, Slice-Based, etc.) are therefore a necessary first step in the analysis of the dynamics of brain activity during a task. The subsequent step involves the analysis of these complex event-related brain dynamics. In the current paper we suggest that a straightforward way to do this is to use ICA, which produces spatial maps of voxels with similar time courses and hence yields insights into the temporal dynamics of whole-brain fMRI networks. As we mentioned above, combining ICA with a high temporal resolution data-driven signal is new, and there are many avenues for research in this burgeoning field.

      (3) Human emotion should be inherently fast to ensure survival, as shown in many electrophysiology and MEG studies. For example, the dynamics of a fearful face can occur within 100ms in subcortical regions (Méndez-Bértolo et al., 2016), and general valence and arousal effects can occur as early as 200ms (e.g., Grootswagers et al., 2020; Bo et al., 2022). In contrast, the time-to-peak or onset timing in the BOLD time series spans a much larger time range due to the hemodynamic delay. fMRI findings indeed add spatial precision to our understanding of the temporal dynamics of emotion, but could the authors comment on how the current temporal dynamics supplement those electrophysiology studies that operate on much finer temporal scales?

      We really like this point. One way that EEG and fMRI are typically discussed is that these two approaches are said to be complementary. While EEG is able to provide information on temporal dynamics, but not spatial localization of brain activity, fMRI cannot provide information on the temporal dynamics, but can provide insights into spatial localization. Our study most directly challenges the latter part of this statement. We believe that by using tasks that highlight “slow” cognition, fMRI can be used to reveal not only spatial but also temporal information of brain activity. The movie task that we used presumably relies on a kind of “slow” cognition that takes place on longer time scales (e.g., the construction of the meaning of the scene). Our results show that with such tasks, whole-brain networks with different temporal dynamics can be separated by ICA, at odds with the claim that fMRI is only good for spatial information. One avenue of future research would be to attempt such “slow” tasks directly with EEG and try to find the electrical correlates of the networks detected in the current study.

      We hope to have answered the concerns of the reviewer.

      (4) The response network shows activation as late as 15 to 20s, which is surprising. Could the authors discuss further why it takes so long for participants to generate an emotional response in the brain?

      We thank the reviewer for this question. Our study design was such that there was an initial movie clip that lasted 12.5s, which was then followed by a two-alternative forced-choice decision task (including a button press, 2.5s), and finally followed by a 10s rest period. We extracted the fMRI signal across this entire 25s period (actually 28s because we also took into account some uncertainty in BOLD signal duration). Network discovery using ICA then showed various networks with distinct time courses (across the 25s period), including one network (IC2 response) that showed a peak around 21s (see Figure 3). Given the properties of the spatial map (e.g., activity in primary motor areas, Figure 4), as well as the temporal properties of its time course (e.g., peak close to the response stage of the task), we interpreted this network as related to generating the manual response in the two-alternative forced-choice decision task. Further analyses showed that this aspect of the task (e.g., deciding the emotion of the character in the movie clip) was also sensitive to the emotional content of the earlier movie clip (Figures 6 and 7).

      We have further clarified this aspect of our results (see pages 16-17). We thank the reviewer for pointing this out.

      (5) Related to 4. In many theories, the emotion processing stages-including perception, valuation, and response-are usually considered iterative processes (e.g., Gross, 2015), especially in real-world scenarios. The advantage of the current paradigm is that it incorporates more dynamic elements of emotional stimuli and is closer to reality. Therefore, one might expect some degree of dynamic fluctuation within the tested brain networks to reflect those potential iterative processes (input, meaning, response). However, we still do not observe much brain dynamics in the data. In Figure 5, after the initial onset, most network activations remain sustained for an extended period of time. Does this suggest that emotion processing is less dynamic in the brain than we thought, or could it be related to limitations in temporal resolution? It could also be that the dynamics of each individual trial differ, and averaging them eliminates these variations. I would like to hear the authors' comments on this topic.

      We thank the reviewer for this interesting question. We are assuming the reviewer is referring to Figure 3 and not Figure 5. Indeed, what Figure 3 shows is the average time course of each detected network across all subjects and trial types. This figure therefore does not directly show the difference in dynamics between the different emotions. However, as we show in further analyses that examine how emotion modulates specific aspects of the fMRI signal dynamics (time to peak, peak value, duration) of different networks, there are differences in the dynamics of these networks depending on the emotion (Figures 6 and 7). Thus, our results show that different emotions evoked by movie clips differ in their dynamics. Generalizing this to the claim that different emotions have different brain dynamics in general is not straightforward and would require further study (probably using other tasks and other emotions). We have updated the discussion section as well as the caption of Figure 3 to better explain this issue (see also comments by reviewer 2).
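
      To make the three temporal metrics named above concrete, the sketch below computes them for a single network time course. It is a minimal illustration under stated assumptions rather than the analysis used in the paper: the function name `temporal_metrics` is hypothetical, the time course is invented, the peak is assumed to be positive, and duration is assumed to mean the span over which the signal stays at or above half of its peak value.

```python
def temporal_metrics(times, signal):
    """Time to peak, peak value and duration for one time course.

    Assumption: duration is the time between the first and last samples
    at or above half of the peak value.
    """
    peak_index = max(range(len(signal)), key=lambda i: signal[i])
    peak_value = signal[peak_index]
    half_peak = peak_value / 2
    above = [t for t, s in zip(times, signal) if s >= half_peak]
    duration = above[-1] - above[0] if above else 0
    return {
        "time_to_peak": times[peak_index],
        "peak_value": peak_value,
        "duration": duration,
    }


# Hypothetical time course sampled in 2 s bins over a 28 s window.
times = list(range(0, 28, 2))
signal = [0.0, 0.1, 0.4, 0.8, 1.0, 0.9, 0.6, 0.5, 0.3, 0.1, 0.0, 0.0, 0.0, 0.0]
metrics = temporal_metrics(times, signal)
```

      Computing these values per subject and per emotion condition, and then comparing them across conditions, is one way such metrics could expose differences that an average time course hides.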

      (6) The activation of the default mode network (DMN), although relatively late, is very interesting. Generally, one would expect a deactivation of this network during ongoing external stimulation. Could this suggest that participants are mind-wandering during the later portion of the task?

      Very good point. Indeed this is in line with our interpretation. The late activity of the default mode network could reflect some further processing of the previous emotional experience. More work is required to clarify this further in terms of reflective, mind-wandering or regulatory processing. We have updated our discussion section to better highlight this issue (see page 19).

      We thank the reviewer for their really insightful comments and suggestions!

      Reviewer #2 (Public review):

      Summary:

      This manuscript examined the neural correlates of the temporal-spatial dynamics of emotional processing while participants were watching short movie clips (each 12.5 s long) from the movie "Forrest Gump". Participants not only watched each film clip, but also gave emotional responses, followed by a brief resting period. Employing fMRI to track the BOLD responses during these stages of emotional processing, the authors found four large-scale brain networks (labeled as IC0,1,2,4) were differentially involved in emotional processing. Overall, this work provides valuable information on the neurodynamics of emotional processing.

      Strengths:

      This work employs a naturalistic movie watching paradigm to elicit emotional experiences. The authors used a slice-based fMRI method to examine the temporal dynamics of BOLD responses. Compared to previous emotional research that uses static images, this work provides some new data and insights into how the brain supports emotional processing from a temporal dynamics view.

      Thank you!

      Weaknesses:

      Some major conclusions are unwarranted and do not have relevant evidence. For example, the authors seemed to interpret some neuroimaging results to be related to emotion regulation. However, there were no explicit instructions about emotional regulation, and there was no evidence suggesting participants regulated their emotions. How to best interpret the corresponding results thus requires caution.

      We thank the reviewer for pointing this out. We have updated the limitations section of our Discussion section (page 20) to better qualify our interpretations.

      Relatedly, the authors argued that "In turn, our findings underscore the utility of examining temporal metrics to capture subtle nuances of emotional processing that may remain undetectable using standard static analyses." While this sentence makes sense and is reasonable, it remains unclear how the results here support this argument. In particular, there were only three emotional categories: sad, happy, and fear. These three emotional categories are highly different from each other. Thus, how exactly the temporal metrics captured the "subtle nuances of emotional processing" shall be further elaborated.

      This is an important point. We also discuss this limitation in the “limitations” section of our Discussion (page 20). We again thank the reviewer for pointing this out.

      The writing also contained many claims about the study's clinical utility. However, the authors did not develop their reasoning nor elaborate on the clinical relevance. While examining emotional processing certainly could have clinical relevance, please unpack the argument and provide more information on how the results obtained here can be used in clinical settings.

      We very much appreciate this comment. Note that we did not intend to motivate our study directly from a clinical perspective (because we did not test our approach on a clinical population). Instead, our point is that some researchers (e.g., Kuppens & Verduyn 2017; Waugh et al., 2015) have conceptualized emotional disorders as frequently having a temporal component (e.g., dwelling abnormally long on negative thoughts) and that our technique could be used to examine whether the temporal dynamics of networks are affected in such disorders. However, as we pointed out, this should be verified in future work. We have updated our final paragraph (page 22) to more clearly highlight this issue. We thank the reviewer for pointing this out.

      Importantly, how are the temporal dynamics of BOLD responses and subjective feelings related? The authors showed that "the time-to-peak differences in IC2 ("response") align closely with response latency results, with sad trials showing faster response latencies and earlier peak times". Does this mean that people typically experience sad feelings faster than happy or fear? Yet this is inconsistent with ideas such that fear detection is often rapid, while sadness can be more sustained. Understandably, the study uses movie clips, which can be very different from previous work, mostly using static images (e.g., a fearful or a sad face). But the authors shall explicitly discuss what these temporal dynamics mean for subjective feelings.

      Excellent point! Our results indeed showed that sad trials had faster reaction times compared to happy and fearful trials, and that this result was reflected in the extracted time-to-peak measures of the fMRI data (see Figure 8D). To us, this primarily demonstrates, as shown in other studies (e.g., Menon et al., 1997), that gross differences detected in behavioral measures can be directly recovered from temporal measures in fMRI data, which is not trivial. However, we do not think we are allowed to make interpretations of the sort suggested by the reviewer (and to be clear: we do not make such interpretations in the paper). Specifically, the faster reaction times on sad trials likely reflect some audio/visual aspect of the movie clips that results in faster reaction times, rather than a generalized temporal difference in the subjective experience of sad vs happy/fearful emotions. Presumably the speed with which emotional stimuli influence the brain depends on the context. Perhaps future studies that examine emotional responses while controlling for the audio/visual experience could shed further light on this issue. We have updated the discussion section to address the reviewer’s concern.

      We thank the reviewer for the interesting points which have certainly improved our manuscript!

      Reviewer #1 (Recommendations for the authors):

      Minor:

      (1) Please add the unit to the y-axis in Figure 7, if applicable.

      Done. We have added units.

      (2) Adding a note in the legend of Figure 3 regarding the meaning of the amplitude of the timeseries would be helpful.

      Done. We have added a sentence further explaining the meaning of the timecourse fluctuations.

      Related references:

      (1) Lindquist, M. A., Loh, J. M., Atlas, L. Y., & Wager, T. D. (2009). Modeling the hemodynamic response function in fMRI: efficiency, bias, and mis-modeling. Neuroimage, 45(1), S187-S198.

      (2) Méndez-Bértolo, C., Moratti, S., Toledano, R., Lopez-Sosa, F., Martínez-Alvarez, R., Mah, Y. H., ... & Strange, B. A. (2016). A fast pathway for fear in human amygdala. Nature neuroscience, 19(8), 1041-1049.

      (3) Bo, K., Cui, L., Yin, S., Hu, Z., Hong, X., Kim, S., ... & Ding, M. (2022). Decoding the temporal dynamics of affective scene processing. NeuroImage, 261, 119532.

      (4) Grootswagers, T., Kennedy, B. L., Most, S. B., & Carlson, T. A. (2020). Neural signatures of dynamic emotion constructs in the human brain. Neuropsychologia, 145, 106535.

      (5) Gross, J. J. (2015). The extended process model of emotion regulation: Elaborations, applications, and future directions. Psychological inquiry, 26(1), 130-137.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      “Ejdrup, Gether, and colleagues present a sophisticated simulation of dopamine (DA) dynamics based on a substantial volume of striatum with many DA release sites. The key observation is that a reduced DA uptake rate in the ventral striatum (VS) compared to the dorsal striatum (DS) can produce an appreciable "tonic" level of DA in VS and not DS. In both areas they find that a large proportion of D2 receptors are occupied at "baseline"; this proportion increases with simulated DA cell phasic bursts but has little sensitivity to simulated DA cell pauses. They also examine, in a separate model, the effects of clustering dopamine transporters (DAT) into nanoclusters and say this may be a way of regulating tonic DA levels in VS. I found this work of interest and I think it will be useful to the community. At the same time, there are a number of weaknesses that should be addressed, and the authors need to more carefully explain how their conclusions are distinct from those based on prior models.

      We appreciate that the reviewer finds our work interesting and useful to the community. However, we acknowledge that it is important to discuss how our conclusions differ from those reached with previous models. The original version of the manuscript already discussed our findings in relation to earlier models; this discussion has now been expanded. In particular, we would argue that our simulations, which included updated parameters, represent more accurate portrayals of in vivo conditions, as is now specifically stated in lines 466-487. Compared to previous models, our data highlight the critical importance of different DAT expression across striatal subregions as a key determinant of differential DA dynamics and differential tonic levels in DS compared to VS. We find that these conclusions are already highlighted in the Abstract and Discussion.

      (1) The conclusion that even an unrealistically long (1s) and complete pause in DA firing has little effect on DA receptor occupancy is potentially important. The ability to respond to DA pauses has been thought to be a key reason why D2 receptors (may) have high affinity. This simulation instead finds evidence that DA pauses may be useless. This result should be highlighted in the abstract and discussed more.”

      This is an interesting point. We have accordingly carried out new simulations across a range of D2R affinities to assess how this affects the finding that even a long pause in DA firing has little effect on D2 receptor occupancy. Interestingly, the simulations demonstrate that this finding is indeed robust across an order of magnitude in affinity, although the sensitivity to a one-second pause goes up as the affinity reaches 20 nM. The data are shown in a revised Figure S1H. For a description of the results, please see revised text lines 195-197. The topic is now mentioned in the abstract and further commented on in the Discussion in lines 500-504.

      “(2) The claim of "DAT nanoclustering as a way to shape tonic levels of DA" is not very well supported at present. None of the panels in Figure 4 simply show mean steady-state extracellular DA as a function of clustering. Perhaps mean DA is not the relevant measure, but then the authors need to better define what is and why. This issue may be linked to the fact that DAT clustering is modeled separately (Figure 4) to the main model of DA dynamics (Figures 1-3) which per the Methods assumes even distribution of uptake. Presumably, this is because the spatial resolution of the main model is too coarse to incorporate DAT nanoclusters, but it is still a limitation.”

      We agree with the reviewer that steady-state extracellular DA as a function of DAT clustering is a useful measure. We have therefore simulated the effects of different nanoclustering scenarios on this measure. We found that the extracellular concentrations went from approximately 15 nM for unclustered DAT to more than 30 nM in the densest clustering scenario. These results are shown in revised Figure 4F and described in the revised text in lines 337-349.

      Further, we fully agree that the spatial resolution of the main model is a limitation and, ideally, that the nanoclustering should be combined with the large-scale release simulations. Unfortunately, this would require many orders of magnitude more computational power than currently available.

      “As it stands it is convincing (but too obvious) that DAT clustering will increase DA away from clusters, while decreasing it near clusters. I.e. clustering increases heterogeneity, but how this could be relevant to striatal function is not made clear, especially given the different spatial scales of the models.”

      Thank you for raising this important point. While it is true that DAT clustering increases heterogeneity in DA distribution at the microscopic level, the diffusion rate is, in most circumstances, too fast to permit concentration differences on a spatial scale relevant for nearby receptors. Accordingly, we propose that the primary effect of DAT nanoclustering is to decrease the overall uptake capacity, which in turn increases overall extracellular DA concentrations. Thus, homogeneous changes in extracellular DA concentrations can arise from regulating heterogenous DAT distribution. An exception to this would be the circumstance where the receptor is located directly next to a dense cluster – i.e. within nanometers. In such cases, local DA availability may be more directly influenced by clustering effects. Please see revised text in lines 354-362 for discussion of this matter.  

      “(3) I question how reasonable the "12/40" simulated burst firing condition is, since to my knowledge this is well outside the range of firing patterns actually observed for dopamine cells. It would be better to base key results on more realistic values (in particular, fewer action potentials than 12).”

      We fully agree that this typically is outside the physiological range. The values are included in addition to more realistic values (3/10 and 6/20) to showcase what extreme situations would look like. 

      “(4) There is a need to better explain why "focality" is important, and justify the measure used.”

      We have expanded on the intention of this measure in the revised manuscript (please see lines 266-268).  Thank you for pointing out this lack of clarification.  

      “(5) Line 191: " D1 receptors (-Rs) were assumed to have a half maximal effective concentration (EC50) of 1000 nM" The assumptions about receptor EC50s are critical to this work and need to be better justified. It would also be good to show what happens if these EC50 numbers are changed by an order of magnitude up or down.”

      We agree that these assumptions are critical. Simulations of effective off-rates across a range of EC50 values have now been included in the revised version in Figure 1I and are referred to in lines 188-189.
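
      To make the role of the EC50 assumption concrete, the following minimal sketch tabulates equilibrium occupancy when each EC50 is shifted an order of magnitude up or down. It is not the code used for the manuscript: it assumes simple one-site binding (Hill coefficient of 1), treats the ~7 nM D2R affinity quoted elsewhere in this response as an EC50, and uses the approximate 15 nM and 30 nM extracellular DA levels mentioned for Figure 4F as inputs. The names `equilibrium_occupancy`, `occupancy_table`, and `SCALES` are hypothetical.

```python
# Illustrative sketch only: one-site equilibrium binding, not the manuscript's model code.

EC50_NM = {"D1R": 1000.0, "D2R": 7.0}  # values quoted in this response
SCALES = (0.1, 1.0, 10.0)              # hypothetical order-of-magnitude sweep


def equilibrium_occupancy(da_nm: float, ec50_nm: float) -> float:
    """Bound fraction at equilibrium, assuming a Hill coefficient of 1."""
    return da_nm / (da_nm + ec50_nm)


def occupancy_table(da_nm: float) -> dict:
    """Occupancy for each receptor and each EC50 scaling at one DA level."""
    return {
        (name, scale): equilibrium_occupancy(da_nm, ec50 * scale)
        for name, ec50 in EC50_NM.items()
        for scale in SCALES
    }


for da_nm in (15.0, 30.0):  # nM, the approximate levels reported for Figure 4F
    for (name, scale), occ in occupancy_table(da_nm).items():
        print(f"[DA]={da_nm:5.1f} nM  {name}  EC50 x{scale:<4}  occupancy={occ:.3f}")
```

      Under these assumptions, the unscaled D2R entry at 15 nM is 15/22 (about 0.68), whereas the unscaled D1R entry is 15/1015 (about 0.015). This equilibrium view ignores kinetics, so it says nothing about how quickly occupancy follows a burst or a pause.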

      “(6) Line 459: "we based our receptor kinetics on newer pharmacological experiments in live cells (Agren et al., 2021) and properties of the recently developed DA receptor-based biosensors (Labouesse & Patriarchi, 2021). Indeed, these sensors are mutated receptors but only on the intracellular domains with no changes of the binding site (Labouesse & Patriarchi, 2021)" 

      This argument is diminished by the observation that different sensors based on the same binding site have different affinities (e.g. in Patriarchi et al. 2018, dLight1.1 has Kd of 330nM while dlight1.3b has Kd of 1600nM).”

      We sincerely thank the reviewer for highlighting this important point. We fully recognize the fundamental importance of absolute and relative DA receptor kinetics for modeling DA actions and acknowledge that differences in affinity estimates from sensor-based measurements highlight the inherent uncertainty in selecting receptor kinetics parameters. While we have based our modeling decisions on what we believe to be the most relevant available data, we acknowledge that the choice of receptor kinetics is a topic of ongoing debate. Importantly, we are making our model available to the research community, allowing others to test their own estimates of receptor kinetics and assess their impact on the model’s behavior. In the revised manuscript, we have further elaborated the rationale behind our parameter choices. Please see revised text in lines 177-178 of the Results section and in lines 481-486 of the Discussion.

      “(7) Estimates of Vmax for DA uptake are entirely based on prior fast-scan voltammetry studies (Table S2). But FSCV likely produces distorted measures of uptake rate due to the kinetics of DA adsorption and release on the carbon fiber surface.”

      We fully agree that this is a limitation of FSCV. However, most of the cited papers attempt to correct for this by way of fitting the output to a multi-parameter model for DA kinetics. If newer literature brings the Vmax values estimated into question, we have made the model publicly available to rerun the simulations with new parameters.

      “(8) It is assumed that tortuosity is the same in DS and VS - is this a safe assumption?”

      The original paper cited does not specify which region the values are measured in. However, a separate paper estimates the rat cerebellum has a comparable tortuosity index (Nicholson and Phillips, J Physiol. 1981), suggesting it may be a rather uniform value across brain regions. This is now mentioned in lines 98-99 and the reference has been included. 

      “(9) More discussion is needed about how the conclusions derived from this more elaborate model of DA dynamics are the same, and different, to conclusions drawn from prior relevant models (including those cited, e.g. from Hunger et al. 2020, etc)”.

      As part of our revision, we have expanded the discussion of our findings in the context of previous models in lines 466-487 of the manuscript.

      Reviewer #2 (Public review): 

      The work presents a model of dopamine release, diffusion, and reuptake in a small (100 micrometers^2 maximum) volume of striatum. This extends previous work by this group and others by comparing dopamine dynamics in the dorsal and ventral striatum and by using a model of immediate dopamine-receptor activation inferred from recent dopamine sensor data. From their simulations, the authors report two main conclusions. The first is that the dorsal striatum does not appear to have a sustained, relatively uniform concentration of dopamine driven by the constant 4Hz firing of dopamine neurons; rather that constant firing appears to create hotspots of dopamine. By contrast, the lower density of release sites and lower rate of reuptake in the ventral striatum creates a sustained concentration of dopamine. The second main conclusion is that D1 receptor (D1R) activation is able to track dopamine concentration changes at short delays but D2 receptor activation cannot. 

      The simulations of the dorsal striatum will be of interest to dopamine aficionados as they throw some doubt on the classic model of "tonic" and "phasic" dopamine actions, further show the disconnect between dopamine neuron firing and consequent release, and thus raise issues for the reward-prediction error theory of dopamine. 

      There is some careful work here checking the dependence of results on the spatial volume and its discretisation. The simulations of dopamine concentration are checked over a range of values for key parameters. The model is good, the simulations are well done, and the evidence for robust differences between dorsal and ventral striatum dopamine concentration is good. 

      However, the main weakness here is that neither of the main conclusions is strongly evidenced as yet. The claim that the dorsal striatum has no "tonic" dopamine concentration is based on the single example simulation of Figure 1 not the extensive simulations over a range of parameters. Some of those later simulations seem to show that the dorsal striatum can have a "tonic" dopamine concentration, though the measurement of this is indirect. It is not clear why the reader should believe the example simulation over those in the robustness checks, for example by identifying which range of parameter values is more realistic.”

      We appreciate that the reviewer finds our work interesting and carefully performed. The reviewer is correct that DA dynamics, including the presence and level of tonic DA, are parameter-dependent in both the dorsal striatum (DS) and ventral striatum (VS). Indeed, our simulations across a broad range of biological parameters were intended to help readers understand how such variation would impact the model’s outcomes, particularly since many of the parameters remain contested. Naturally, altering these parameters results in changes to the observed dynamics. However, to derive possible conclusions, we selected a subset of parameters that we believe best reflect the physiological conditions, as elaborated in the manuscript. In response to the reviewer’s comment, we have placed greater emphasis on clarifying which parameter values we believe best reflect the physiological conditions (see lines 155-157 and 254-255). Additionally, we have underscored that the distinction between tonic and non-tonic states is not a binary outcome but a parameter-dependent continuum (lines 222-225), one that our model now allows researchers to explore systematically. Finally, we have highlighted how our simulations across parameter space not only capture this continuum but also identify the regimes that produce the most heterogeneous DA signaling, both within and across striatal regions (lines 266-268).

      “The claim that D1Rs can track rapid changes in dopamine is not well supported. It is based on a single simulation in Figure 1 (DS) and 2 (VS) by visual inspection of simulated dopamine concentration traces - and even then it is unclear that D1Rs actually track dynamics because they clearly do not track rapid changes in dopamine that are almost as large as those driven by bursts (cf Figure 1i).”

      We would like to draw attention to Figure 1I, where the claim that D1Rs track rapid changes is supported in more depth (Figure S1 in the original manuscript, moved to a main figure in the revised manuscript to highlight this). According to this figure, upon coordinated burst firing, D1R occupancy rapidly increased as diffusion no longer equilibrated the extracellular concentrations on a timescale faster than the receptors, and D1R occupancy closely tracked extracellular DA with a delay on the order of tens of milliseconds. Note that the increases in [DA] from uncoordinated stochastic release events during tonic firing in Figure 1H are too brief to drive D1 signaling, as the released DA diffuses into the remaining extracellular space on a timescale of 1-5 ms. This is faster than the receptors' response rate and does not lead to any downstream signaling according to our simulations. This means D1 kinetics are rapid enough to track coordinated signaling on a timescale of ~50 ms and slower, but not fast enough to respond to individual release events from tonic activity.

      “The claim also depends on two things that are poorly explained. First, the model of binding here is missing from the text. It seems to be a simple bound-fraction model, simulating a single D1 or D2 receptor. It is unclear whether more complex models would show the same thing.”

      We realize that this is not made clear in the methods and, accordingly, we have updated the method section to elaborate on how we model receptor binding. The model simulates occupied fraction of D1R and D2R in every single voxel of the simulation space. Please see lines 546-555.
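
      For readers who want a concrete picture of a bound-fraction model, a minimal single-voxel sketch is given below. It is an illustration under stated assumptions rather than the implementation described in the Methods: it integrates dB/dt = k_on·C·(1 − B) − k_off·B with forward Euler, derives k_on from the quoted affinities as k_off/Kd, and uses placeholder off-rates (20 s⁻¹ and 0.2 s⁻¹) and a made-up DA trace chosen purely for illustration. The function name `simulate_occupancy` is hypothetical.

```python
import numpy as np


def simulate_occupancy(da_nm, dt_s, kd_nm, k_off_per_s, b0=0.0):
    """Forward-Euler integration of dB/dt = k_on*C*(1 - B) - k_off*B.

    k_on is derived from the affinity as k_off / Kd, so the steady state
    is C / (C + Kd). Stable only while dt_s * (k_on*C + k_off) is small.
    """
    k_on = k_off_per_s / kd_nm  # units: 1 / (nM * s)
    occupancy = np.empty(len(da_nm))
    b = b0
    for i, c in enumerate(da_nm):
        b += dt_s * (k_on * c * (1.0 - b) - k_off_per_s * b)
        occupancy[i] = b
    return occupancy


# Hypothetical input for one voxel: 20 nM baseline with a 0.3 s step to 500 nM.
dt_s = 1e-3
t = np.arange(0.0, 3.0, dt_s)
da_nm = np.where((t >= 1.0) & (t < 1.3), 500.0, 20.0)

# Kd values are those quoted in this response; the k_off values are placeholders.
fast = simulate_occupancy(da_nm, dt_s, kd_nm=1000.0, k_off_per_s=20.0, b0=20.0 / 1020.0)
slow = simulate_occupancy(da_nm, dt_s, kd_nm=7.0, k_off_per_s=0.2, b0=20.0 / 27.0)

i_end, i_late = np.searchsorted(t, [1.3, 1.8])
print("fast-off receptor, end of step vs 0.5 s later:", fast[i_end], fast[i_late])
print("slow-off receptor, end of step vs 0.5 s later:", slow[i_end], slow[i_late])
```

      In a voxel-based simulation the same update would be applied to every voxel at each time step, with the local DA concentration as input. A more complex binding scheme (for example, one with a finite receptor pool) would need additional state variables and is not covered by this sketch.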

      “Second, crucial to the receptor model here is the inference that D1 receptor unbinding is rapid; but this inference is made based on the kinetics of dopamine sensors and is superficially explained - it is unclear why sensor kinetics should let us extrapolate to receptor kinetics, and unclear how safe is the extrapolation of the linear regression by an order of magnitude to get the D1 unbinding rate.”

      We chose to use the sensors because it was possible to estimate precise affinities/off-rates from the fluorescent measurements. Although there might be some variation in affinities attributable to the mutations introduced in the sensors, the data clearly separated D1R and D2R, with a D1R affinity of ~1000 nM and a D2R affinity of ~7 nM (Labouesse & Patriarchi, 2021), consistent with earlier predictions of receptor affinities. From our assessment of the literature, we found that this was the most reasonable way to estimate affinities and thereby off-rates. Importantly, the model has been made publicly available, so should new measurements arise, the simulations can be rerun with adjusted input parameters. To address the concern, we have also expanded on the logic applied in the updated manuscript (please see lines 177-178).

      Reviewing editor Comments : 

      The paper could benefit from a critical confrontation not only with existing modeling work as mentioned by the reviewers, but also with existing empirical data on pauses, D2 MSN excitability, and plasticity/learning.”

      We thank both the editor and the reviewers for their suggestions on how to improve the manuscript. We have incorporated further modelling on D1R and D2R response to pauses and bursts and expanded our discussion of the results in relation to existing evidence (please see our responses to the reviewers above and the revised text in the manuscript).

      Reviewer #1 (Recommendations for the authors): 

      “(1) Many figure panels are too small to read clearly - e.g. "cross-section over time" plots.”

      We agree with the reviewer and have increased the size of panels in several of the figures.

      (2) Supplementary Videos of the model in action might be useful (and fun to watch).”

      Great idea. We have generated videos of both bursts in the 3D projections and the resulting D1R and D2R occupancy in 2D. The videos are included as supplementary material as Videos S1 and S2 and referred to in the text of the revised manuscript.

      ” (3) Line 305: " Further, the cusp-like behaviour of Vmax in VS was independent of both Q and R%..." 

      It is not clear what the "cusp" refers to here.”

      We agree this is a confusing sentence. We have rewritten and eliminated the use of the vague “cusp” terminology in the manuscript.

      ” (4) Line 311: "We therefore reanalysed data from our previously published comparison of fibre photometry and microdialysis and found evidence of natural variations in the release-uptake balance of the mice (Figure 5F,G)" This figure seems to be missing altogether.”

      The sentence in question was missing an “S” to indicate a supplementary figure. We apologize for the confusion and have corrected the text.

      (5) Figure 1: 

      1b: need numbers on the color scale.”

      We have added numbers in the updated manuscript.

      ”1c: adding an earlier line (e.g. 2ms) could be helpful?”

      We have added a 2 ms line to aid the readers.

      ”1d: do the colors show DA concentration on the visible surfaces of the cube or some form of projection?”

      The colors show concentrations on the surface. We have expanded the text to clarify this.

      ”1e: is this "cross-section" a randomly-selected line (i.e. 1D) through the cube?”

      The cross-section is midway through the cube. We have clarified this in the text.

      ”1f: "density" misspelled.”

      We thank the reviewer for the keen eye. The error has been corrected.

      ”1g: color bars indicating stimulation time would be improved if they showed the individual stimulation pulses instead.”

      The burst is simulated as a Poisson distribution and individual pulses may therefore be misleading.

      ” Why does the burst simulation include all release sites in a 10x10x10µm cube? Please justify this parameter choice.

      1h: "1/10" - the "10" is meaningless for a single pulse, right?”

      Yes, we agree. 

      ”1i: is this the concentration for a single voxel? Or the average of voxels that are all 1µm from one specific release site?”

      Thank you for pointing out the confusing language. The figure is for a voxel containing a release site (with a voxel size of 1 µm in diameter).

      The legend seems a bit different from the description in the main text ("within 1µm"). As it stands, I also can't tell whether the small DA peaks are related to that particular release site, or to others. 

      We have updated the text to clear up the confusing language.

      ” (6) Figure 2: 

      2h: I'm not sure that the "relative occupancy" normalized measure is the most helpful here.”

      We believe the figure helps to illustrate that the sphere of influence on receptors from a single burst is greater in VS than in DS, suggesting that DS can process information with tighter spatial control. Using a relative measure allows for a more accessible comparison of the sphere of influence in a single figure.

      ” (7) Figure 3: 

      The schematics need improvement.

      3a – would be more useful if it corresponded better to the actual simulation (e.g. we had a spatial scale shown). 

      3d – is this really useful, given the number of molecules shown is so much lower than in the simulation? 

      3h, 3j – need more explanation, e.g. axis labels. ”

      The schematics are intended to quickly inform the readers what parameters are tuned in the following figures, and not to be exact representations. However, we agree Figures 3h and 3j need axis labels, and we have accordingly added these.

      (8) Figure 4: 

      4m, n were not clearly explained. 

      We agree and have elaborated the explanation of these figures in the manuscript (lines 374-377).

      ” (9) From Figure S1 it appears that the definition of "DS" and "VS" used is above and below the anterior commissure, respectively. This doesn't seem reasonable - many if not most studies of "VS" have examined the nucleus accumbens core, which extends above the anterior commissure. Instead, it seems like the DAT expression difference observed is primarily a difference between accumbens Shell and the rest of the striatum, rather than DS vs VS.”

      We assume that the reviewer refers to Figure S3 and not S1. First, we would like to highlight that we had mislabeled VMAT2 and DAT in Figure S3C (now corrected). Apologies for the confusion. Second, as for striatal subregions, we have intentionally not distinguished between different subregions of the ventral striatum. Most of the literature on which we base our parameters does not distinguish between e.g. NAcC vs. NAcS or DLS vs. DMS. The four slices we examined in Figure 3A-C were not perfectly aligned in the accumbal region, and we therefore do not believe we can draw any conclusions about differences between core and shell.

      Reviewer #2 (Recommendations for the authors): 

      (1) Modelling assumptions: 

      The burst activity simulations seem conceptually flawed. How were release sites assigned to the 150 neurons? The burst activity simulations such as Figure 1g show a spatially localised release, but this means either (1) the release sites for one DA neuron are all locally clustered, or (2) only some release sites for each DA neuron are receiving a burst of APs, those release sites are close together, and the DA neurons' other release sites are not receiving the burst. Either way, this is not plausible.”

      We apologize for the confusion; however, we disagree that the simulations are conceptually flawed. It is important to note that the burst simulation is spatially restricted in order to investigate local DA dynamics and how well different parts of the striatum can gate spill-over and receptor activation. The conditions may mimic local action potentials generated by nicotinic receptor activation (see e.g. Liu et al., Science 2022 or Matityahu et al., Nature Comm 2023). We have accordingly expanded on this in the manuscript in lines 148-151.

      (2) Data and its reporting: 

      Comparison to May and Wightman data: if we're meant to compare DS and VS concentrations, then plot them together; what were the experimental results (just says "closely resembled the earlier findings")?”

      Unfortunately, the quantitative values of the May and Wightman (1989) data are not publicly available. We are therefore limited to visual comparison and cannot replot the values.

      ” Figures S3b and c do not agree: Figure S3b shows DAT staining dropping considerably in VS; Fig 3c does not, and neither do the quoted statistics.”

      We had accidentally mixed up the labels in Figure S3c. Thank you for spotting this. We have corrected this in the updated manuscript.

      ” How robust are the results of simulations of the same parameter set? Figures S3D and E imply 5 simulations per burst paradigm, but these are not described.”

      The bursts are simulated with a Poisson distribution as described in Methods under Three-dimensional finite difference model. This induces a stochastic variation in the simulations that mimics the empirical observations (see Dreyer et al., J. Neurosci., 2010).
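
      The source of this run-to-run variation can be illustrated with a minimal sketch, which is not the simulation code itself. It assumes that each release site receives an independent Poisson-distributed number of action potentials per time step; the 20 Hz rate, 0.3 s duration, 100 sites, and the function name `poisson_spike_counts` are hypothetical choices for illustration, and release probability and vesicle release are deliberately left out.

```python
import numpy as np


def poisson_spike_counts(rate_hz, duration_s, dt_s, n_sites, rng):
    """Action-potential counts per release site and time step.

    Each entry is an independent Poisson draw with mean rate_hz * dt_s.
    """
    n_steps = int(round(duration_s / dt_s))
    return rng.poisson(rate_hz * dt_s, size=(n_sites, n_steps))


# Five repeats of the same hypothetical burst paradigm, differing only in seed.
for seed in range(5):
    rng = np.random.default_rng(seed)
    counts = poisson_spike_counts(rate_hz=20.0, duration_s=0.3, dt_s=1e-3, n_sites=100, rng=rng)
    print(f"seed {seed}: mean action potentials per site = {counts.sum(axis=1).mean():.2f}")
```

      Because every repeat draws a different spike train from the same rate, summary measures differ slightly between repeats even though the parameter set is identical.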

      ” I found it rather odd that the robustness of the receptor binding results is not checked across the changes in model parameters. This seems necessary because most of the changes, such as increasing the quantal release or the number of sites, will obviously increase dopamine concentration, but they do not necessarily meaningfully increase receptor activation because of saturation (and, in more complex receptor binding models, because of the number of available receptors).”

      This is an excellent point. However, we decided not to address this in the present study as we would argue that such additional simulations are not a necessity for our main conclusions. Instead, we decided in the revised version to focus on simulations mirroring a range of different receptor affinities as described in detail above. 

      “Figure 4H: how can unclustered simulations have a different concentration at the centre of a "cluster" than outside, when the uptake is homogenous? Why is clustering of DAT "efficient"? [line 359]”

      This is a great observation. The drop is compared to the average of the simulation space. Despite no clusters, the uniform scenario still has a concentration gradient towards the surface of the varicosity. We have elaborated on this in the manuscript on lines 346-349.

      “The Discussion conclusions about what D1Rs and D2Rs cannot track are not tested in the paper (e.g. ramps). Either test them or make clear what is speculation.”

      This is an excellent point: some of the claims in the discussion were not fully supported. We have added a simulation with a chain of burst firings to highlight how the temporal integration differs between the two receptors, and we have updated the wording in the discussion to exclude ramps, as these were not explicitly tested. See lines 191-193 and Figure S1G.

      “(3) Organisation of paper:

      Consistency of terminology. These terms seem to be used to describe the same thing, but it is unclear if they are: release sites, active terminals (Table 1), varicosity density. Likewise: release probability, release fraction.”

      Thank you for pointing this out. We have revised the manuscript and cleared up terminology on release sites. However, release probability and release-capable fraction of varicosities are two separate concepts.

      “The references to the supplementary figure are not in sequence, and the panels assigned to the supplemental figures seem arbitrary in what is assigned to each figure and their ordering. As Figures 1 and 2 are to be directly compared, so plot the same results in each. Figure S1F is discussed as a key result, but is in a supplemental figure.”

      Thank you for identifying this. We have updated the figure references and moved Figure S1F into the main figures, as we agree this is a main finding.

      “The paper frequently reads as a loose collection of observations of simulations. For example, why look at the competitive inhibition of DA by cocaine [Fig 3H-I]? The nanoclustering of DAT (Figure 4) seems to be partial work from a different paper - it is unclear why the Vmax results warrant that detailed treatment here, especially as no rationale is offered for why we would want Vmax to change.”

      We apologize if the paper reads as a loose collection of observations of simulations. This is certainly not the case. We examined cocaine competition because it modulates the Km value for DA and because we wanted to examine how sensitive the dopamine dynamics are to changes in individual model parameters (Km in this case). We noticed that Vmax had a different effect in DS and VS. Accordingly, we gave it particular focus because it is a physiological parameter that can be modified and, if modified, can have a potentially large impact on striatal DA dynamics. Importantly, it is well known that the DA transporter (DAT) is subject to cellular regulation of its surface expression, e.g. by internalization/recycling, and thereby of its uptake capacity (Vmax). Furthermore, we present evidence in this study that uptake capacity can be modulated on a much faster time scale by nanoclustering, which suggests a potentially novel type of synaptic plasticity. We find this rather interesting and therefore decided to focus on it in the manuscript.
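      The distinction between changing Km and changing Vmax can be sketched with the standard Michaelis-Menten rate law, in which a competitive inhibitor raises the apparent Km and leaves Vmax untouched. This is a textbook illustration under assumed values, not the model used in the manuscript: the function name and all numbers below are hypothetical and in arbitrary but consistent units.

```python
def uptake_rate(da, vmax, km, inhibitor=0.0, ki=1.0):
    """Michaelis-Menten uptake rate with an optional competitive inhibitor.

    A competitive inhibitor leaves vmax unchanged and scales the
    apparent Km by (1 + inhibitor / ki).
    """
    km_apparent = km * (1.0 + inhibitor / ki)
    return vmax * da / (km_apparent + da)


# Hypothetical parameter values, chosen only for illustration.
baseline = uptake_rate(da=0.1, vmax=4.0, km=0.2)
with_inhibitor = uptake_rate(da=0.1, vmax=4.0, km=0.2, inhibitor=1.0, ki=0.5)
halved_vmax = uptake_rate(da=0.1, vmax=2.0, km=0.2)
```

      In this form, halving Vmax halves the uptake rate at every concentration, whereas raising the apparent Km slows uptake mainly at low concentrations and has little effect once the transporter is saturated.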

      “What are the axes in Figure 3H and Figure 3J?”

      We have updated the figures to include axis labels. Thank you for pointing out this omission.

      “Much is made of the sensitivity to Vmax in VS versus DS, but this was hard work to understand. It took me a while to work out that Figure 3K was meant to indicate the range of Vmax that would be changed in VS and DS respectively. "Cusp-like behaviour" (line 305) is unclear.”

      We agree that the original language was unclear, including the term “cusp-like behaviour”. We have updated the description and removed the confusing terminology. See line 366.

      “The treatment of highly relevant prior work, especially that of Hunger et al 2020 and Dreyer et al (2010, 2014), is poor, being dismissed in a single paragraph late in the Discussion rather than explicating how the current paper's results fit into the context of that work. The authors may also want to discuss the anticipation of their conclusions by Wickens and colleagues, including dopamine hotspots (https://doi.org/10.1016/j.tins.2006.12.003) and differences between DS and VS dopamine release (https://doi.org/10.1196/annals.1390.016).”

      We thank the reviewer for the suggested discussion points and have included and discussed references to the work by Wickens and colleagues (see lines 407-411 and 418-420).

      “(4) Methods:

      Clarify the FSCV simulations: the function I_FSCV was convolved with the simulated [DA] signal?”

      Yes. We have clarified this in the Methods section on lines 593-594.
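      To make the convolution step concrete, here is a minimal sketch of convolving a simulated concentration trace with a response kernel. The actual form of I_FSCV is not given in this response, so the normalized exponential kernel, the time step, the time constant, and the step-pulse trace below are all hypothetical stand-ins.

```python
import math


def convolve(signal, kernel):
    """Causal discrete convolution, truncated to the length of `signal`."""
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        for k in range(min(n + 1, len(kernel))):
            out[n] += kernel[k] * signal[n - k]
    return out


dt = 0.01   # s, hypothetical time step
tau = 0.1   # s, hypothetical response time constant
kernel = [math.exp(-i * dt / tau) for i in range(100)]
norm = sum(kernel)
kernel = [k / norm for k in kernel]  # unit area, so a long plateau is preserved

# Hypothetical [DA] trace: a 0.2 s step pulse of unit amplitude.
da = [0.0] * 50 + [1.0] * 20 + [0.0] * 130
measured = convolve(da, kernel)
```

      With these assumed values the convolved trace rises and decays more slowly than the underlying pulse and never reaches its full amplitude, which is the qualitative effect such a filter has on fast transients.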

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      The study by Gupta et al. investigates the role of mast cells (MCs) in tuberculosis (TB) by examining their accumulation in the lungs of M. tuberculosis-infected individuals, non-human primates, and mice. The authors suggest that MCs expressing chymase and tryptase contribute to the pathology of TB and influence bacterial burden, with MC-deficient mice showing reduced lung bacterial load and pathology. 

      Strengths: 

      (1) The study addresses an important and novel topic, exploring the potential role of mast cells in TB pathology. 

      (2) It incorporates data from multiple models, including human, non-human primates, and mice, providing a broad perspective on MC involvement in TB. 

      (3) The finding that MC-deficient mice exhibit reduced lung bacterial burden is an interesting and potentially significant observation. 

      Weaknesses: 

      (1) The evidence is inconsistent across models, leading to divergent conclusions that weaken the overall impact of the study. 

      The strength of the study is the use of multiple models including mouse, nonhuman primate as well as human samples. The conclusions have now been refined to reflect the complexity of the disease and the use of multiple models.

      (2) Key claims, such as MC-mediated cytokine responses and conversion of MC subtypes in granulomas, are not well-supported by the data presented.

      To address the reviewer’s comments, we will carry out further experimentation to strengthen the link between MC subtypes and cytokine responses.

      (3) Several figures are either contradictory or lack clarity, and important discrepancies, such as the differences between mouse and human data, are not adequately discussed. 

      We will further clarify the figures and streamline the discussions between the different models used in the study. 

      (4) Certain data and conclusions require further clarification or supporting evidence to be fully convincing. 

      We will either provide clarification or supporting evidence for some of the key conclusions in the paper. 

      Reviewer #2 (Public review): 

      Summary: 

      The submitted manuscript aims to characterize the role of mast cells in TB granuloma. The manuscript reports heterogeneity in mast cell populations present within the granulomas of tuberculosis patients. With the help of previously published scRNAseq data, the authors identify transcriptional signatures associated with distinct subpopulations. 

      Strengths: 

      (1) The authors have carried out a sufficient literature review to establish the background and significance of their study. 

      (2) The manuscript utilizes a mast cell-deficient mouse model, which demonstrates improved lung pathology during Mtb infection, suggesting mast cells as a potential novel target for developing host-directed therapies (HDT) against tuberculosis. 

      Weaknesses: 

      (1) The manuscript requires significant improvement, particularly in the clarity of the experimental design, as well as in the interpretation and discussion of the results. Enhanced focus on these areas will provide better coherence and understanding for the readers. 

      The strength of the study is the use of multiple models including mouse, nonhuman primate as well as human samples. The conclusions have now been refined to reflect the complexity of the disease and the use of multiple models.

      (2) Throughout the manuscript, the authors have mislabelled the legends for WT B6 mice and mast cell-deficient mice. As a result, the discussion and claims made in relation to the data do not align with the corresponding graphs (Figure 1B, 3, 4, and S2). This discrepancy undermines the accuracy of the conclusions drawn from the results. 

      We apologize for the discrepancy, which will be corrected in the revised manuscript.

      (3) The results discussed in the paper do not add a significant novel aspect to the field of tuberculosis, as the majority of the results discussed in Figure 1-2 are already known and are a re-validation of previous literature.

      This is the first study which has used mouse, NHP and human TB samples from Mtb infection to characterize and validate the role of MC in TB. We believe the current study provides significant novel insights into the role of MC in TB. 

      (4) The claims made in the manuscript are only partially supported by the presented data. Additional extensive experiments are necessary to strengthen the findings and enhance the overall scientific contribution of the work.

      We will either provide clarification or supporting evidence for some of the key conclusions in the paper.

      Reviewer #1 (Recommendations for the authors):

      In the study by Gupta et al., the authors report an accumulation of mast cells (MCs) expressing the proteases chymase and tryptase in the lungs of M. tuberculosis-infected individuals and non-human primates, as compared to healthy controls and latently infected individuals. They also report that MCs appear to play a pathological role in mice. Notably, MC-deficient mice show reduced lung bacterial burden and pathology during infection.

      While the topic is of interest, the study is overall quite preliminary, and many conclusions are not well-supported by the presented data. The reliance on three different models, each suggesting divergent outcomes, weakens the ability to draw definitive conclusions. Specifically, the claim that "MCs (...) mediate cytokine responses to drive pathology and promote Mtb susceptibility and dissemination during TB" is not substantiated by the data.

      Major comments

      (1) In human samples, the authors conclude that "While MCTCs accumulated in early immature granulomas within TB lesions, MCCs accumulated in late granulomas in TB patients" and that MCTs "likely convert first to MCTCs in early granulomas before becoming MCCs in late mature granulomas with necrotic cores." However, Figure 1B shows the opposite. Furthermore, the assertion that MCTs "convert" into MCTCs is not justified by the data.

      Corrections have been made to the figures to ensure clarity for the reader. We demonstrate accumulation of tryptase-expressing MCs in healthy individuals, while dual tryptase- and chymase-expressing MCs were seen in early granulomas, and only chymase-associated MCs were observed in late granulomas, which display more disease pathology. We have removed the line as advised by the reviewer.

      (2) In Figure 2 I and J, the panels do not demonstrate co-expression of chymase and tryptase in clusters 0, 1, and 3 in PTB samples, which contradicts the histology data. This discrepancy is left unaddressed and raises concerns about the conclusions drawn from Figures 1 and 2.

      We thank the reviewer for pointing this out. We revisited the data and now show the co-expression in dual-expressing cells (Figure 2H). The discrepancy stemmed from the cross-species nature of the dataset: there is considerable divergence in tryptase sequence and function between humans and NHPs (Trivedi et al., 2007). We now explain this in the relevant section (lines 313-364). Briefly, TPSG1 (encoding γ tryptase) and TPSD1 (encoding δ tryptase) have the same gene names in humans and NHPs, whereas the more widely expressed TPSAB1 (encoding α/β tryptase) is named differently in NHPs, because the NHP genes are still annotated as predicted putative proteins. The putative NHP genes that map to human TPSAB1 are LOC699599 for M. mulatta and LOC102139613 for M. fascicularis. Thus, searching for the TPSAB1 gene yielded no result in our previous analysis, but examining these orthologous genes now reproduces the results we see in the histology data. To strengthen our findings, we have also analyzed an additional single-cell dataset from the lungs of the NHP M. fascicularis (Figure 2J-L) and found co-expression of chymase and tryptase, adding an important validation of our histological findings.
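      The gene-name issue described above can be sketched as a small lookup that translates the human symbol into the identifier used in each species before counting dual-expressing cells. The identifiers for TPSAB1 are those quoted in the response; everything else is an assumption made for illustration, including the helper names, the toy count dictionaries, and the use of CMA1 as the chymase gene symbol in macaque data. This is not the authors' analysis pipeline.

```python
# Identifiers as quoted in the response; the helpers are illustrative only.
TPSAB1_ORTHOLOGS = {
    "Homo sapiens": "TPSAB1",
    "Macaca mulatta": "LOC699599",
    "Macaca fascicularis": "LOC102139613",
}


def tryptase_gene(species):
    """Return the identifier to search for alpha/beta tryptase in `species`."""
    try:
        return TPSAB1_ORTHOLOGS[species]
    except KeyError:
        raise ValueError(f"no TPSAB1 ortholog recorded for {species!r}") from None


def count_dual_expressing(cells, species, chymase_gene="CMA1"):
    """Count cells with non-zero counts for both tryptase and chymase.

    `cells` is a list of {gene identifier: count} dictionaries.
    `chymase_gene` defaults to CMA1, which is an assumption here.
    """
    tryptase = tryptase_gene(species)
    return sum(
        1
        for counts in cells
        if counts.get(tryptase, 0) > 0 and counts.get(chymase_gene, 0) > 0
    )


# Toy data: three hypothetical cells from a macaque dataset.
toy_cells = [
    {"LOC699599": 3, "CMA1": 1},
    {"LOC699599": 2},
    {"TPSAB1": 5, "CMA1": 4},  # human symbol, so not matched for macaque
]
n_dual = count_dual_expressing(toy_cells, "Macaca mulatta")
```

      Searching the same toy data for the human symbol alone would miss the first cell, which mirrors why the earlier analysis returned no tryptase signal.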

      (3) Figure 2 serves more as a resource and contributes little to the core findings of the study. It might be better suited as supplementary material.

      We thank the reviewer for the suggestion; however, we believe that Figure 2 serves as an independent validation in a different species (NHP), showing heterogeneity in MCs across species in a TB model. The figure adds value because only a handful of studies describe MCs at the single-cell level (Tauber et al., 2023; Derakhshan et al., 2022; Cildir et al., 2021), and none of them in TB apart from one from our group showing an MC cluster in Mtb-infected macaques (Esaulova et al., 2021). We feel strongly that dissecting MCs as specifically done here provides an important insight into the transcriptional heterogeneity of these cells linked to disease states. We have also added an additional NHP lung single-cell dataset (Gideon et al., 2022) to complement our analysis, thus adding another validation and strengthening these findings. We therefore believe the figure should be retained as an integral part of the main paper.

      (4) In lines 275-277, the data referenced should be shown to support the claims.

      We thank the reviewer for the suggestion. The text originally noted by the reviewer now appears in the revised manuscript at lines 370-372, and the corresponding data have now been included as Supplementary Figure S3.

      (5) In Figure 3B, the difference between the two mouse strains becomes non-significant by day 150 pi, weakening the overall conclusion that MCs contribute to the bacterial burden.

      At 100 dpi, MC-deficient mice exhibit lower Mtb CFU in both the lung and spleen, indicating improved protection. By 150 dpi, lung CFU differences are no longer significant; however, dissemination to the spleen remains reduced in MC-deficient mice. Thus, the overall conclusion that MCs contribute to increased bacterial burden remains valid, particularly with respect to dissemination. This conclusion is further supported by new data showing that adoptive transfer of MCs into B6 Mtb-infected mice increased Mtb dissemination to the spleen (Figure 5E). 

      (6) Figures 3D and E are not particularly convincing.

      Figures 3D and 3E illustrate lung inflammation in MC-deficient mice compared with wild-type mice and show that MC-deficient mice exhibit significantly less inflammation at 150 dpi, supporting a role for MCs in driving lung inflammation.

      (7) In Figures 4 and S3, the color coding in panels A-F appears incorrect but is accurate in G. This inconsistency is confusing.

      We thank the reviewer for noting this. The color coding has been corrected to ensure consistency across all figures.

      (8) In the mouse model, MCs seem to disappear during infection, in contrast to observations in human and macaque samples. This discrepancy is not discussed in the paper.

      We thank the reviewer for this important observation. In response, we performed a new analysis of lung MCs at baseline in wild-type and MC-deficient mice. Our data show that naïve wild-type lungs contain a small population of MCs, which is further reduced in MC-deficient mice. Following Mtb infection, MCs progressively accumulate in wild-type mice, whereas this accumulation is significantly impaired in MC-deficient mice. These new data are now included in Figure 4A and described in the text (lines 395-403).

      (9) In lines 306-307, data should be shown to support the claims.

      We thank the reviewer for the suggestion. The text originally noted by the reviewer now appears in the revised manuscript at lines 399-400, and the corresponding data have now been included as Supplementary Figure S4.

      Minor comments

      (1) What does "granuloma-associated" cells mean in samples from healthy controls?

      We thank the reviewer for this point. The language in Figure 1 has been revised to refer accurately to cells in the lung parenchyma rather than “granuloma-associated” cells.

      (2) In line 229, it is unclear what "these cells" refers to.

      The phrase “these cells” refers to tryptase-expressing mast cells. This has now been clarified in the revised manuscript (lines 276-277).

      (3) The citation of Figure 3A in lines 284-285 is misplaced in the text and should be corrected.

      The figure citation has been corrected in the text in the revised manuscript (lines 376-379).

      Reviewer #2 (Recommendations for the authors):

      (1) The data presented in Figure 1 seems to be a re-validation of the already known aspects of mast cells in TB granulomas. While distinct roles for mast cells in regulating Mtb infection have been reported, the manuscript appears to be a failed opportunity to characterize the transcriptional signatures of the distinct subsets and identify their role in previously reported processes towards controlling TB disease progression.

      We thank the reviewer for the insight. It was not our intent to investigate the bulk transcriptome: the high number of cells required to obtain enough RNA for transcriptomic sequencing makes this technically challenging, given the low abundance of mast cells during TB infection (Figure 2). This motivated Figure 2, in which we used a more sensitive transcriptomic analysis to identify the different transcriptional states in the distinct TB disease states. We believe that this analysis captures the essence of what the reviewer suggests and provides meaningful insights into mast cell heterogeneity during TB.

      (2) The experiments lack uniformity with respect to the strains of Mtb used for experimentation. For eg: Mtb strain HN878 was used for aerosol infection of mice while Mtb CDC1551 was used for macaques. If there were experimental constraints with respect to the choice, the same should be mentioned.

      We thank the reviewer for this comment. The Mtb strain usage has been consistent within each species: HN878 for mice and CDC1551 for non-human primates (NHPs), in line with prior studies from our lab. The species-specific choice reflects the differences in pathogenicity of these strains in mice versus NHPs. CDC1551, which exhibits lower virulence, allows the development of a macaque model that recapitulates human latent to chronic TB when administered via aerosol at low to moderate doses (Kaushal et al., 2015; Sharan et al., 2021; Singh et al., 2025). In contrast, the more virulent HN878 strain leads to severe disease and high mortality in NHPs and is therefore not suitable for these models. Using CDC1551 in macaques provides a controlled and clinically relevant platform to study immunological and pathophysiological mechanisms of TB, justifying its use in the current study. This explanation has now been added to the manuscript method section (lines 109-114).

      (3) Line 84- 85, the authors state that "Chymase positive MCs contribute to immune pathology and reduced Mtb control". Previous reports including Garcia-Rodriguez et al., 2021 associate high MCTCs with improved lung function. Additionally, in the macaques model of latent TB infection reported in the manuscript, the number of chymase-expressing MCs seems to significantly decrease. The authors should justify the same. 

      We thank the reviewer for this comment. In Garcia-Rodriguez et al., 2021, chymase-expressing MCs accumulate in fibrotic lung lesions. Fibrosis is a result of excessive inflammation in TB infection and is associated with lung damage. Similarly, in idiopathic pulmonary fibrosis, higher density and percentage of chymase-expressing MCs correlate positively with fibrosis severity (Andersson et al., 2011). In our study, although fibrosis was not directly assessed, chymase-positive MCs increased in late lung granulomas, consistent with advanced inflammatory disease. Therefore, our conclusion that chymase-producing MCs contribute to lung pathology is justified and aligns with prior observations.

      (4) The manuscript would benefit from a brief description of the experimental conditions for the previously published scRNAseq data used in the current study.

      We thank the reviewer for the suggestion, and the information has been included in the final manuscript (lines 294-297) and represented as Figure 2A.

      (5) The authors have not mentioned the criteria used to categorize early and late granulomas in TB patients. A lucid description of the same is necessary.

      Based on the reviewer’s comment, a detailed categorization of early and late granulomas in TB patients is now included in the revised manuscript (lines 256-260). Early granulomas: discrete conglomerates of immune cells and resident stromal cells with defined borders and absence of central necrosis. Late granulomas: large and dense clusters of immune cells and resident cells with an evident necrotic center containing bacteria and dead neutrophils, and with lymphocytic infiltrating cells on the periphery of the necrotic center. MCs were measured in the periphery and inside early granulomas, while in the late granulomas, they were mainly quantified in the periphery.

      (6) The authors mention that "While MCTCs accumulated in early immature granulomas within TB lesions, MCCs accumulated in late granulomas in TB patients". While this is evident from the representative, the quantification in Figure 1B seems to indicate otherwise.

      We thank the reviewer for pointing this out. The labeling in the quantitative analysis shown in Figure 1B has been corrected in the revised manuscript to accurately reflect the accumulation of MCTCs in early granulomas and MCCs in late granulomas.

      (7) The labelling followed in Figures 3, 4 and S2 do not match with the discussion. Such errors should be rectified to minimize any ambiguity within the text of the manuscript.

      We thank the reviewer for noting this. The color coding has been corrected to ensure consistency across all figures.

      (8) The mast cell deficient mice model has a differential number of immune cells at the site of granuloma as reported in the manuscript. This could contribute to the altered mycobacterial survival and inflammation cytokine production in the lung and hence might not be a direct effect of mast cell depletion. The authors can consider reconstituting mast cell populations to analyze the mast cell function.

      We thank the reviewers for this suggestion. In the revised manuscript, we have adoptively transferred MCs into WT mice before Mtb challenge to assess if this would increase inflammation and Mtb CFU in the lung and spleen. Our results show that while lung inflammation was not impacted, we found that the dissemination to the spleen and the frequency of neutrophils in the lung were increased in WT mice that received MCs (Figure 5, lines 429-443).

      (9) Line 295- 297, the authors state "MCs continued to accumulate in the lung up to 100 dpi in CgKitWsh mice, following which the MC numbers decreased at later stages". However, the quantification in Figure 4A does not reflect the same. This should be addressed.

      In response to the reviewers' comments, we conducted a new analysis of lung MCs at baseline, comparing wild-type and MC-deficient mice. The revised data show that MC-deficient mice have fewer mast cells at baseline compared to B6 mice. Furthermore, mast cell numbers increase during infection, peaking at 100 days post-infection (dpi) and subsequently stabilizing by 150 dpi. The revised data have been included in Figure 4A and in the text (lines 395-403).

      (10) Additionally, while the scRNAseq data reflects a lower production of TNF in pulmonary TB granulomas, the mice deficient in mast cells are discussed to have a lower production of proinflammatory cytokines.

      Mast cells increasing and contributing to TB pathogenesis is the theme of the paper and, as such, we see an increase in the IFNG pathway genes and a similar reduction in the production of pro-inflammatory cytokines. The relative decrease in TNF pathway gene expression can be reconciled by the fact that lower TNF gene expression in PTB could also represent loss of Mtb control and increased pathogenesis (Yuk et al., 2024), which is maintained in the LTBI/HC clusters. A higher bacterial burden of Mtb can also decrease host TNF production, which is in line with what we observe here (Olsen et al., 2016; Reed et al., 2004; Kurtz et al., 2006).

      (11) The authors have not annotated Figure 2 I and J in the text while describing their results and interpretation.

      We thank the reviewer for noting this. Figure 2 has been revised, and the results pointed out by the reviewer have been added to the revised manuscript.

      (12) In line 284, the authors have discussed the results pertaining to Figure 3B, however, mentioned it as Figure 3A in the text.

      We thank the reviewer for noting this and the corrections have been made in the revised manuscript (lines 379-384).

      References

      ANDERSSON, C. K., ANDERSSON-SJOLAND, A., MORI, M., HALLGREN, O., PARDO, A., ERIKSSON, L., BJERMER, L., LOFDAHL, C. G., SELMAN, M., WESTERGREN-THORSSON, G. & ERJEFALT, J. S. 2011. Activated MCTC mast cells infiltrate diseased lung areas in cystic fibrosis and idiopathic pulmonary fibrosis. Respir Res, 12, 139.

      CILDIR, G., YIP, K. H., PANT, H., TERGAONKAR, V., LOPEZ, A. F. & TUMES, D. J. 2021. Understanding mast cell heterogeneity at single cell resolution. Trends Immunol, 42, 523-535.

      DERAKHSHAN, T., BOYCE, J. A. & DWYER, D. F. 2022. Defining mast cell differentiation and heterogeneity through single-cell transcriptomics analysis. J Allergy Clin Immunol, 150, 739-747.

      ESAULOVA, E., DAS, S., SINGH, D. K., CHORENO-PARRA, J. A., SWAIN, A., ARTHUR, L., RANGEL-MORENO, J., AHMED, M., SINGH, B., GUPTA, A., FERNANDEZ-LOPEZ, L. A., DE LA LUZ GARCIA-HERNANDEZ, M., BUCSAN, A., MOODLEY, C., MEHRA, S., GARCIA-LATORRE, E., ZUNIGA, J., ATKINSON, J., KAUSHAL, D., ARTYOMOV, M. N. & KHADER, S. A. 2021. The immune landscape in tuberculosis reveals populations linked to disease and latency. Cell Host Microbe, 29, 165-178 e8.

      GARCIA-RODRIGUEZ, K. M., BINI, E. I., GAMBOA-DOMINGUEZ, A., ESPITIA-PINZON, C. I., HUERTA-YEPEZ, S., BULFONE-PAUS, S. & HERNANDEZ-PANDO, R. 2021. Differential mast cell numbers and characteristics in human tuberculosis pulmonary lesions. Sci Rep, 11, 10687.

      GIDEON, H. P., HUGHES, T. K., TZOUANAS, C. N., WADSWORTH, M. H., 2ND, TU, A. A., GIERAHN, T. M., PETERS, J. M., HOPKINS, F. F., WEI, J. R., KUMMERLOWE, C., GRANT, N. L., NARGAN, K., PHUAH, J. Y., BORISH, H. J., MAIELLO, P., WHITE, A. G., WINCHELL, C. G., NYQUIST, S. K., GANCHUA, S. K. C., MYERS, A., PATEL, K. V., AMEEL, C. L., COCHRAN, C. T., IBRAHIM, S., TOMKO, J. A., FRYE, L. J., ROSENBERG, J. M., SHIH, A., CHAO, M., KLEIN, E., SCANGA, C. A., ORDOVAS-MONTANES, J., BERGER, B., MATTILA, J. T., MADANSEIN, R., LOVE, J. C., LIN, P. L., LESLIE, A., BEHAR, S. M., BRYSON, B., FLYNN, J. L., FORTUNE, S. M. & SHALEK, A. K. 2022. Multimodal profiling of lung granulomas in macaques reveals cellular correlates of tuberculosis control. Immunity, 55, 827-846 e10.

      KAUSHAL, D., FOREMAN, T. W., GAUTAM, U. S., ALVAREZ, X., ADEKAMBI, T., RANGEL-MORENO, J., GOLDEN, N. A., JOHNSON, A. M., PHILLIPS, B. L., AHSAN, M. H., RUSSELL-LODRIGUE, K. E., DOYLE, L. A., ROY, C. J., DIDIER, P. J., BLANCHARD, J. L., RENGARAJAN, J., LACKNER, A. A., KHADER, S. A. & MEHRA, S. 2015. Mucosal vaccination with attenuated Mycobacterium tuberculosis induces strong central memory responses and protects against tuberculosis. Nat Commun, 6, 8533.

      KURTZ, S., MCKINNON, K. P., RUNGE, M. S., TING, J. P. & BRAUNSTEIN, M. 2006. The SecA2 secretion factor of Mycobacterium tuberculosis promotes growth in macrophages and inhibits the host immune response. Infect Immun, 74, 6855-64.

      OLSEN, A., CHEN, Y., JI, Q., ZHU, G., DE SILVA, A. D., VILCHEZE, C., WEISBROD, T., LI, W., XU, J., LARSEN, M., ZHANG, J., PORCELLI, S. A., JACOBS, W. R., JR. & CHAN, J. 2016. Targeting Mycobacterium tuberculosis Tumor Necrosis Factor Alpha-Downregulating Genes for the Development of Antituberculous Vaccines. mBio, 7.

      REED, M. B., DOMENECH, P., MANCA, C., SU, H., BARCZAK, A. K., KREISWIRTH, B. N., KAPLAN, G. & BARRY, C. E., 3RD 2004. A glycolipid of hypervirulent tuberculosis strains that inhibits the innate immune response. Nature, 431, 84-7.

      SHARAN, R., SINGH, D. K., RENGARAJAN, J. & KAUSHAL, D. 2021. Characterizing Early T Cell Responses in Nonhuman Primate Model of Tuberculosis. Front Immunol, 12, 706723.

      SINGH, D. K., AHMED, M., AKTER, S., SHIVANNA, V., BUCSAN, A. N., MISHRA, A., GOLDEN, N. A., DIDIER, P. J., DOYLE, L. A., HALL-URSONE, S., ROY, C. J., ARORA, G., DICK, E. J., JR., JAGANNATH, C., MEHRA, S., KHADER, S. A. & KAUSHAL, D. 2025. Prevention of tuberculosis in cynomolgus macaques by an attenuated Mycobacterium tuberculosis vaccine candidate. Nat Commun, 16, 1957.

      TAUBER, M., BASSO, L., MARTIN, J., BOSTAN, L., PINTO, M. M., THIERRY, G. R., HOUMADI, R., SERHAN, N., LOSTE, A., BLERIOT, C., KAMPHUIS, J. B. J., GRUJIC, M., KJELLEN, L., PEJLER, G., PAUL, C., DONG, X., GALLI, S. J., REBER, L. L., GINHOUX, F., BAJENOFF, M., GENTEK, R. & GAUDENZIO, N. 2023. Landscape of mast cell populations across organs in mice and humans. J Exp Med, 220.

      TRIVEDI, N. N., TONG, Q., RAMAN, K., BHAGWANDIN, V. J. & CAUGHEY, G. H. 2007. Mast cell alpha and beta tryptases changed rapidly during primate speciation and evolved from gamma-like transmembrane peptidases in ancestral vertebrates. J Immunol, 179, 6072-9.

      YUK, J. M., KIM, J. K., KIM, I. S. & JO, E. K. 2024. TNF in Human Tuberculosis: A Double-Edged Sword. Immune Netw, 24, e4.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public Review): 

      Summary:

      The authors of this study sought to define a role for IgM in responses to house dust mites in the lung. 

      Strengths: 

      Unexpected observation about IgM biology 

      Combination of experiments to elucidate function 

      Weaknesses: 

      Would love more connection to human disease 

      We thank the reviewer for these comments. At the time of this publication, we have not made a concrete link with human disease. There is some anecdotal evidence of diseases such as autoimmune glomerulonephritis, Hashimoto’s thyroiditis, bronchial polyp, SLE, celiac disease and other diseases in people with low IgM. Allergic disorders are also common in people with IgM deficiency; some studies have reported a prevalence as high as 33-47%. The mechanisms for the high incidence of allergic diseases are unclear as, generally, these patients have normal IgG and IgE levels. IgM deficiency may represent a heterogeneous spectrum of genetic defects, which might explain the heterogeneous nature of disease presentations.

      Reviewer #2 (Public Review): 

      Summary: 

      The manuscript by Hadebe and colleagues describes a striking reduction in airway hyperresponsiveness in Igm-deficient mice in response to HDM, OVA and papain across the B6 and BALB-c backgrounds. The authors suggest that the deficit is not due to improper type 2 immune responses, nor an aberrant B cell response, despite a lack of class switching in these mice. Through RNA-Seq approaches, the authors identify few differences between the lungs of WT and Igm-deficient mice, but see that two genes involved in actin regulation are greatly reduced in IgM-deficient mice. The authors target these genes by CRISPR-Cas9 in in vitro assays of smooth muscle cells to show that these may regulate cell contraction. While the study is conceptually interesting, there are a number of limitations, which stop us from drawing meaningful conclusions.

      Strengths:

      Fig. 1. The authors clearly show that IgMKO mice have strikingly reduced AHR in the HDM model, despite the presence of a good cellular B cell response.

      Weaknesses: 

      Fig. 2. The authors characterize the CD4 T cell response to HDM in IgMKO mice. They have restimulated medLN cells with anti-CD3 for 5 days to look for IL-4 and IL-13, and find no discernible difference between WT and KO mice. The absence of PBS-treated WT and KO mice in this analysis means it is unclear if HDM-challenged mice are showing IL-4 or IL-13 levels above that seen at baseline in this assay.

      We thank the Reviewer for this comment. We would like to mention that only a very minimal level of IL-4 and IL-13 was detected in PBS mice. We have added a dotted line to Figure 2B to indicate the levels of cytokines in unstimulated or naïve samples. Please see Author response image 1 below for the anti-CD3-stimulated cytokine ELISA data. The levels of these cytokines are very low (not detectable) and are not changed in control WT and IgM-KO mice challenged with PBS; this is also true for PMA/ionomycin-stimulated cells.

      Author response image 1.

      The choice of 5 days is strange, given that the response the authors want to see is in already primed cells. A 1-2 day assay would have been better. 

      We agree with the reviewer that a shorter stimulation period would work. Over the years we have settled on a 5-day re-stimulation for both anti-CD3 and HDM. We have tried other time points, but we consistently get better secretion of cytokines after 5 days.

      It is concerning that the authors state that HDM restimulation did not induce cytokine production from medLN cells, since countless studies have shown that restimulation of medLN would induce IL-13, IL-5 and IL-10 production from medLN. This indicates that the sensitization and challenge model used by the authors is not working as it should. 

      We thank the reviewer for this observation. In our recent paper showing how antigen load affects B cell function, we used very low levels of HDM to sensitise and challenge mice (1 ug and 3 ug respectively); see the article below, Hadebe et al., 2021 JACI. This is because labs that have used these low HDM levels also suggested that antigen load impacts B cell function, especially their role in germinal centres. We believe the reason we see low or undetectable levels of cytokines is this low antigen load for sensitisation and challenge. In other manuscripts we have published or are about to publish, we have shown that normal HDM sensitisation load (1 ug or 100 ug) and challenge (10 ug) do induce cytokine release upon restimulation with HDM. See the article below by Khumalo et al., 2020 JCI Insight (Figure 4A).

      Sabelo Hadebe*, Jermaine Khumalo, Sandisiwe Mangali, Nontobeko Mthembu, Hlumani Ndlovu, Amkele Ngomti, Martyna Scibiorek, Frank Kirstein, Frank Brombacher*. Deletion of IL-4Ra signalling on B cells limits hyperresponsiveness depending on antigen load. doi.org/10.1016/j.jaci.2020.12.635

      Jermaine Khumalo, Frank Kirstein, Sabelo Hadebe*, Frank Brombacher*. IL-4Rα signalling in regulatory T cells is required for dampening allergic airway inflammation through inhibition of IL-33 by type 2 innate lymphoid cells. JCI Insight. 2020 Oct 15;5(20):e136206. doi: 10.1172/jci.insight.136206

      The IL-13 staining shown in panel c is also not definitive. One should be able to optimize their assays to achieve a better level of staining, to my mind. 

      We agree with the reviewer that a much higher frequency of IL-13-producing CD4 T cells should be observed. We don’t think this is a technical glitch or non-optimal set-up, as we see much higher levels of IL-13-producing CD4 T cells when using higher doses of HDM to sensitise and challenge, between 7 and 20% in WT mice (see Author response image 2 of lung stimulated with PMA/ionomycin+Monensin; please note this is for illustration purposes only and is not linked to the current manuscript, it merely demonstrates a point from other experiments we have conducted in the lab).

      Author response image 2.

      In d-f, the authors perform a serum transfer, but they only do this once. The half life of IgM is quite short. The authors should perform multiple naïve serum transfers to see if this is enough to induce FULL AHR. 

      We thank the reviewer for this comment. We apologise if this was not clear enough in the Figure legend and methods: we did transfer serum 3x, a day before sensitisation, on the day of sensitisation and a day before the challenge, to circumvent the short half-life of IgM. In our subsequent experiments, we have now used busulfan to deplete all bone marrow in IgM-deficient mice and replace it with WT bone marrow, and this method restores AHR (Figure 3B).

      This now appears in line 515 to 519 and reads

      Adoptive transfer of naïve serum

      Naïve wild-type mice were euthanised and blood was collected via cardiac puncture before being spun down (5500rpm, 10min, RT) to collect serum. Serum (200µL) was injected intraperitoneally into IgM-deficient mice. Serum was injected intraperitoneally at day -1, 0, and a day before the challenge with HDM (day 10).

      The presence of negative values of total IgE in panel F would indicate some errors in calculation of serum IgE concentrations. 

      We thank the reviewer for this observation. For better clarity, we have now indicated these values as undetected in Figure 2F, as they were below our detection limit.

      Overall, it is hard to be convinced that IgM-deficiency does not lead to a reduction in Th2 inflammation, since the assays appear suboptimal. 

      We disagree with the reviewer in this instance, because we have shown in 3 different models, in 2 different strains and at 2 doses of HDM (high and low) that, no matter what you do, Th2 remains intact. Our reason for choosing low dose HDM was based on our previous work and that of others, which showed that depending on antigen load, B cells can either be redundant or have functional roles. Since our interest was to tease out the role of B cells and specifically IgM, it was important that we look at a scenario where B cells are known to have a function (low antigen load). We did find similar findings at high dose of HDM load: effects on AHR were not as strong, but Th2 was not changed; in fact, in some instances Th2 was higher in IgM-deficient mice.

      Fig. 3. Gene expression differences between WT and KO mice in PBS and HDM challenged settings are shown. PCA analysis does not show clear differences between all four groups, but genes are certainly up and downregulated, in particular when comparing PBS to HDM challenged mice. In both PBS and HDM challenged settings, three genes stand out as being upregulated in WT v KO mice. These are Baiap2l1, Erdr1 and Chil1.

      Noted

      Fig. 4. The authors attempt to quantify BAIAP2L1 in mouse lungs. It is difficult to know if the antibody used really detects the correct protein. A BAIAP2L1-KO is not used as a control for staining, and I am not sure if competitive assays for BAIAP2L1 can be set up. The flow data is not convincing. The immunohistochemistry shows BAIAP2L1 (in red) in many, many cells, essentially throughout the section. There is also no discernible difference between WT and KO mice, which one might have expected based on the RNA-Seq data. So, from my perspective, it is hard to say if/where this protein is located, and whether there truly exists a difference in expression between wt and ko mice.

      We thank the reviewer for this comment. We are certain that the antibody does detect BAIAP2L1; we have used it in 3 assays, which we admit may show varying specificities since it’s a polyclonal antibody. However, in our western blot (Figure 5A), the antibody detects a band at 56.7kDa, apart from what we think are isoforms. We agree that BAIAP2L1 is expressed by many cell types, including CD45+ cells and alpha smooth muscle negative cells, and we show this in our Figure 5 – figure supplement 1A and B. Where we think there is a difference in expression between WT and IgM-deficient mice is in alpha-smooth muscle-positive cells. We have tested antibodies from different companies (Proteintech and Abcam), and we find similar results. We do not have access to BAIAP2L1 KO mice, so to test specificity we have also used single stain controls with or without secondary antibody and isotype control, which show no binding in western blot and immunofluorescence assays, and fluorescence-minus-one antibody controls in flow cytometry. We are therefore convinced that the signal we are seeing is specific to BAIAP2L1.

      Here we have also added additional Flow cytometry images using anti-BAIAP2L1 (clone 25692-1-AP) from Proteintech

      Author response image 3.

      Figure similar to Figure 5C and Figure 5 -figure supplement 1A and B.

      Fig. 5 and 6. The authors use a single cell contractility assay to measure whether BAIAP2L1 and ERDR1 impact on bronchial smooth muscle cell contractility. I am not familiar with the assay, but it looks like an interesting way of analysing contractility at the single cell level.

      The authors state that targeting these two genes with Cas9gRNA reduces smooth muscle cell contractility, and the data presented for contractility supports this observation. However, the efficiency of Cas9-mediated deletion is very unclear. The authors present a PCR in supp fig 9c as evidence of gene deletion, but it is entirely unclear with what efficiency the gene has been deleted. One should use sequencing to confirm deletion. Moreover, if the antibody was truly working, one should be able to use the antibody used in Fig 4 to detect BAIAP2L1 levels in these cells. The authors do not appear to have tried this.

      We thank the reviewer for these observations. We are in the process of optimising this using new polyclonal BAIAP2L1 antibodies from other companies, since the one we have tried doesn’t seem to work well on human cells via western blot. We hope that, in our new version, we will be able to demonstrate this by immunofluorescence or western blot.

      Other impressions: 

      The paper is lacking a link between the deficiency of IgM and the effects on smooth muscle cell contraction.

      The levels of IL-13 and TNF in lavage of WT and IGMKO mice could be analysed. 

      We have measured Th2 cytokine IL-13 in BAL fluid and found no differences between IgM-deficient mice and WT mice challenged with HDM (Author response image 4 below). We could not detect TNF-alpha in the BAL fluid; it was below the detection limit.

      Figure legend. IL-13 levels are not changed in IgM-deficient mice in the lung. Bronchoalveolar lavage fluid in WT or IgM-deficient mice sensitised and challenged with HDM. TNF-a levels were below the detection limit.

      Author response image 4.

      Moreover, what is the impact of IgM itself on smooth muscle cells? In the Fig. 7 schematic, are the authors proposing a direct role for IgM on smooth muscle cells? Does IgM in cell culture media induce contraction of SMC? This could be tested and would be interesting, to my mind. 

      We thank the Reviewer for these comments. We are still trying to test this, unfortunately, we have experienced delays in getting reagents such as human IgM to South Africa. We hope that we will be able to add this in our subsequent versions of the article. We agree it is an interesting experiment to do even if not for this manuscript but for our general understanding of this interaction at least in an in vitro system.

      Reviewer #3 (Public Review): 

      Summary: 

      This paper by Sabelo et al. describes a new pathway by which lack of IgM in the mouse lowers bronchial hyperresponsiveness (BHR) in response to methacholine in several mouse models of allergic airway inflammation in Balb/c mice and C57/Bl6 mice. Strikingly, loss of IgM does not lead to less eosinophilic airway inflammation, Th2 cytokine production or mucus metaplasia, but to a selective loss of BHR. This occurs irrespective of the dose of allergen used. This was important to address since several prior models of HDM allergy have shown that the contribution of B cells to airway inflammation and BHR is dose dependent.

      After a description of the phenotype, the authors try to elucidate the mechanisms. There is no loss of B cells in these mice. However, there is a lack of class switching to IgE and IgG1, with a concomitant increase in IgD. Restoring immunoglobulins with transfer of naïve serum in IgM deficient mice leads to restoration of allergen-specific IgE and IgG1 responses, which is not really explained in the paper how this might work. There is also no restoration of IgM responses, and concomitantly, the phenotype of reduced BHR still holds when serum is given, leading authors to conclude that the mechanism is IgE and IgG1 independent. Wild type B cell transfer also does not restore IgM responses, due to lack of engraftment of the B cells. Next authors do whole lung RNA sequencing and pinpoint reduced BAIAP2L1 mRNA as the culprit of the phenotype of IgM-/- mice. However, this cannot be validated fully on protein levels and immunohistology since differences between WT and IgM KO are not statistically significant, and B cell and IgM restoration are impossible. The histology and flow cytometry seems to suggest that expression is mainly found in alpha smooth muscle positive cells, which could still be smooth muscle cells or myofibroblasts. Next therefore, the authors move to CRISPR knock down of BAIAP2L1 in a human smooth muscle cell line, and show that loss leads to less contraction of these cells in vitro in a microscopic FLECS assay, in which smooth muscle cells bind to elastomeric contractible surfaces.

      Strengths: 

      (1) There is a strong reduction in BHR in IgM-deficient mice, without alterations in B cell number, disconnected from effects on eosinophilia or Th2 cytokine production.

      (2) BAIAP2L1 has never been linked to asthma in mice or humans 

      Weaknesses: 

      (1) While the observations of reduced BHR in IgM deficient mice are strong, there is insufficient mechanistic underpinning on how loss of IgM could lead to reduced expression of BAIAP2L1. Since it is impossible to restore IgM levels by either serum or B cell transfer and since protein levels of BAIAP2L1 are not significantly reduced, there is a lack of a causal relationship that this is the explanation for the lack of BHR in IgM-deficient mice. The reader is unclear if there is a fundamental (maybe developmental) difference in non-hematopoietic cells in these IgM-deficient mice (which might have accumulated another genetic mutation over the years). In this regard, it would be important to know if littermates were newly generated, or historically bred along with the KO line.

      We thank the reviewer for asking this question and getting us to think of this in a different way. This prompted us to use a different method to try and restore IgM function and, since our animal facility no longer allows irradiation, we opted for busulfan. We present this as new data in Figure 3. We had to go back and breed this strain and then generate bone marrow chimeras. What we have now shown with chimeras is that we can deplete bone marrow from IgM-deficient mice and replace it with congenic WT bone marrow when we allow these mice to rest for 2 months before challenge with HDM (Figure 3 -figure supplement 1A-C). We also show that AHR (resistance and elastance) is partially restored in this way (Figure 3A and B), as mice that receive congenic WT bone marrow after chemical irradiation can mount AHR and those that receive IgM-deficient bone marrow can’t mount AHR upon challenge with HDM. If the mice had accumulated an unknown genetic mutation in non-hematopoietic cells, the transfer of WT bone marrow would not make a difference. So, we don’t believe the colony could have gained a mutation that we are unaware of. We have also shipped these mice to other groups and, in their hands, this strain still behaves only as an IgM-only knockout mouse. See their publication below.

      Mark Noviski, James L Mueller, Anne Satterthwaite, Lee Ann Garrett-Sinha, Frank Brombacher, Julie Zikherman 2018. IgM and IgD B cell receptors differentially respond to endogenous antigens and control B cell fate. eLife 2018;7:e35074. DOI: https://doi.org/10.7554/eLife.35074

      We have also added methods for bone marrow chimaeras, together with results sections and new Figures related to these methods.

      Methods appear in line 521-532 of the untracked version of the article.

      Busulfan Bone marrow chimeras

      WT (CD45.2) and IgM<sup>-/-</sup> (CD45.2) congenic mice were treated with 25 mg/kg busulfan (Sigma-Aldrich, Aston Manor, South Africa) per day for 3 consecutive days (75 mg/kg in total) dissolved in 10% DMSO and Phosphate buffered saline (0.2mL, intraperitoneally) to ablate bone marrow cells. Twenty-four hours after last administration of busulfan, mice were injected intravenously with fresh bone marrow (10x10<sup>6</sup> cells, 100µL) isolated from hind leg femurs of either WT (CD45.1) or IgM<sup>-/-</sup> mice [33]. Animals were then allowed to complement their haematopoietic cells for 8 weeks. In some experiments the level of bone marrow ablation was assessed 4 days post-busulfan treatment in mice that did not receive donor cells. At the end of experiment level of complemented cells were also assessed in WT and IgM<sup>-/-</sup> mice that received WT (CD45.1) bone marrow.

      Results appear in line 198-228 of the untracked version of the article

      Replacement of IgM-deficient mice with functional hematopoietic cells in busulfan chimeric mice restores airway hyperresponsiveness.

      We then generated bone marrow chimeras by chemical radiation using busulfan (Montecino-Rodriguez and Dorshkind, 2020). We treated mice three times with busulfan for 3 consecutive days and after 24 hrs transferred naïve bone marrow from congenic CD45.1 WT mice or CD45.2 IgM KO mice (Figure 3A and Figure 3 -figure supplement 1A). We showed that recipient mice that did not receive donor bone marrow after 4 days post-treatment had significantly reduced lineage markers (CD45<sup>+</sup>Sca-1<sup>+</sup>) or lineage negative (Lin<sup>-</sup>) cells in the bone marrow when compared to untreated or vehicle (10% DMSO) treated mice (Figure 3 -figure supplements 1B-C). We allowed mice to reconstitute bone marrow for 8 weeks before sensitisation and challenge with low dose HDM (Figure 3A). We showed that WT (CD45.2) recipient mice that received WT (CD45.1) donor bone marrow had higher airway resistance and elastance and this was comparable to IgM KO (CD45.2) recipient mice that received donor WT (CD45.1) bone marrow (Figure 3B). As expected, IgM KO (CD45.2) recipient mice that received donor IgM KO (CD45.2) bone marrow had significantly lower AHR compared to WT (CD45.2) or IgM KO (CD45.2) recipient mice that received WT (CD45.1) bone marrow (Figure 3B). We confirmed that the differences observed were not due to differences in bone marrow reconstitution as we saw similar frequencies of CD45.1 cells within the lymphocyte populations in the lungs and other tissues (Figure 3 -figure supplement 1D). We observed no significant changes in the lung neutrophils, eosinophils, inflammatory macrophages, CD4 T cells or B cells in WT or IgM KO (CD45.2) recipient mice that received donor WT (CD45.1/CD45.2) or IgM KO (CD45.2) bone marrow when sensitised and challenged with low dose HDM (Figure 3C).

      Restoring IgM function through adoptive reconstitution with congenic CD45.1 bone marrow in non-chemically irradiated recipient mice or sorted B cells into IgM KO mice (Figure 2 -figure supplement 1A) did not replenish IgM B cells to levels observed in WT mice and as a result did not restore AHR, total IgE and IgM in these mice (Figure 2 -figure supplements 1B-C). 

      The 2 new figures are Figure 3, which moved the rest of the Figures down, and Figure 3 -figure supplement 1A-D, which also moved the rest of the supplementary figures down.

      Discussion appears in line 410-419 of the untracked version of the article.

      To resolve other endogenous factors that could have potentially influenced reduced AHR in IgM-deficient mice, we resorted to busulfan chemical irradiation to deplete bone marrow cells in IgM-deficient mice and replace bone marrow with WT bone marrow. While it is well accepted that busulfan chemical irradiation partially depletes bone marrow cells, in our case it was not possible to pursue other irradiation methods due to changes in ethical regulations and the fact that mice are slow to recover after gamma-ray irradiation. Busulfan chemical irradiation allowed us to show that we could mostly restore AHR in IgM-deficient recipient mice that received donor WT bone marrow when challenged with low dose HDM.

      (2) There is no mention of the potential role of complement in activation of AHR, which might be altered in IgM-deficient mice   

      We thank the reviewer for this comment. We have not directly looked at complement in this instance; however, in our previous work, C3 knockout mice showed AHR comparable to that of WT mice under HDM challenge.

      (3) What is the contribution of elevated IgD in the phenotype of the IgM-deficient mice. It has been described by this group that IgD levels are clearly elevated 

      We thank the reviewer for this question. We believe that IgD is essentially what drives partial class switching to IgG. It has been shown for VSV virus, Trypanosoma congolense and Trypanosoma brucei brucei that elevated IgD drives delayed but effective IgG in the absence of IgM (Lutz et al, 2001, Nature). This is also supported by the Noviski et al., 2018 eLife study, which shows that IgM and IgD share some endogenous antigens, so it is likely that external antigens can activate IgD in a similar manner to prompt class switching.

      (4) How can transfer of naïve serum in class switching deficient IgM KO mice lead to restoration of allergen specific IgE and IgG1? 

      We thank the Reviewer for these comments. We believe that naïve serum transferred to IgM-deficient mice is able to bind to the surface of B cells via IgM receptors (FcμR / Fcα/μR), which are still present on B cells, and this is sufficient to facilitate class switching. Our IgM KO mouse lacks both membrane-bound and secreted IgM, and transferred serum contains at least secreted IgM, which can bind to surfaces via its Fc portion. We measured HDM-specific IgE and found very low levels, but these were not different between WT and IgM KO mice adoptively transferred with WT serum. We also detected HDM-specific IgG1 in IgM KO mice transferred with WT sera at the same level as WT, suggesting possible class switching; of course, we can’t rule out that transferred sera also contain some IgG1. We also can’t rule out that elevated IgD levels are partially responsible for class-switched IgG1, as discussed above.

      In the discussion line 463-464, we also added the following

      “We speculate that IgM can directly activate smooth muscle cells by binding a number of its surface receptors including FcμR, Fcα/μR and pIgR (Liu et al., 2019; Nguyen et al., 2017b; Shibuya et al., 2000). IgM binds to FcμR strictly, but shares Fcα/μR and pIgR with IgA (Liu et al., 2019; Michaud et al., 2020; Nguyen et al., 2017b). Both Fcα/μR and pIgR can be expressed by non-structural cells at mucosal sites (Kim et al., 2014; Liu et al., 2019). We would not rule out that the mechanisms of muscle contraction might be through one of these IgM receptors, especially the ones expressed on smooth muscle cells (Kim et al., 2014; Liu et al., 2019). Certainly, our future studies will be directed towards characterizing the mechanism by which IgM potentially activates the smooth muscle.”

      We have discussed this in the Discussion section, lines 731 to 757. In addition, since we have now performed bone marrow chimaeras, we have further added the following to our discussion in lines 410-419.

      To resolve other endogenous factors that could have potentially influenced reduced AHR in IgM-deficient mice, we resorted to busulfan chemical irradiation to deplete bone marrow cells in IgM-deficient mice and replace bone marrow with WT bone marrow. While it is well accepted that busulfan chemical irradiation partially depletes bone marrow cells, in our case it was not possible to pursue other irradiation methods due to changes in ethical regulations and the fact that mice are slow to recover after gamma-ray irradiation. Busulfan chemical irradiation allowed us to show that we could mostly restore AHR in IgM-deficient recipient mice that received donor WT bone marrow when challenged with low dose HDM.

      We removed the following lines, after performing bone marrow chimaeras since this changed some aspects. 

      Our efforts to adoptively transfer wild-type bone marrow or sorted B cells into IgM-deficient mice were also largely unsuccessful partly due to poor engraftment of wild-type B cells into secondary lymphoid tissues. Natural secreted IgM is mainly produced by B1 cells in the peritoneal cavity, and it is likely that any transfer of B cells via bone marrow transfer would not be sufficient to restore soluble levels of IgM<sup>3,10</sup>.

      (5) Alpha smooth muscle antigen is also expressed by myofibroblasts. This is insufficiently worked out. The histology mentions "expression in cells in close contact with smooth muscle". This needs more detail since it is a very vague term. Is it in smooth muscle or in myofibroblasts?

      We appreciate that alpha-smooth muscle actin-positive cells are a small fraction in the lung and even within CD45 negative cells, but their contribution to airway hyperresponsiveness is major. We also concede that by immunofluorescence BAIAP2L1 seems to be expressed by cells adjacent to alpha-smooth muscle actin (Figure 5B), however, we know that cells close to smooth muscle (such as extracellular matrix and myofibroblasts) contribute to its hypertrophy in allergic asthma.

      James AL, Elliot JG, Jones RL, Carroll ML, Mauad T, Bai TR, et al. Airway Smooth Muscle Hypertrophy and Hyperplasia in Asthma. Am J Respir Crit Care Med [Internet]. 2012; 185:1058–64. Available from: https://doi.org/10.1164/rccm.201110-1849OC

      (6) Have polymorphisms in BAIAP2L1 ever been linked to human asthma? 

      No, we have looked in asthma GWAS studies, at least summary statistics and we have not seen any SNPs that could be associated with human asthma.

      (7) IgM deficient patients are at increased risk for asthma. This paper suggests the opposite. So the translational potential is unclear 

      We thank the reviewer for these comments. At the time of this publication, we have not made a concrete link with human disease. There is some anecdotal evidence of diseases such as autoimmune glomerulonephritis, Hashimoto’s thyroiditis, bronchial polyp, SLE, celiac disease and other diseases in people with low IgM. Allergic disorders are also common in people with IgM deficiency, as the reviewer correctly points out; other studies have reported rates as high as 33-47%. The mechanisms for the high incidence of allergic diseases are unclear as, generally, these patients have normal or higher IgG and IgE levels. IgM deficiency may represent a heterogeneous spectrum of genetic defects, which might explain the heterogeneous nature of disease presentations.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      In this study, the authors trained a variational autoencoder (VAE) to create a high-dimensional "voice latent space" (VLS) using extensive voice samples, and analyzed how this space corresponds to brain activity through fMRI studies focusing on the temporal voice areas (TVAs). Their analyses included encoding and decoding techniques, as well as representational similarity analysis (RSA), which showed that the VLS could effectively map onto and predict brain activity patterns, allowing for the reconstruction of voice stimuli that preserve key aspects of speaker identity.

      Strengths:

      This paper is well-written and easy to follow. Most of the methods and results were clearly described. The authors combined a variety of analytical methods in neuroimaging studies, including encoding, decoding, and RSA. In addition to commonly used DNN encoding analysis, the authors performed DNN decoding and resynthesized the stimuli using VAE decoders. Furthermore, in addition to machine learning classifiers, the authors also included human behavioral tests to evaluate the reconstruction performance.

      Weaknesses:

      This manuscript presents a variational autoencoder (VAE) to evaluate voice identity representations from brain recordings. However, the study's scope is limited by testing only one model, leaving unclear how generalizable or impactful the findings are. The preservation of identity-related information in the voice latent space (VLS) is expected, given the VAE model's design to reconstruct original vocal stimuli. Nonetheless, the study lacks a deeper investigation into what specific aspects of auditory coding these latent dimensions represent. The results in Figure 1c-e merely tested a very limited set of speech features. Moreover, there is no analysis of how these features and the whole VAE model perform in standard speech tasks like speech recognition or phoneme recognition. It is not clear what kind of computations the VAE model presented in this work is capable of. Inclusion of comparisons with state-of-the-art unsupervised or self-supervised speech models known for their alignment with auditory cortical responses, such as Wav2Vec2, HuBERT, and Whisper, would strengthen the validation of the VAE model and provide insights into its relative capabilities and limitations.

      The claim that the VLS outperforms a linear model (LIN) in decoding tasks does not significantly advance our understanding of the underlying brain representations. Given the complexity of auditory processing, it is unsurprising that a nonlinear model would outperform a simpler linear counterpart. The study could be improved by incorporating a comparative analysis with alternative models that differ in architecture, computational strategies, or training methods. Such comparisons could elucidate specific features or capabilities of the VLS, offering a more nuanced understanding of its effectiveness and the computational principles it embodies. This approach would allow the authors to test specific hypotheses about how different aspects of the model contribute to its performance, providing a clearer picture of the shared coding in VLS and the brain.

      The manuscript overlooks some crucial alternative explanations for the discriminant representation of vocal identity. For instance, the discriminant representation of vocal identity can be either a higher-level abstract representation or a lower-level coding of pitch height. Prior studies using fMRI and ECoG have identified both types of representation within the superior temporal gyrus (STG) (e.g., Tang et al., Science 2017; Feng et al., NeuroImage 2021). Additionally, the methodology does not clarify whether the stimuli from different speakers contained identical speech content. If the speech content varied across speakers, the approach of averaging trials to obtain a mean vector for each speaker (the "identity-based analysis") may not adequately control for confounding acoustic-phonetic features. Notably, the principal component 2 (PC2) in Figure 1b appears to correlate with absolute pitch height, suggesting that some aspects of the model's effectiveness might be attributed to simpler acoustic properties rather than complex identity-specific information.

      Methodologically, there are issues that warrant attention. In characterizing the autoencoder latent space, the authors initialized logistic regression classifiers 100 times and calculated the t-statistics using degrees of freedom (df) of 99. Given that logistic regression is a convex optimization problem typically converging to a global optimum, these multiple initializations of the classifier were likely not entirely independent. Consequently, the reported degrees of freedom and the effect size estimates might not accurately reflect the true variability and independence of the classifier outcomes. A more careful evaluation of these aspects is necessary to ensure the statistical robustness of the results.

      We thank Reviewer #1 for their thoughtful and constructive comments. Below, we address the key points raised:

      New comparative models. We agree there are still many open questions on the structure of the VLS and the specific aspects of auditory coding that its latent dimensions represent. The features tested in Figure 1c-e are not speech features, but aspects related to speaker identity: age, gender and unique identity. Nevertheless, we agree the VLS could be compared to recent speech models (not available when we started this project): we have now included comparisons with Wav2Vec and HuBERT in the encoding section (new Figure 2-S3). The comparison of encoding results based on LIN, the VLS, Wav2Vec and HuBERT (new Figure 2-S3) indicates no clear superiority of one model over the others; rather, different sets of voxels are better explained by the different models. Interestingly, all four models yielded their best encoding results for the mTVA and aTVA, indicating some consistency across models.

      On decoding directly from spectrograms. We have now added decoding results obtained directly from spectrograms, as requested in the private review. These are presented in the revised Figure 4, and allow for comparison with the LIN- and VLS-based reconstructions. As noted, spectrogram-based reconstructions sounded less vocal-like and faithful to the original, confirming that the latent spaces capture more abstract and cerebral-like voice representations.

      On the number and length of stimuli. The rationale for using a large number of brief, randomly spliced speech excerpts from different languages was to extract identity features independent of specific linguistic cues. Indeed, the PC2 could very well correlate with pitch; we were not able to extract reliable f0 information from the thousands of brief stimuli, many of which are largely inharmonic (e.g., fricatives), such that this assumption could not be tested empirically. But it would be relevant that the weight of PC2 correlates with pitch: although the average fundamental frequency of phonation is not a linguistic cue, it is a major acoustical feature differentiating speaker identities.

      Statistics correction. To address the issue of potential dependence between multiple runs of logistic regression, we replaced our previous analysis with a Wilcoxon signed-rank test comparing decoding accuracies to chance. The results remain significant across classifications, and the revised figure and text reflect this change.

      Reviewer #2 (Public Review):

      Summary:

      Lamothe et al. collected fMRI responses to many voice stimuli in 3 subjects. The authors trained two different autoencoders on voice audio samples and predicted latent space embeddings from the fMRI responses, allowing the voice spectrograms to be reconstructed. The degree to which reconstructions from different auditory ROIs correctly represented speaker identity, gender, or age was assessed by machine classification and human listener evaluations. Complementing this, the representational content was also assessed using representational similarity analysis. The results broadly concur with the notion that temporal voice areas are sensitive to different types of categorical voice information.

      Strengths:

      The single-subject approach that allows thousands of responses to unique stimuli to be recorded and analyzed is powerful. The idea of using this approach to probe cortical voice representations is strong and the experiment is technically solid.

      Weaknesses:

      The paper could benefit from more discussion of the assumptions behind the reconstruction analyses and the conclusions it allows. The authors write that reconstruction of a stimulus from brain responses represents 'a robust test of the adequacy of models of brain activity' (L138). I concur that stimulus reconstruction is useful for evaluating the nature of representations, but the notion that they can test the adequacy of the specific autoencoder presented here as a model of brain activity should be discussed at more length. Natural sounds are correlated in many feature dimensions and can therefore be summarized in several ways, and similar information can be read out from different model representations. Models trained to reconstruct natural stimuli can exploit many correlated features and it is quite possible that very different models based on different features can be used for similar reconstructions. Reconstructability does not by itself imply that the model is an accurate brain model. Non-linear networks trained on natural stimuli are arguably not tested in the same rigorous manner as models built to explicitly account for computations (they can generate predictions and experiments can be designed to test those predictions). While it is true that there is increasing evidence that neural network embeddings can predict brain data well, it is still a matter of debate whether good predictability by itself qualifies DNNs as 'plausible computational models for investigating brain processes' (L72). This concern is amplified in the context of decoding and naturalistic stimuli where many correlated features can be represented in many ways. It is unclear how much the results hinge on the specificities of the specific autoencoder architectures used. For instance, it would be useful to know the motivations for why the specific VAE used here should constitute a good model for probing neural voice representations.

      Relatedly, it is not clear how VAEs as generative models are motivated as computational models of voice representations in the brain. The task of voice areas in the brain is not to generate voice stimuli but to discriminate and extract information. The task of reconstructing an input spectrogram is perhaps useful for probing information content, but discriminative models, e.g., trained on the task of discriminating voices, would seem more obvious candidates. Why not include discriminatively trained models for comparison?

      The autoencoder learns a mapping from latent space to well-formed voice spectrograms. Regularized regression then learns a mapping between this latent space and activity space. All reconstructions might sound 'natural', which simply means that the autoencoder works. It would be good to have a stronger test of how close the reconstructions are to the original stimulus. For instance, is the reconstruction the closest stimulus to the original in latent space coordinates out of using the experimental stimuli, or where does it rank? How do small changes in beta amplitudes impact the reconstruction? The effective dimensionality of the activity space could be estimated, e.g. by PCA of the voice samples' contrast maps, and it could then be estimated how the main directions in the activity space map to differences in latent space. It would be good to get a better grasp of the granularity of information that can be decoded/ reconstructed.

      What can we make of the apparent trend that LIN is higher than VLS for identity classification (at least VLS does not outperform LIN)? A general argument of the paper seems to be that VLS is a better model of voice representations compared to LIN as a 'control' model. Then we would expect VLS to perform better on identity classification. The age and gender of a voice can likely be classified from many acoustic features that may not require dedicated voice processing.

      The RDM results reported are significant only for some subjects and in some ROIs. This presumably means that results are not significant in the other subjects. Yet, the authors assert general conclusions (e.g. the VLS better explains RDM in TVA than LIN). An assumption typically made in single-subject studies (with large amounts of data in individual subjects) is that the effects observed and reported in papers are robust in individual subjects. More than one subject is usually included to hint that this is the case. This is an intriguing approach. However, reports of effects that are statistically significant in some subjects and some ROIs are difficult to interpret. This, in my view, runs contrary to the logic and leverage of the single-subject approach. Reporting results that are only significant in 1 out of 3 subjects and inferring general conclusions from this seems less convincing.

      The first main finding is stated as being that '128 dimensions are sufficient to explain a sizeable portion of the brain activity' (L379). What qualifies this? From my understanding, only models of that dimensionality were tested. They explain a sizeable portion of brain activity, but it is difficult to follow what 'sizable' is without baseline models that estimate a prediction floor and ceiling. For instance, would autoencoders that reconstruct any spectrogram (not just voice) also predict a sizable portion of the measured activity? What happens to reconstruction results as the dimensionality is varied?

      A second main finding is stated as being that the 'VLS outperforms the LIN space' (L381). It seems correct that the VAE yields more natural-sounding reconstructions, but this is a technical feature of the chosen autoencoding approach. That the VLS yields a 'more brain-like representational space' I assume refers to the RDM results where the RDM correlations were mainly significant in one subject. For classification, the performance of features from the reconstructions (age/ gender/ identity) gives results that seem more mixed, and it seems difficult to draw a general conclusion about the VLS being better. It is not clear that this general claim is well supported.

      It is not clear why the RDM was not formed based on the 'stimulus GLM' betas. The 'identity GLM' is already biased towards identity and it would be stronger to show associations at the stimulus level.

      Multiple comparisons were performed across ROIs, models, subjects, and features in the classification analyses, but it is not clear how correction for these multiple comparisons was implemented in the statistical tests on classification accuracies.

      Risks of overfitting and bias are a recurrent challenge in stimulus reconstruction with fMRI. It would be good with more control analyses to ensure that this was not the case. For instance, how were the repeated test stimuli presented? Were they intermingled with the other stimuli used for training or presented in separate runs? If intermingled, then the training and test data would have been preprocessed together, which could compromise the test set. The reconstructions could be performed on responses from independent runs, preprocessed separately, as a control. This should include all preprocessing, for instance, estimating stimulus/identity GLMs on separately processed run pairs rather than across all runs. Also, it would be good to avoid detrending before GLM denoising (or at least testing its effects) as these can interact.

      We appreciate Reviewer #2’s careful reading and numerous suggestions for improving clarity and presentation. We have implemented the suggested text edits, corrected ambiguities, and clarified methodological details throughout the manuscript. In particular, we have toned down several sentences that we agree were making strong claims (L72, L118, L378, L380-381).

      Clarifications, corrections and additional information:

      We streamlined the introduction by reducing overly specific details and better framing the VLS concept before presenting specifics.

      Clarified the motivation for the age classification split and corrected several inaccuracies and ambiguities in the methods, including the hearing thresholds, balancing of category levels, and stimulus energy selection procedure.

      Provided additional information on the temporal structure of runs and experimental stimuli selection.

      Corrected the description of technical issues affecting one participant and ensured all acronyms are properly defined in the text and figure legends.

      Confirmed that audiograms were performed repeatedly to monitor hearing thresholds and clarified our use of robust scaling and normalization procedures.

      Regarding the test of RDM correlations, we clarified in the text that multiple comparisons were corrected using a permutation-based framework.

      Reviewer #3 (Public Review):

      Summary:

      In this manuscript, Lamothe et al. sought to identify the neural substrates of voice identity in the human brain by correlating fMRI recordings with the latent space of a variational autoencoder (VAE) trained on voice spectrograms. They used encoding and decoding models, and showed that the "voice" latent space (VLS) of the VAE performs, in general, (slightly) better than a linear autoencoder's latent space. Additionally, they showed dissociations in the encoding of voice identity across the temporal voice areas.

      Strengths:

      The geometry of the neural representations of voice identity has not been studied so far. Previous studies on the content of speech and faces in vision suggest that such geometry could exist. This study demonstrates this point systematically, leveraging a specifically trained variational autoencoder. 

      The size of the voice dataset and the length of the fMRI recordings ensure that the findings are robust.

      Weaknesses:

      Overall, the VLS is often only marginally better than the linear model across analysis, raising the question of whether the observed performance improvements are due to the higher number of parameters trained in the VAE, rather than the non-linearity itself. A fair comparison would necessitate that the number of parameters be maintained consistently across both models, at least as an additional verification step.

      The encoding and RSM results are quite different. This is unexpected, as similar embedding geometries between the VLS and the brain activations should be reflected by higher correlation values of the encoding model.

      The consistency across participants is not particularly high, for instance, S1 seemed to have demonstrated excellent performances, while S2 showed poor performance.

      An important control analysis would be to compare the decoding results with those obtained by a decoder operating directly on the latent spaces, in order to further highlight the interest of the non-linear transformations of the decoder model. Currently, it is unclear whether the non-linearity of the decoder improves the decoding performance, considering the poor resemblance between the VLS and brain-reconstructed spectrograms.

      We thank Reviewer #3 for their comments. In response:

      Code and preprocessed data are now available as indicated in the revised manuscript.

      While we appreciate the suggestion to display supplementary analyses as boxplots split by hemisphere, we opted to retain the current format as we do not have hypotheses regarding hemispheric lateralization, and the small sample size per hemisphere would preclude robust conclusions.

      Confirmed that the identities in Figure 3a are indeed ordered by age and have clarified this in the legend.

      The higher variance observed in correlations for the aTVA in Figure 3b reflects the small number of data points (3 participants × 2 hemispheres), and this is now explained.

      Regarding the cerebral encoding of gender and age, we acknowledge this interesting pattern. Prior work (e.g., Charest et al., 2013) found overlapping processing regions for voice gender without clear subregional differences in the TVAs. Evidence on voice age encoding remains sparse, and we highlight this novel finding in our discussion.

      We again thank the reviewers for their insightful comments, which have greatly improved the quality and clarity of our work.

      Reviewer #1 (Recommendations For The Authors):

      (1) A set of recent advances have shown that embeddings of unsupervised/self-supervised speech models align with auditory responses to speech in the temporal cortex (e.g. Wav2Vec2: Millet et al. NeurIPS 2022; HuBERT: Li et al. Nat Neurosci 2023; Whisper: Goldstein et al. bioRxiv 2023). These models are known to preserve a variety of speech information (phonetics, linguistic information, emotions, speaker identity, etc.) and perform well in a variety of downstream tasks. These other models should be evaluated or at least discussed in the study.

      We fully agree - the pace of progress in this area of voice technology has been incredible. Many of these models were not yet available at the time this work started so we could not use them in our comparison with cerebral representations.

      We have now implemented Reviewer #1’s suggestion and evaluated Wav2Vec and HuBERT. The results are presented in supplementary Figure 2-S3. Correlations between activity predicted by the model and the real activity were globally comparable with those obtained with the LIN and VLS models. Interestingly, both HuBERT and Wav2Vec yielded the highest correlations in the mTVA and, to a lesser extent, the aTVA, as did the LIN and VLS models.

      (2) The test statistics of the results in Fig 1c-e need to be revised. Given that logistic regression is a convex optimization problem typically converging to a global optimum, these multiple initializations of the classifier were likely not entirely independent. Consequently, the reported degrees of freedom and the effect size estimates might not accurately reflect the true variability and independence of the classifier outcomes. A more careful evaluation of these aspects is necessary to ensure the statistical robustness of the results. 

      We thank Reviewer #1 for pointing out this important issue regarding the potential dependence between multiple runs of the logistic regression model. To address this concern, we have revised our analyses and used a Wilcoxon signed-rank test to compare the decoding accuracy to chance level. The results showed that the accuracy was significantly above chance for all classifications (Wilcoxon signed-rank test, all W=15, p=0.03125). We updated Figure 1c-e and the corresponding text (L154-L155) to reflect the revised analysis. Because the focus of this section is to probe the informational content of the autoencoder’s latent spaces, and since there are only 5 decoding accuracy values per model, we dropped the inter-model statistical test.
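
      The reported numbers can be checked with a short worked example. The sketch below is a minimal, dependency-free illustration of an exact one-sided Wilcoxon signed-rank test against chance; the accuracy values and the chance level of 0.5 are hypothetical and are not taken from the manuscript. With 5 values that all exceed chance, the statistic is W = 1+2+3+4+5 = 15 and the smallest attainable exact one-sided p-value is 1/2^5 = 0.03125, which matches the figures quoted above. That the reported test was one-sided is an inference from these numbers, not something the response states.

```python
from itertools import product


def signed_rank_exact(values, chance):
    """Exact one-sided Wilcoxon signed-rank test of values > chance.

    Assumes no ties and no zero differences (true for the toy data below).
    """
    diffs = [v - chance for v in values]
    n = len(diffs)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    # W+ is the sum of the ranks that belong to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under H0 every rank is positive or negative with probability 1/2:
    # enumerate all 2^n sign patterns and count those with W+ at least as large.
    count = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(range(1, n + 1), signs) if s) >= w_plus
    )
    return w_plus, count / 2 ** n


# Hypothetical decoding accuracies (5 values, as in the response).
accuracies = [0.81, 0.78, 0.84, 0.79, 0.86]
w, p = signed_rank_exact(accuracies, chance=0.5)
print(w, p)  # 15 0.03125
```

      One consequence of this arithmetic is that, with only 5 values, p = 0.03125 is a floor: the test cannot distinguish a classification that is barely above chance from one that is far above it, as long as all 5 values exceed chance.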

      (3) In Line 198, the authors discuss the number of dimensions used in their models. To provide a comprehensive comparison, it would be informative to include direct decoding results from the original spectrograms alongside those from the VLS and LIN models. Given the vast diversity in vocal speech characteristics, it is plausible that the speaker identities might correlate with specific speech-related features also represented in both the auditory cortex and the VLS. Therefore, a clearer understanding of the original distribution of voice identities in the untransformed auditory space would be beneficial. This addition would help ascertain the extent to which transformations applied by the VLS or LIN models might be capturing or obscuring relevant auditory information.

      We have now implemented Reviewer #1’s suggestion. The graphs on the right panel b of revised Figure 4 now show decoding results obtained from the regression performed directly on the spectrograms, rather than on representations of them, for our two example test stimuli. They can be listened to and compared to the LIN- and VLS-based reconstructions in Supplementary Audio 2. Compared to the LIN and VLS, the SPEC-based reconstructions sounded much less vocal or similar to the original, indicating that the latent spaces indeed capture more abstract voice representations, more similar to cerebral ones.

      Reviewer #2 (Recommendations For The Authors): 

      L31: 'in voice' > consider rewording (from a voice?).

      L33: consider splitting sentence (after interactions). 

      L39: 'brain' after parentheses. 

      L45-: certainly DNNs 'as a powerful tool' extend to audio (not just image and video) beyond their use in brain models. 

      L52: listened to / heard. 

      L63: use second/s consistently. 

      L64: the reference to Figure 5D is maybe a bit confusing here in the introduction. 

      We thank Reviewer #2 for these recommendations, which we have implemented.

      L79-88: this section is formulated in a way that is too detailed for the introduction text (confusing to read). Consider a more general introduction to the VLS concept here and the details of this study later. 

      L99-: again, I think the experimental details are best saved for later. It's good to provide a feel for the analysis pipeline here, but some of the details provided (number of averages, denoising, preprocessing), are anyway too unspecific to allow the reader to fully follow the analysis. 

      Again, thank you for these suggestions for improving readability: we have modified the text accordingly.

      L159: what was the motivation for classifying age as a 2-class classification problem? Rather than more classes or continuous prediction? How did you choose the age split? 

      The motivation for the 2 age classes was to align on the gender classification task for better comparison. The cutoff (30 years) was not driven by any scientific consideration, but by practical ones, based on the median age in our stimulus set. This is now clarified in the manuscript (L149).

      L263: Is the test of RDM correlation>0 corrected for multiple comparisons across ROIs, subjects, and models?

      The test of RDM correlation>0 was indeed corrected for multiple comparisons for models using the permutation-based ‘maximum statistics’ framework for multiple comparison correction (described in Giordano et al., 2023 and Maris & Oostenveld, 2007). This framework was applied for each ROI and subject. It was described in the Methods (L745) but not clearly enough in the text—we thank Reviewer #2 and clarified it in the text (L246, L260-L261).
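
      As a rough illustration of the idea, a maximum-statistic permutation test can be sketched as follows. This is a simplified sketch under stated assumptions and not the procedure of the cited papers: the toy RDMs, the use of Pearson correlation, the permutation of condition labels, and all function names are hypothetical choices made for the example. For one ROI and one subject, the null distribution is built from the maximum correlation across models in each permutation, so each model's p-value is corrected for the number of models tested.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def upper(rdm, order):
    """Upper triangle of a symmetric RDM after relabelling conditions by `order`."""
    n = len(order)
    return [rdm[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def max_stat_pvalues(brain, models, n_perm=500, seed=0):
    """Return {model: (r, corrected p)} using the max-statistic null across models."""
    rng = random.Random(seed)
    ident = list(range(len(brain)))
    model_vecs = {name: upper(m, ident) for name, m in models.items()}
    brain_vec = upper(brain, ident)
    observed = {name: pearson(brain_vec, v) for name, v in model_vecs.items()}
    null_max = []
    for _ in range(n_perm):
        perm = ident[:]
        rng.shuffle(perm)  # shuffle condition labels of the brain RDM
        shuffled = upper(brain, perm)
        null_max.append(max(pearson(shuffled, v) for v in model_vecs.values()))
    return {
        name: (r, (1 + sum(m >= r for m in null_max)) / (n_perm + 1))
        for name, r in observed.items()
    }


def rdm_from(features):
    """Toy RDM: absolute difference between one-dimensional condition features."""
    return [[abs(a - b) for b in features] for a in features]


# Hypothetical data: one model matches the brain geometry, the other does not.
brain = rdm_from([0, 1, 3, 6, 10, 15, 21, 28])
models = {
    "matching": rdm_from([0, 1, 3, 6, 10, 15, 21, 28]),
    "unrelated": rdm_from([5, 2, 9, 1, 7, 3, 8, 0]),
}
results = max_stat_pvalues(brain, models)
for name, (r, p) in results.items():
    print(name, round(r, 3), round(p, 4))
```

      Because the framework was applied separately for each ROI and subject, a sketch like this corrects across models only; it does not by itself control for comparisons across ROIs or subjects.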

      L379: 'these stimuli' - weren't the experimental stimuli different from those used to train the V/AE? 

      We thank Reviewer #2 for spotting this issue. Indeed, the experimental stimuli are different from those used to train the models. We corrected the text to reflect this distinction (L84-L85).

      L443: what are 'technical issues' that prevented subject 3 from participating in 48 runs?? 

      We thank Reviewer #2 for pointing out the ambiguity in our previous statement. Participant 3 actually experienced personal health concerns that prevented them from completing the whole number of runs. We corrected this to provide a more accurate description (L442-L443).

      L444: participants were instructed to 'stay in the scanner'!? Do you mean 'stay still', or something? 

      We thank the Reviewer for spotting this forgotten word. We have corrected the passage (L444).

      L463: Hearing thresholds of 15 dB: do you mean that all had thresholds lower than 15 dB at all frequencies and at all repeated audiogram measurements? 

      We thank Reviewer #2 for spotting this error: we meant thresholds below 15dB HL. This has been corrected (L463). Indeed participants were submitted to several audiograms between fMRI sessions, to ensure no hearing loss could be caused by the scanner noise in these repeated sessions.

      L472: were the 4 category levels balanced across the dataset (in number of occurrences of each category combination)? 

      The dataset was fully balanced, with an equal number of samples for each combination of language, gender, age, and identity. Furthermore, to minimize potential adaptation effects, the stimuli were also balanced within each run according to these categories, and identity was balanced across sessions. We made this clearer in Main voice stimuli (L492-L496).

      L482: the test stimuli were selected as having high energy by the amplitude envelope. It is unclear what this means (how is the envelope extracted, what feature of it is used to measure 'high energy'?) 

      The selection of sounds with high energy was based on analyzing the amplitude envelope of each signal, which was extracted using the Hilbert transform and then filtered to refine the envelope. This envelope, which represents the signal's intensity over time, was used to measure the energy of each stimulus, and those that exceeded an arbitrary threshold were selected. From this pool of high-energy stimuli, likely including vowels, we selected six stimuli to be repeated during the scanning session, then reconstructed via decoding. This has been clarified in the text (L483-L484). 
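
      The selection procedure described above can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' code: the moving-average smoother stands in for the unspecified envelope filter, mean squared envelope stands in for the unspecified energy measure, and the threshold value and toy signals are hypothetical. In practice a library routine such as `scipy.signal.hilbert` would normally replace the naive transform used here.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (adequate for short toy signals)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def envelope(x):
    """Amplitude envelope: magnitude of the analytic signal (Hilbert transform)."""
    n = len(x)
    spectrum = dft(x)
    # Keep DC (and Nyquist for even n), double positive frequencies, drop negative ones.
    gain = [0.0] * n
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        positive = range(1, n // 2)
    else:
        positive = range(1, (n + 1) // 2)
    for k in positive:
        gain[k] = 2.0
    analytic = dft([s * g for s, g in zip(spectrum, gain)], inverse=True)
    return [abs(z) for z in analytic]


def smooth(x, width):
    """Moving-average low-pass filter (assumed stand-in for the envelope filter)."""
    half = width // 2
    out = []
    for i in range(len(x)):
        window = x[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def mean_energy(x, width=9):
    """Mean squared smoothed envelope (assumed energy measure)."""
    env = smooth(envelope(x), width)
    return sum(e * e for e in env) / len(env)


# Hypothetical stimuli: the same tone at two amplitudes.
n = 200
loud = [math.sin(2 * math.pi * 20 * t / n) for t in range(n)]
quiet = [0.1 * v for v in loud]
stimuli = {"loud": loud, "quiet": quiet}

THRESHOLD = 0.25  # arbitrary, as in the description above
selected = [name for name, sig in stimuli.items() if mean_energy(sig) > THRESHOLD]
print(selected)  # ['loud']
```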

      L500 was the audio filtered to account for the transfer function of the Sensimetrics headphones? 

      We did not perform any filtering, as the transfer function of the Sensimetrics is already very satisfactory as is. This has been clarified in the text (L503).

      L500: what does 'comfortable level' correspond to and was it set per session (i.e. did it vary across sessions)? 

      By comfortable we mean around 85 dB SPL. The audio settings were kept similar across sessions. This has been added to the text (L504).

      L526- does the normalization imply that the reconstructed spectrograms are normalized? Were the reconstructions then scaled to undo the normalization before inversion? 

      The paragraph on spectrogram standardization was not well placed, inducing confusion. We have moved this paragraph to a more suitable location, in the Deep learning section (L545-L550).

      L606: does the identity GLM model the denoised betas from the first GLM or simply the BOLD data? The text indicates the latter, but I suspect the former. 

      Indeed: this has been clarified (L601-L602).

      L704: could you unpack this a bit more? It is not easy to see why you specify the summing in the objective. Shouldn't this just be the ridge objective for a given voxel/ROI? Then you could just state it in matrix notation. 

      Thanks for pointing this out: we kept the formula unchanged but clarified the text, in particular specified that the voxel id is the ith index (L695).
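
      For readers following this exchange, the equivalence the reviewer alludes to can be written out. The notation below is assumed for illustration and is not taken from the manuscript: X is the matrix of predictors, y_i the response vector of the i-th voxel, w_i its weight vector, V the number of voxels, and λ a regularization strength shared across voxels. Under these assumptions the summed per-voxel ridge objective equals a single matrix objective, where Y and W collect the y_i and w_i as columns:

```latex
\sum_{i=1}^{V} \left( \lVert y_i - X w_i \rVert_2^2 + \lambda \lVert w_i \rVert_2^2 \right)
  = \lVert Y - X W \rVert_F^2 + \lambda \lVert W \rVert_F^2 ,
\qquad
\hat{W} = \left( X^{\top} X + \lambda I \right)^{-1} X^{\top} Y .
```

      The identity holds because the squared Frobenius norm is the sum of squared column norms. If λ were instead chosen separately for each voxel, the objective would no longer collapse into this single matrix form, which may be one reason to keep the explicit sum.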

      L716: you used robust scaling for the classifications in latent space but haven't mentioned scaling here. Are we to assume that the same applies?  

      Indeed we also used robust scaling here, this is now made clear (L710-L711).

      L720: Pearson correlation as a performance metric and its variance will depend on the choice of test/train split sizes. Can you show that the results generalize beyond your specific choices? Maybe report the explained variance as well to get a better idea of performance.

      We used a standard 80/20 split. We think it is beyond the scope of this study to examine the different possible choices of splits, and prefer not to spend additional time on this point which we think is relatively minor.

      Could you specify (somewhere) the stimulus timing in a run? ISI and stimulus duration are mentioned in different places, but it would be nice to have a summary of the temporal structure of runs.

      This is now clarified at the beginning of the Methods section (L437-441).

      Reviewer #3 (Recommendations For The Authors):

      Code and data are not currently available. 

      Code and preprocessed data are now available (L826-827).

      In the supplementary material, it would be beneficial to present the different analyses as boxplots, as in the main text, but with the ROIs in the left and right hemispheres separated, to better show potential hemispheric effect. Although this information is available in the Supplementary Tables, it is currently quite tedious to access it. 

      Although we provide the complete data split by hemisphere in the Tables, we do not believe it is relevant to illustrate left/right differences, as we do not have any hypotheses regarding hemispheric lateralization–and we would be underpowered in any case to test them with only three points by hemisphere.

      In Figure 3a, it might be beneficial to order the identities by age for each gender in order to more clearly illustrate the structure of the RDMs.

      The identities are indeed already ordered by increasing age: we now make this clear.

      In Figure 3b, the variance for the correlations for the aTVA is higher than in other regions, why? 

      Please note that the error bar indicates variance across only 6 data points (3 subjects x 2 hemispheres) such that some fluctuations are to be expected.

      Please make sure that all acronyms are defined, and that they are redefined in the figure legends. 

      This has been done.

      Gender and age are primarily encoded by different brain regions (Figure 5, pTVA vs aTVA). How does this finding compare with existing literature?

      This interesting finding was not expected. The cerebral processing of voice gender has been investigated by several groups including ours (Charest et al., 2013, Cerebral Cortex). Using an fMRI-adaptation design optimized using a continuous carry-over protocol and voice gender continua generated by morphing, we found that regions dealing with acoustical differences between voices of varying gender largely overlapped with the TVAs, without clear differentiation between the different subparts. Evidence for the role of the different TVAs in voice age processing remains scarce.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review):

      Summary:

      In this descriptive study, Tateishi et al. report a Tn-seq based analysis of genetic requirements for growth and fitness in 8 clinical strains of Mycobacterium intracellulare (Mi), and compare the findings with a type strain ATCC13950. The study finds a core set of 131 genes that are essential in all nine strains, and therefore are reasonably argued as potential drug targets. Multiple other genes required for fitness in clinical isolates have been found to be important for hypoxic growth in the type strain.

      Strengths:

      The study has generated a large volume of Tn-seq datasets of multiple clinical strains of Mi from multiple growth conditions, including from mouse lungs. The dataset can serve as an important resource for future studies on Mi, which despite being clinically significant remains a relatively understudied species of mycobacteria.

      Thank you for the comment on the significance of our manuscript on the basic research of non-tuberculous mycobacteria.

      Weaknesses:

      The primary claim of the study that the clinical strains are better adapted for hypoxic growth is yet to be comprehensively investigated. However, this reviewer thinks such an investigation would require a complex experimental design and perhaps forms an independent study.

      Thank you for the comment that the claim of better adaptation for hypoxic growth in the clinical strains has not been completely demonstrated. We agree with the reviewer’s comment that a comprehensive investigation of adaptation for hypoxic growth in the clinical strains should be a future project, given the complexity of the experimental design it would require.

      Reviewer #4 (Public review):

      Summary:

      In this study Tateishi et al. used TnSeq to identify 131 shared essential or growth defect-associated genes in eight clinical MAC-PD isolates and the type strain ATCC13950 of Mycobacterium intracellulare which are proposed as potential drug targets. Genes involved in gluconeogenesis and the type VII secretion system which are required for hypoxic pellicle-type biofilm formation in ATCC13950 also showed increased requirement in clinical strains under standard growth conditions. These findings were further confirmed in a mouse lung infection model.

      Strengths:

      This study has conducted TnSeq experiments in reference and 8 different clinical isolates of M. intracellulare thus producing large number of datasets which itself is a rare accomplishment and will greatly benefit the research community

      Thank you for the comment on the significance of our manuscript on the basic research of non-tuberculous mycobacteria.

      Weaknesses:

      (1) A comparative growth study of pure and mixed cultures of clinical and reference strains under hypoxia will be helpful in supporting the claim that clinical strains adapt better to such conditions. This should be mentioned as future directions in the discussion section along with testing the phenotype of individual knockout strains.

      Thank you for the comment on the idea of a comparative growth assay of pure and mixed cultures of clinical and reference strains under hypoxia. We appreciate the idea that showing the phenomenon of advantage of bacterial growth of the clinical strains under hypoxia in mixed culture with the ATCC strain would be important to strengthen the claim of better adaptation for hypoxic growth in the clinical strains. However, co-culture conditions introduce additional variables, including inter-strain competition or synergy, which can obscure the specific contributions of hypoxic adaptation in each strain. Therefore, we consider that our current approach using monoculture growth curves under defined oxygen conditions offers a clearer interpretation of strain-specific hypoxic responses.

      Following the comment, we have added the mention of the mixed culture experiment and the growth assay using individual knockout strains as future directions (page 35 lines 614-632 in the revised manuscript).

      “We have provided the data suggesting the preferential hypoxic adaptation in clinical strains compared to the ATCC type strain by the growth assay of individual strains. To strengthen our claim, several experiments are suggested including mixed culture experiments of clinical and reference strains under hypoxia. However, co-culture conditions introduce additional variables, including inter-strain competition or synergy, which can obscure the specific contributions of hypoxic adaptation in each strain. Therefore, we took the current approach using monoculture growth curves under defined oxygen conditions, which offers a clearer interpretation of strain-specific hypoxic responses. Furthermore, one of the limitations of this study is the lack of validation of TnSeq results with individual gene knockouts. Contrary to the case of Mtb, the technique of constructing knockout mutants of slow-growing NTM including M. intracellulare was not established for a long time. We have just recently succeeded in constructing the vector plasmids for making knockout mutants of M. intracellulare (Tateishi. Microbiol Immunol. 2024). Growth assay of individual knockout strains of genes showing increased genetic requirements such as pckA, glpX, csd, eccC5 and mycP5 in the clinical strains is suggested to provide the direct involvement of these genes on the preferential hypoxic adaptation in clinical strains. We have a future plan to construct knockout mutants of these genes to confirm the involvement of these genes on preferential hypoxic adaptation.”

      Reference

      Tateishi, Y., Nishiyama, A., Ozeki, Y. & Matsumoto, S. Construction of knockout mutants in Mycobacterium intracellulare ATCC13950 strain using a thermosensitive plasmid containing negative selection marker rpsL+. Microbiol Immunol 68, 339-347 (2024).

      (2) Authors should provide the quantitative value of read counts for classifying a gene as "essential" or "non-essential" or "growth-defect" or "growth-advantage". Merely mentioning "no insertions in all or most of their TA sites" or "unusually low read counts" or "unusually high read counts" is not clear.

      Thank you for the comment on the issue of not providing the quantitative value of read counts for classifying the gene essentiality. In this study, we used a Hidden Markov Model (HMM) to predict gene essentiality. The HMM does not classify gene essentiality solely by the number of read counts; it uses a probabilistic model to estimate the state at each TA site based on the read counts and consistency with adjacent sites (Ioerger. Methods Mol Biol 2022).

      The HMM uses consecutive data of read counts and calculates transition probability for predicting gene essentiality across the genome. The HMM allows for the clustering of insertion sites into distinct regions of essentiality across the entire genome in a statistically rigorous manner, while also allowing for the detection of growth-defect and growth-advantage regions. The HMM can smooth over individual outlier values (such as an isolated insertion in an otherwise empty region, or empty sites scattered among insertions in a non-essential region) and make a call for a region/gene that integrates information over multiple sites. The gene-level calls are made based on the majority call among the TA sites within each gene. The HMM automatically tunes its internal parameters (e.g. transition probabilities) to the characteristics of the input datasets (saturation and mean insertion counts) and can work over a broad range of saturation levels (as low as 20%) (DeJesus. BMC Bioinformatics 2013). Thus, the HMM can represent the more nuanced ways the growth of an organism might be affected by the disruption of its genes (https://orca1.tamu.edu/essentiality/Tn-HMM/index.html).

      Thus, the prediction of gene essentiality by the HMM does not rely on a quantitative threshold of Tn insertion reads applied independently at each TA site; rather, it is the most probable sequence of states for the whole sequence taken together (computed using the Viterbi algorithm). Among statistical methods, the HMM is a standard method for predicting gene essentiality in TnSeq (Ioerger TR. Methods Mol Biol. 2022), and a substantial number of TnSeq studies adopt it (Akusobi. mBio 2025, DeJesus. mBio 2017, Dragset. mSystems 2019, Mendum. BMC Genomics 2019). HMMs are also applied in many bioinformatics fields such as profiling functional protein families, identifying functional domains, sequence motif discovery and gene prediction.

      Taken together, we do not have the quantitative value of read counts for classifying gene essentiality by an HMM because the statistical methods for predicting gene essentiality do not uniquely use the quantitative value of read counts but use the transition of the read counts across the genome.
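
      As a toy illustration of why an HMM call cannot be reduced to a per-site read-count cutoff, the sketch below runs the Viterbi algorithm over read counts at consecutive TA sites. It is not the TRANSIT implementation: it uses only two states (essential and non-essential) instead of four, and the geometric emission parameters and the transition probability are values invented for illustration.

```python
import math

STATES = ("ES", "NE")  # essential, non-essential (a simplification of the four-state model)

def emission_logp(state, count):
    # Hypothetical geometric likelihoods: ES expects almost no reads, NE expects many.
    p = 0.9 if state == "ES" else 0.02
    return math.log(p) + count * math.log(1.0 - p)

def viterbi(counts, stay=0.99):
    """Return the most probable state sequence for read counts at consecutive TA sites."""
    log_stay, log_switch = math.log(stay), math.log(1.0 - stay)
    score = {s: math.log(0.5) + emission_logp(s, counts[0]) for s in STATES}
    back = []
    for c in counts[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor of state s, accounting for the transition cost.
            best_prev = max(
                STATES,
                key=lambda r: score[r] + (log_stay if r == s else log_switch),
            )
            trans = log_stay if best_prev == s else log_switch
            new_score[s] = score[best_prev] + trans + emission_logp(s, c)
            pointers[s] = best_prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical counts: an isolated insertion in an empty region, then an empty
# site inside a region with many insertions.
counts = [0, 0, 0, 3, 0, 0, 0, 120, 80, 0, 150, 95]
print(viterbi(counts))
```

      With these assumed parameters, the isolated 3-read site is still called essential and the empty site inside the well-covered region is still called non-essential, because each call depends on the neighbouring sites and not on a fixed read-count threshold.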

      Reference

      Ioerger TR. Analysis of Gene Essentiality from TnSeq Data Using Transit. Methods Mol Biol. 2022 ; 2377: 391–421. doi:10.1007/978-1-0716-1720-5_22.

      DeJesus MA, Ioerger TR (2013) A Hidden Markov Model for identifying essential and growth-defect regions in bacterial genomes from transposon insertion sequencing data. BMC Bioinformatics 14:303 [PubMed: 24103077]

      Website by Ioerger: A Hidden Markov Model for identifying essential and growth-defect regions in bacterial genomes from transposon insertion sequencing data. https://orca1.tamu.edu/essentiality/Tn-HMM/index.html

      Akusobi, C. et al. Transposon-sequencing across multiple Mycobacterium abscessus isolates reveals significant functional genomic diversity among strains. mBio 6, e0337624 (2025).

      DeJesus, M.A. et al. Comprehensive essentiality analysis of the Mycobacterium tuberculosis genome via saturating transposon mutagenesis. mBio 8, e02133-16 (2017).

      Dragset, M.S., et al. Global assessment of Mycobacterium avium subsp. hominissuis genetic requirement for growth and virulence. mSystems 4, e00402-19 (2019).

      Mendum T.A., et al. Transposon libraries identify novel Mycobacterium bovis BCG genes involved in the dynamic interactions required for BCG to persist during in vivo passage in cattle. BMC Genomics 20, 431 (2019).

      (3) One of the major limitations of this study is the lack of validation of TnSeq results with individual gene knockouts. Authors should mention this in the discussion section.

      Thank you for the comment on the issue of the lack of validation of TnSeq results by using individual knockout mutants. We agree that the lack of validation of TnSeq results is one of the limitations of this study. We have just recently succeeded in constructing the vector plasmids for making knockout mutants of M. intracellulare (Tateishi. Microbiol Immunol. 2024). We will proceed to the validation experiment of TnSeq-hit genes by constructing knockout mutants.

      Following the comment, we have added the description in the Discussion (page 35 lines 622-632 in the revised manuscript) as follows: “Furthermore, one of the limitations of this study is the lack of validation of TnSeq results with individual gene knockouts. Contrary to the case of Mtb, the technique of constructing knockout mutants of slow-growing NTM including M. intracellulare was not established for a long time. We have just recently succeeded in constructing the vector plasmids for making knockout mutants of M. intracellulare (Tateishi. Microbiol Immunol 2024). Growth assay of individual knockout strains of genes showing increased genetic requirements such as pckA, glpX, csd, eccC5 and mycP5 in the clinical strains is suggested to provide the direct involvement of these genes on the preferential hypoxic adaptation in clinical strains. We have a future plan to construct knockout mutants of these genes to confirm the involvement of these genes on preferential hypoxic adaptation.”

      Reference

      Tateishi, Y., Nishiyama, A., Ozeki, Y. & Matsumoto, S. Construction of knockout mutants in Mycobacterium intracellulare ATCC13950 strain using a thermosensitive plasmid containing negative selection marker rpsL+. Microbiol Immunol 68, 339-347 (2024).

      Reviewer #5 (Public review):

      Summary:

      In the research article, "Functional genomics reveals strain-specific genetic requirements conferring hypoxic growth in Mycobacterium intracellulare", Tateishi et al. focussed their research on pulmonary disease caused by Mycobacterium avium-intracellulare complex, which has recently become a major health concern. The authors were interested in identifying the genetic requirements necessary for growth/survival within the host and used hypoxia and biofilm conditions that partly replicate some of the stress conditions experienced by bacteria in vivo. An important finding of this analysis was the observation that genes involved in gluconeogenesis, type VII secretion system and cysteine desulphurase were crucial for the clinical isolates during standard culture while the same were necessary during hypoxia in the ATCC type strain.

      Strength of the study:

      Transposon mutagenesis has been a powerful genetic tool to identify essential genes/pathways necessary for bacteria under various in vitro stress conditions and for in vivo survival. The authors extended the TnSeq methodology not only to the ATCC strain but also to recent clinical isolates to identify the differences between the two categories of bacterial strains. Using this approach they dissected the similarities and differences in the genetic requirement for bacterial survival between ATCC type strains and clinical isolates. They observed that the clinical strains performed much better in terms of growth during hypoxia than the type strain. These in vitro findings were further extended to mouse infection models and similar outcomes were observed in vivo, further emphasising the relevance of hypoxic adaptation crucial for the clinical strains which could be explored as potential drug targets.

      Thank you for the comment on the significance of our manuscript on the basic research of non-tuberculous mycobacteria.

      Weakness:

      The authors have performed extensive TnSeq analysis but fail to present the data coherently. The data could have been better presented both in Figures and text. In my view this is one of the major weaknesses of the study.

      Thank you for the comment on the issue of data presentation. Our point-by-point response to the Reviewer’s comments is shown below.

      Reviewer #5 (Recommendations for the authors):

      Major comments:

      (1) The result section could have been better organized by splitting into multiple sections with each section focusing on a particular aspect.

      Thank you for the comment on the organization of the section. We have split into multiple sections with each section focusing on a particular aspect as follows:

      (1) Common essential and growth-defect-associated genes representing the genomic diversity of M. intracellulare strains (page 6 lines 102-103 in the revised manuscript)

      (2) The sharing of strain-dependent and accessory essential and growth-defect-associated genes with genes required for hypoxic pellicle formation in the type strain ATCC13950 (page 8 lines 129-131 in the revised manuscript)

      (3) Partial overlap of the genes showing increased genetic requirements in clinical MAC-PD strains with those required for hypoxic pellicle formation in the type strain ATCC13950 (page 9 lines 151-153 in the revised manuscript)

      (4) Minor role of gene duplication on reduced genetic requirements in clinical MAC-PD strains (page 11 lines 184-185 in the revised manuscript)

      (5) Identification of genes in the clinical MAC-PD strains required for mouse lung infection (page 12 lines 210-211 in the revised manuscript)

      (6) Effects of knockdown of universal essential or growth-defect-associated genes in clinical MAC-PD strains (page 17 lines 305-306 in the revised manuscript)

      (7) Differential effects of knockdown of accessory/strain-dependent essential or growth-defect-associated genes among clinical MAC-PD strains (page 19 lines 325-326 in the revised manuscript)

      (8) Preferential hypoxic adaptation of clinical MAC-PD strains evaluated with bacterial growth kinetics (page 21 lines 365-366 in the revised manuscript)

      (9) The pattern of hypoxic adaptation not simply determined by genotypes (page 22 line 386 in the revised manuscript)

      (2) The different strains that were used in the study, how they were isolated and some information on their genotypes could have been mentioned in brief in the main text and a table of different strains included as a supplementary table

      Thank you for the comment on the information on the clinically isolated strains used in this study. All clinical strains were isolated from sputum of MAC-PD patients (Tateishi. BMC Microbiol. 2021, BMC Microbiol. 2023). Sputum samples were treated by the standard method for clinical isolation of mycobacteria with 0.5% (w/v) N-acetyl-L-cysteine and 2% (w/v) sodium hydroxide and plated on 7H10/OADC agar plates. Single colonies were picked up for use in experiments as isolated strains.

      Following the comment, we have added the description on the information of the strains (page 37 lines 652-660 in the revised manuscript). “All eleven clinical strains from MAC-PD patients in Japan were isolated from sputum (Tateishi. BMC Microbiol 2021, BMC Microbiol 2023). Sputum samples were treated by the standard method for clinical isolation of mycobacteria with 0.5% (w/v) N-acetyl-L-cysteine and 2% (w/v) sodium hydroxide and plated on 7H10/OADC agar. Single colonies were picked up for use in experiments as isolated strains. Of these strains, ATCC13950, M.i.198, M.i.27, M018, M005 and M016 belong to the typical M. intracellulare (TMI) genotype and M001, M003, M019, M021 and MOTT64 belong to the M. paraintracellulare-M. indicus pranii (MP-MIP) genotype (Fig. 1, new Supplementary Table 1)”

      Moreover, we have added the Supplementary Table showing the information on genotypes of each strain and the purpose of the use of study strains as new Supplementary Table 1

      References

      Tateishi, Y. et al. Comparative genomic analysis of Mycobacterium intracellulare: implications for clinical taxonomic classification in pulmonary Mycobacterium avium-intracellulare complex disease. BMC Microbiol 21, 103 (2021).

      Tateishi, Y. et al. Virulence of Mycobacterium intracellulare clinical strains in a mouse model of lung infection - role of neutrophilic inflammation in disease severity. BMC Microbiol 23, 94 (2023).

      (3) As stated by the previous reviews, an explanation for the variation in the Tn insertion across different strains has not been provided and how they derive conclusions when the Tn frequency was not saturating.

      Thank you for the comment on how we predicted gene essentiality from our TnSeq data given the variation in Tn insertion reads across strains and the fact that the libraries did not reach full saturation.

      To address the variation in Tn insertion, we normalized the data by Beta-Geometric correction (BGC), a non-linear normalization method. BGC normalizes the datasets to fit an “ideal” geometric distribution with a variable probability parameter ρ, and improves resampling by reducing the skew. In the TRANSIT software, we set the replicate option to Sum to combine read counts. We then performed resampling analysis on the BGC-normalized datasets to compare the genetic requirements between strains.

      Following the comment, we have explained the variation in the Tn insertion across different strains in the manuscript (pages 39-40, lines 700-708 in the revised manuscript). “The number of Tn insertion in our datasets varied between 1.3 to 5.8 million among strains. To reduce the variation in the Tn insertion across strains, we adopt a non-linear normalization method, Beta-Geometric correction (BGC). BGC normalizes the datasets to fit an “ideal” geometric distribution with a variable probability parameter ρ, and BGC improves resampling by reducing the skew. On TRANSIT software, we set the replicate option as Sum to combine read counts. And we normalized the datasets by BGC and performed resampling analysis by using normalized datasets to compare the genetic requirements between strains.”

      As for the issue of saturation levels of Tn insertion in our Tn mutant libraries, we made a description in the Discussion in the 1st version of the revised manuscript (pages 33-35 lines 592-613 in the 2nd version of the revised manuscript). By combining replicates, the saturation of our Tn mutant libraries was 62-79%, as follows: ATCC13950: 67.6%, M001: 72.9%, M003: 63.0%, M018: 62.4%, M019: 74.5%, M.i.27: 76.6%, M.i.198: 68.0%, MOTT64: 77.6%, M021: 79.9%. That is, we calculated gene essentiality from the Tn mutant libraries with 62-79% saturation in each strain. The levels of saturation of transposon libraries in our study are similar to those in the very recent TnSeq analysis by Akusobi, in which 52-80% saturation libraries (so-called “high-density” transposon libraries) were used for HMM and resampling analyses (Supplemental Methods Table 1 [merged saturation] in Akusobi. mBio. 2025). The saturation of Tn insertion in individual replicates of our libraries is also comparable to that reported by DeJesus (Table S1 in mBio 2017). Thus, we consider that our TnSeq method of identifying essential genes and detecting the difference of genetic requirements between clinical MAC-PD strains and ATCC13950 is acceptable.
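
      The effect of combining replicates on saturation can be sketched as follows. Saturation is taken here to be the fraction of TA sites carrying at least one insertion read, and replicates are combined by summing read counts per site, in the spirit of the Sum replicate option mentioned above. The function names and the read counts are hypothetical and serve only as an illustration.

```python
def saturation(counts):
    """Fraction of TA sites carrying at least one insertion read."""
    return sum(1 for c in counts if c > 0) / len(counts)

def combine_by_sum(replicates):
    """Sum read counts site by site across replicates of equal length."""
    return [sum(site) for site in zip(*replicates)]

# Hypothetical read counts at ten TA sites in two replicates.
rep1 = [0, 12, 0, 5, 0, 0, 30, 0, 7, 0]
rep2 = [3, 0, 0, 9, 0, 14, 0, 0, 2, 0]
combined = combine_by_sum([rep1, rep2])

for name, counts in (("rep1", rep1), ("rep2", rep2), ("combined", combined)):
    print(f"{name}: saturation = {saturation(counts):.0%}")
```

      In this made-up example each replicate covers 40% of the TA sites, while the combined library covers 60%, because sites hit in only one replicate still count once the reads are summed.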

      As for the identification of essential or growth-defect-associated genes by HMM analysis, we do not consider that we seriously misclassified essentiality for most of the structural genes that encode proteins. This is because, as DeJesus shows, the number of essential genes identified by TnSeq is comparable between 2 and 14 TnSeq datasets for large genes possessing more than 10 TA sites, most of which seem to be structural genes (Supplementary Fig 2 in mBio 2017). If the reviewer regards our libraries as far less saturated because we used fewer replicates (n = 2 or 3) than the 10-14 replicates used in the previous reports by DeJesus and Rifat to obtain so-called “high-density” transposon libraries (DeJesus. mBio 2017, Rifat. mBio 2021), there is a possibility that not all genes could be detected as essential due to the incomplete coverage of Tn insertion at nonpermissive TA sites, especially in small genes including small regulatory RNAs. Even if this were the case, it would not detract from the findings of our current study.

      As for the identification of genetic requirements by a resampling analysis, we consider that our data is acceptable because we compared the normalized data between strains whose saturation levels are similar to the previous report by Akusobi with “high-density” transposon libraries as mentioned above.

      References

      DeJesus, M.A., Ambadipudi, C., Baker, R., Sassetti, C. & Ioerger, T.R. TRANSIT--A software tool for Himar1 TnSeq analysis. PLoS Comput Biol 11, e1004401 (2015).

      Akusobi, C. et al. Transposon-sequencing across multiple Mycobacterium abscessus isolates reveals significant functional genomic diversity among strains. mBio 6, e0337624 (2025).

      DeJesus, M.A. et al. Comprehensive essentiality analysis of the Mycobacterium tuberculosis genome via saturating transposon mutagenesis. mBio 8, e02133-16 (2017).

      Rifat, D., Chen L., Kreiswirth, B.N. & Nuermberger, E.L.. Genome-wide essentiality analysis of Mycobacterium abscessus by saturated transposon mutagenesis and deep sequencing. mBio 12, e0104921 (2021).

      (4) ATCC strain is missing in the mouse experiment.

      Thank you for the comment on the necessity of setting ATCC13950 as a control strain in the mouse TnSeq experiment. Setting ATCC13950 as a control strain in mouse infection experiments would be ideal. However, we have shown that ATCC13950 is eliminated within 4 weeks of infection in mice (Tateishi. BMC Microbiol 2023). To perform TnSeq, it is mathematically necessary to collect at least as many colonies as there are TA sites (realistically, more colonies than the number of TA sites are needed to produce biologically robust data). It is therefore impossible to perform an in vivo TnSeq study using ATCC13950, because a sufficient number of colonies cannot be harvested.

      To make these things understood clearly, we have added the description of being unable to perform in vivo TnSeq in ATCC13950 in the result section (page 13 lines 221-222 in the revised manuscript).

      “(It is impossible to perform TnSeq in lungs infected with ATCC13950 because ATCC13950 is eliminated within 4 weeks of infection) (Tateishi. BMC Microbiol 2023)”

      Reference

      Tateishi, Y. et al. Virulence of Mycobacterium intracellulare clinical strains in a mouse model of lung infection - role of neutrophilic inflammation in disease severity. BMC Microbiol 23, 94 (2023).

      (5) The viability assays done in 96 well plate may not be appropriate given that mycobacterial cultures often clump without vigorous shaking. How did they control evaporation for 10 days and above?

      Thank you for the comment on the issue of viability assay in terms of bacterial clumping. As described in the Methods (page 44 lines 778-781 in the revised manuscript), we mixed the 250 μL culture by pipetting 40 times to loosen clumps every time before sampling 4 μL for inoculation on agar plates to count CFUs. With this method, we did not observe macroscopic clumping or pellicles like those of Mtb or M. bovis BCG seen in static culture.

      We used inner wells for culture of bacteria in the hypoxic growth assay. To control evaporation of the culture, we filled the outer wells with distilled water and covered the plates with plastic lids. We cultured the plates with humidification at 37°C in the incubator.

      (6) Fig. 7a many time points have only two data points and in few cases. The Y axis could have been kept same for better comparison for all strains and conditions.

      Thank you for the comments on the data presentation of the hypoxic growth assay in original Fig. 7a (new Fig 8a). Many time points appear to have only two data points because the values of the individual replicates are very close. For example, the log10-transformed values of CFUs in ATCC13950 under aerobic culture are 4.716, 4.653, 4.698 at day 5, 4.949, 5.056, 4.954 at day 6, and 5.161, 5.190, 5.204 at day 8. We have added the numerical data of CFUs used for drawing growth curves as new Supplementary Table 19. Therefore, the data itself derives from three independent replicates.
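
      How close the replicate values quoted above are can be checked with a few lines of Python. This is only a convenience sketch using the three time points given in this response; the helper name `summarize` is hypothetical.

```python
from statistics import mean

# log10-transformed CFU values for ATCC13950 under aerobic culture, as quoted above.
log10_cfu = {
    5: [4.716, 4.653, 4.698],
    6: [4.949, 5.056, 4.954],
    8: [5.161, 5.190, 5.204],
}

def summarize(values):
    """Return the mean and the max-min range of one time point's replicates."""
    return mean(values), max(values) - min(values)

for day, values in log10_cfu.items():
    avg, spread = summarize(values)
    print(f"day {day}: mean = {avg:.3f}, range = {spread:.3f} log10 units")
```

      The replicate ranges at these time points stay at or below about 0.11 log10 units, which is small enough for markers to overlap on a growth-curve plot.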

      Following the comment, we have revised the data presentation in new Fig 8a (original Fig. 7a) by keeping the same maximal value of Y axis across all graphs. In addition, we have revised the legend to designate clearly how we obtained the data of growth curves as follows (page 63 lines 1107-1108 in the revised manuscript): “Data on the growth curves are the means of three biological replicates from one experiment. Data from one experiment representative of three independent experiments (N = 3) are shown.”

      (7) The relevance of 7b is not well discussed and a suitable explanation for the difference in the profiles of M001 and MOTT64 between aerobic and hypoxia is not provided. Data representation should be improved for 7c with appropriate spacing.

      Thank you for the comments on the relevance of original Fig. 7b (new Fig. 8b). In order to compare the pattern of logarithmic growth curves between strains quantitatively, we focused on time and slope at midpoint. The time at midpoint is the timing of entry to logarithmic growth phase. The earlier the strain enters logarithmic phase, the smaller the value of the time at midpoint becomes.

      The two strains belonging to the MP-MIP subgroup, MOTT64 and M001 showed similar time at midpoint under aerobic conditions. However, the time at midpoint was significantly different between MOTT64 and M001 under hypoxia, the latter showing great delay of timing of entry to logarithmic phase. In contrast to the majority of the clinical strains that showed reduced growth rate at midpoint under hypoxia, neither strain showed such phenomenon under hypoxia. Although the implication in clinical situations has not been proven, strains without slow growth under hypoxia may have different (possibly strain-specific) mechanisms of hypoxic adaptation corresponding to the growth phenotypes under hypoxia.
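
      The two quantities discussed here, time at midpoint and slope at midpoint, can be illustrated with a logistic growth curve. The curve model used for the fitting is not stated in this response, so the four-parameter logistic form below, the function names and all parameter values are assumptions made purely for illustration.

```python
import math

def logistic(t, lower, upper, t_mid, k):
    """Four-parameter logistic curve for log10 CFU as a function of time t (days)."""
    return lower + (upper - lower) / (1.0 + math.exp(-k * (t - t_mid)))

def slope_at_midpoint(lower, upper, k):
    """Analytic derivative of the logistic curve evaluated at t = t_mid."""
    return k * (upper - lower) / 4.0

# Hypothetical parameters: same plateau and steepness, later midpoint for the second strain.
strains = {
    "early": dict(lower=3.0, upper=7.0, t_mid=6.0, k=0.8),
    "late": dict(lower=3.0, upper=7.0, t_mid=10.0, k=0.8),
}
for name, p in strains.items():
    half = logistic(p["t_mid"], **p)
    slope = slope_at_midpoint(p["lower"], p["upper"], p["k"])
    print(f"{name}: t_mid = {p['t_mid']} d, value at midpoint = {half}, slope = {slope}")
```

      Under this assumed model, a strain that enters logarithmic growth later has a larger `t_mid`, while the slope at midpoint depends only on the steepness `k` and the distance between the lower and upper plateaus.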

      Following the comment, we have added the explanation on the difference in the profiles of M001 and MOTT64 between aerobic and hypoxia in the Discussion (page 31 lines 552-557, page 32 lines 562-567 in the revised manuscript). “The two strains belonging to the MP-MIP subgroup, MOTT64 and M001 showed similar time at midpoint under aerobic conditions. However, the time at midpoint was significantly different between MOTT64 and M001 under hypoxia, the latter showing great delay of timing of entry to logarithmic phase. In contrast to the majority of the clinical strains that showed slow growth at midpoint under hypoxia, neither strain showed such phenomenon.”

      “Our inability to construct knockdown strains in M001 and MOTT64 prevented us from clarifying the factors that discriminate against the pattern of hypoxic adaptation. Although the implication in clinical situations has not been proven, strains without slow growth under hypoxia may have different (possibly strain-specific) mechanisms of hypoxic adaptation corresponding to the growth phenotypes under hypoxia.”

      Following the comment, we have added space between new Fig. 8b and new Fig. 8c (original Fig. 7b and Fig. 7c).

      (8) Fig. 8a, the antibiotic sensitivity at early and later time points do not seem to correlate. Any explanation?

      Thank you for the comment on the lack of correlation in growth inhibition between early and later time points in knockdown strains of universal essential genes. The diminished growth inhibition observed at Day 7 in knockdown strains may be due to “escape” clones arising under long-term culture with anhydrotetracycline (aTc), which induces sgRNA. As described in the Methods (pages 42-43 lines 754-758), we added aTc repeatedly every 48 h to maintain the induction of dCas9 and sgRNAs in experiments that extended beyond 48 h (Singh. Nucl Acid Res 2016). Such a phenomenon has been reported by McNeil (Antimicrob Agent Chem. 2019), who showed an increase in CFUs by day 9 with 100 ng/mL aTc, with bacterial growth being detected between 2 and 3 weeks. These phenotypes of “escape” mutants are considered to be attributable to the promoter responsiveness to aTc.

      Nevertheless, except for gyrB in M.i.27, the effect of growth inhibition at Day 7 in knockdown strains of universal essential genes was 10⁻¹ or less of comparative growth rates of knockdown strains to vector control strains (y-axis of original Fig. 8). In this study, we judged the positive level of growth inhibition as 10⁻¹ or less of comparative growth rates of knockdown strains to vector control strains (y-axis of new Fig. 7). Thus, we consider that the CRISPR-i data overall validated the essentiality of these genes.

      References

      Singh A.K., et al. Investigating essential gene function in Mycobacterium tuberculosis using an efficient CRISPR interference system. Nucl Acid Res 44, e143 (2016).

      McNeil M.B. & Cook, G.M. Utilization of CRISPR interference to validate MmpL3 as a drug target in Mycobacterium tuberculosis. Antimicrob Agent Chem 63, e00629-19 (2019).

      (9) Fig. 8b and c very data representation could have been improved. Some strains used in 7 are missing. The authors refer to technical challenge with respect to M001. Is it the same for others as well (MOTT64). The interpretation of data in result and discussion section is difficult to follow. Is the data subjected to statistical analysis?

      Thank you for the comment on data presentation in original Fig. 8b (new Fig 7b). As mentioned in the Discussion (page 18 lines 316-31 in the revised manuscript), the reason M001 and MOTT64 are missing from the CRISPR-i experiment in original Fig. 7 (new Fig. 8) is that we were unable to construct the knockdown strains in M001 and MOTT64. We consider that M001 and MOTT64 pose the same technical challenges.

      Following the comment, we have added the explanation of the technical challenge with respect to M001 and MOTT64 in the Discussion (page 32 lines 561-566 in the revised manuscript). “Our inability to construct knockdown strains in M001 and MOTT64 prevented us from clarifying the factors that discriminate against the pattern of hypoxic adaptation. Although the implication in clinical situations has not been proven, strains without slow growth under hypoxia may have different (possibly strain-specific) mechanisms of hypoxic adaptation corresponding to the growth phenotypes under hypoxia.”

      As for the interpretation of growth suppression in knockdown experiments described in original Fig. 8 (new Fig. 7), we judged the positive level of growth inhibition as 10⁻¹ or less of comparative growth rates of knockdown strains to vector control strains (y-axis of new Fig. 7). We interpreted the results based on whether the level of growth inhibition was positive or not (i.e. whether or not the comparative growth rates of knockdown strains to vector control strains fell to 10⁻¹ or below). Since our aim was to investigate whether knockdown of the target genes in each strain leads to growth inhibition, we did not perform statistical analysis between strains or target genes.
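
      The decision rule described above can be written as a one-line check. This is a minimal sketch that assumes the comparative growth rate is the ratio of the knockdown strain's CFU count to that of the vector control; the function name and the example values are hypothetical.

```python
def is_growth_inhibited(knockdown_cfu, vector_cfu, threshold=1e-1):
    """True when knockdown growth relative to vector control is at or below the threshold."""
    return knockdown_cfu / vector_cfu <= threshold

# Hypothetical CFU counts for two knockdown strains against the same vector control.
print(is_growth_inhibited(2e5, 4e6))  # ratio 0.05
print(is_growth_inhibited(3e6, 4e6))  # ratio 0.75
```

      Under this rule the first example counts as positive growth inhibition and the second does not, without any statistical comparison between strains or target genes.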

      The major weakness of the study is the organization and data representation. It became very difficult to connect the role of gluconeogenesis, secretion system and others identified by authors to hypoxia, pellicle formation. The authors may consider rephrasing the results and discussion sections.

      Thank you for the comments on the issue of organization and data presentation. Following the comment, we have revised the manuscript to indicate the relevance of the role of gluconeogenesis, secretion system and others defined by us more clearly (page 23 lines 404-408 in the revised manuscript).

      “Because the profiles of genetic requirements reflect the adaptation to the environment in which bacteria habits, it is reasonable to assume that the increase of genetic requirements in hypoxia-related genes such as gluconeogenesis (pckA, glpX), type VII secretion system (mycP5, eccC5) and cysteine desulfurase (csd) play an important role on the growth under hypoxia-relevant conditions in vivo.”

      Following the comments, we have exchanged the order of data presentation as follows: in vitro TnSeq (pages 6-12 lines 102-208 in the revised manuscript), Mouse TnSeq (pages 12-17 lines 210-303 in the revised manuscript), Knockdown experiment (pages 17-21 lines 305-363 in the revised manuscript), Hypoxic growth assay (pages 21-23 lines 365-408 in the revised manuscript).

      In association with the exchange of the order of data presentation, we have changed the order of the contents of the Discussion as follows: Preferential carbohydrate metabolism under hypoxia such as pckA and glpX (pages 24-26 lines 424-466 in the revised manuscript), Cysteine desulfurase gene (csd) (pages 26-27 lines 467-482 in the revised manuscript), Conditional essential genes in vivo such as type VII secretion system (pages 27-28 lines 483-497 in the revised manuscript), Knockdown experiment (pages 28-30 lines 498-536 in the revised manuscript), Hypoxic growth pattern (pages 30-32 lines 537-571 in the revised manuscript), Failure of assay using PckA inhibitors (pages 32-33 lines 572-578 in the revised manuscript), Transformation efficiencies (page 33 lines 579-591 in the revised manuscript), Saturation of Tn insertion (pages 33-35 lines 592-613 in the revised manuscript), Suggested future experiment plan (pages 35-36 lines 614-632 in the revised manuscript).

    1. Author response:

      Reviewer #1 (Public review):

      (1) It might be good to further discuss potential molecular mechanisms for increasing the TF off rate (what happens at the mechanistic level). 

      This is now expanded in the Discussion.

      (2) To improve readability, it would be good to make consistent font sizes on all figures to make sure that the smallest font sizes are readable. 

      We have normalised figure text as much as is feasible.

      (3) upDARs and downDARs - these abbreviations are defined in the figure legend but not in the main text. 

      We have removed references to these terms from the text and included a definition in the figure legend. 

      (4) Figure 3B - the on-figure legend is a bit unclear; the text legend does not mention the meaning of "DEG". 

      We have removed this panel as it was confusing and did not demonstrate any robust conclusion. 

      (5) The values of apparent dissociation rates shown in Figure 5 are a bit different from values previously reported in literature (e.g., see Okamoto et al., 2023, PMC10505915). Perhaps the authors could comment on this. Also, it would be helpful to add the actual equation that was used for the curve fitting to determine these values to the Methods section. 

      We have included an explanation of the curve fitting equation in the Methods as suggested.

      The apparent dissociation rate observed is a sum of multiple rates of decay: the true dissociation rate (k<sub>off</sub>), signal loss caused by photobleaching (k<sub>pb</sub>), and signal loss caused by defocusing/tracking error (k<sub>tl</sub>).

      k<sub>off</sub><sup>app</sup> = k<sub>off</sub> + k<sub>pb</sub> + k<sub>tl</sub>

      We are making conclusions about relative changes in k<sub>off</sub><sup>app</sup> upon CHD4 depletion, not about the absolute magnitude of true k<sub>off</sub> or TF residence times. Our conclusions extend to true k<sub>off</sub> based on the assumption that k<sub>pb</sub> and k<sub>tl</sub> are equal across all samples imaged due to identical experimental conditions and analysis.

      k<sub>pb</sub> and k<sub>tl</sub> vary hugely across experimental set-ups, especially with different laser powers, so other k<sub>off</sub> or k<sub>off</sub><sup>app</sup> values reported in the literature would be expected to differ from ours. Time-lapse experiments or independent determination of k<sub>pb</sub> (and k<sub>tl</sub>) would be required to make any statements about absolute values of k<sub>off</sub>.
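      The argument about relative changes can be illustrated with a short numerical sketch. All rate values below are hypothetical, and the single-exponential survival curve is an assumption made for illustration; it is not necessarily the fitting equation given in the Methods.

```python
import math


def apparent_off_rate(k_off: float, k_pb: float, k_tl: float) -> float:
    """Apparent dissociation rate as the sum of the three decay rates."""
    return k_off + k_pb + k_tl


def surviving_fraction(t: float, k_app: float) -> float:
    """Fraction of tracks still observed at time t (assumed single exponential)."""
    return math.exp(-k_app * t)


# Hypothetical rates in 1/s; photobleaching and tracking loss are assumed
# identical in both conditions because imaging and analysis are identical.
k_pb, k_tl = 0.05, 0.02
k_off_control, k_off_depleted = 0.10, 0.25

k_app_control = apparent_off_rate(k_off_control, k_pb, k_tl)
k_app_depleted = apparent_off_rate(k_off_depleted, k_pb, k_tl)

# The shared terms cancel, so the change in the apparent rate equals the
# change in the true rate even though neither absolute value is known.
delta_apparent = k_app_depleted - k_app_control
delta_true = k_off_depleted - k_off_control
```

      If the photobleaching or tracking-loss rates differed between conditions, the two differences would no longer match, which is why absolute values from other imaging set-ups are not directly comparable.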

      (6) Regarding the discussion about the functionality of low-affinity sites/low accessibility regions, the authors may wish to mention the recent debates on this (https://www.nature.com/articles/s41586-025-08916-0; https://www.biorxiv.org/content/10.1101/2025.10.12.681120v1). 

      We have now included a discussion of this point and referenced both papers.

      (7) It may be worth expanding figure legends a bit, because the definitions of some of the terms mentioned on the figures are not very easy to find in the text. 

      We have endeavoured to define all relevant terms in the figure legends. 

      Reviewer #2 (Public review): 

      (1) Figure 2 shows heat maps of RNA-seq results following a time course of CHD4 depletion (0, 1, 2 hours...). Usually, the red/blue colour scale is used to visualise differential expression (fold-difference). Here, genes are coloured in red or blue even at the 0-hour time point. This confused me initially until I discovered that instead of fold-difference, a z-score is plotted. I do not quite understand what it means when a gene that is coloured blue at the 0-hour time point changes to red at a later time point. Does this always represent an upregulation? I think this figure requires a better explanation. 

      The heatmap displays z-scores, meaning expression for each gene has been centred and scaled across the entire time course. As a result, time zero is not a true baseline, it simply shows whether the gene’s expression at that moment is above or below its own mean. A transition from blue to red therefore indicates that the gene increases relative to its overall average, which typically corresponds to upregulation, but it doesn’t directly represent fold-change from the 0-hour time point. We have now included a brief explanation of this in the figure legend to make this point clear.  
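      A minimal sketch of per-gene z-scoring across a time course shows why the 0-hour value is not a baseline. The expression values are hypothetical, and the use of the population standard deviation is an assumption; the exact scaling used for the figure is not stated here.

```python
from statistics import mean, pstdev


def zscore_row(values):
    """Centre and scale one gene's expression across all time points."""
    mu = mean(values)
    sd = pstdev(values)  # population SD; an assumption for this sketch
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


# Hypothetical expression of one gene at successive time points.
expression = [2.0, 4.0, 6.0, 8.0]
z = zscore_row(expression)

# The first time point is below the gene's own mean (blue on the heat map)
# and the last is above it (red), although the raw values only ever increase.
first_is_below_mean = z[0] < 0
last_is_above_mean = z[-1] > 0
```

      A blue-to-red transition therefore reflects movement relative to the gene's own average across the series, not a fold-change against time zero.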

      (2) Figure 5D: NANOG, SOX2 binding at the KLF4 locus. The authors state that the enhancers 68, 57, and 55 show a gain in NANOG and SOX2 enrichment "from 30 minutes of CHD4 depletion". This is not obvious to me from looking at the figure. I can see an increase in signal from "WT" (I am assuming this corresponds to the 0 hours time point) to "30m", but then the signals seem to go down again towards the 4h time point. Can this be quantified? Can the authors discuss why TF binding seems to increase only temporarily (if this is the case)? 

      We have edited the text to more accurately reflect what is going on in the screen shot. We have also replaced “WT” with “0” as this more accurately reflects the status of these cells. 

      (3) There is no real discussion of HOW CHD4/NuRD counteracts TF binding (i.e. by what molecular mechanism). I understand that the data does not really inform us on this. Still, I believe it would be worthwhile for the authors to discuss some ideas, e.g., local nucleosome sliding vs. a direct (ATP-dependent?) action on the TF itself. 

      We now include more speculation on this point in the Discussion.

      Reviewer #3 (Public review): 

      The main weakness can be summarised as relating to the fact that authors interpret all rapid changes following CHD4 degradation as being a direct effect of the loss of CHD4 activity. The possibility that rapid indirect effects arise does not appear to have been given sufficient consideration. This is especially pertinent where effects are reported at sites where CHD4 occupancy is initially low. 

      We acknowledge that we cannot definitively say any effect is a direct consequence of CHD4 depletion and have mitigated statements in the Results and Discussion. 

      Reviewing Editor Comments: 

      I am pleased to say all three experts had very complementary and complimentary comments on your paper - congratulations. Reviewer 3 does suggest toning down a few interpretations, which I suggest would help focus the manuscript on its greater strengths. I encourage a quick revision to this point, which will not go back to reviewers, before you request a version of record. I would also like to take this opportunity to thank all three reviewers for excellent feedback on this paper. 

      As advised we have mitigated the points raised by the reviewers.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      Dong et al. present an in-depth analysis of mutant phenotypes of the Rab GTPases Rab5, Rab7, and Rab11 in Drosophila second-order olfactory neuron development. These three Rab GTPases are amongst the best-characterized Rab GTPases in eukaryotes and have been associated with major roles in early endosomes, late endosomes, and recycling endosomes, respectively. All three have been investigated in Drosophila neurons before; however, this study provides the most detailed characterization and comparison of mutant phenotypes for axonal and dendritic development of fly projection neurons to date. In addition, the authors provide excellent high-resolution data on the distribution of each of the three Rabs in developmental analyses.

      Strengths:

      The strength of the work lies in the detailed characterization and comparison of the different Rab mutants on projection neuron development, with clear differences for the three Rabs and by inference for the early, late, and recycling endosomal functions executed by each.

      We would like to thank Reviewer #1 for their appreciation of our characterization of distinct Rab mutants.

      Weaknesses:

      Some weakness derives from the fact that Rab5, Rab7, and Rab11 are, as acknowledged by the authors, somewhat pleiotropic, and their actual roles in projection neuron development are not addressed beyond the characterization of (mostly adult) mutant phenotypes and developmental expression.

      Prior to mid-pupal stage (around 48 hours after puparium formation), glomeruli in the antennal lobe have not yet assumed their stereotyped positions, which complicates analyses and interpretation; thus, many of our analyses are conducted at the adult stage. For Rab11 mutants we did perform many developmental analyses to evaluate the origins of the axonal development (Figure 6—figure supplement 1) and dendrite elaboration phenotypes (Figure 5 J–L) we observed at the adult stage. We realize that the developmental axonal analyses are in supplemental material where they could be missed. Given the reviewer’s comments, we will move these data to the main figures.

      Further, we will extend our Rab5 analyses to evaluate the function of this protein during development in experiments we will add to the revised manuscript.

      Reviewer #2 (Public review):

      Summary:

      This study by Dong et al. characterizes the roles of highly-expressed Rab GTPases Rab5, Rab7, and Rab11 in the development and wiring of olfactory projection neurons in Drosophila. This convincing descriptive study provides complementary approaches to Rab expression and localization profiling, conventional dominant-negative mutants, and clonal loss-of-function mutants to address the roles of different endosomal trafficking pathways across circuit development. They show distinct distributions and phenotypes for different Rabs. Overall, the study sets the stage for future mechanistic studies in this well-defined central neuron.

      We appreciate Reviewer #2’s analysis of our work and thank them for their suggestions to improve the clarity of our manuscript.

      Strengths:

      Beautiful imaging in central neurons demonstrates differential roles of 3 key Rab proteins in neuronal morphogenesis, as well as interesting patterns of subcellular endosome distribution. These descriptions will be critical for future mechanistic studies. The cell biology is well-written and explanatory, very accessible to a wide audience without sacrificing technical accuracy.

      Weaknesses:

      The Drosophila manipulations require more explanation in the main text to reach a wide audience.

      In our revised manuscript we will clarify the fly-specific manipulations and terminology to make our work more accessible to a broader audience.  

      Reviewer #3 (Public review):

      Summary:

      The authors aimed at a comprehensive phenotypic characterization of the roles of all Rab proteins expressed in PN neurons in the developing Drosophila olfactory system. Important data are shown for a number of these Rabs with small/no phenotypes (in the Supplements) as well as the main endosomal Rabs, Rab5, 7, and 11 in the main figures.

      We appreciate Reviewer #3’s assessment and appreciation of our work.

      Strengths:

      The mosaic analysis is a great strength, allowing visualization of small clones or single neuron morphologies. This also allows some assessment of the cell autonomy of the observed phenotypes. The impact of the work lies in the comprehensiveness of the experiments. The rescue experiments are a strength.

      Weaknesses:

      The main weakness is that the experiments do not address the mechanisms that are affected by the loss of these Rab proteins, especially in terms of the most significant cargos. The insights thus do not extend far beyond what is already known from other work in many systems.

      We understand this critique and are also interested in the specific cargos regulated by each Rab during development. We attempted to use antibodies to evaluate changes in cell-surface protein localization in response to disrupting individual Rabs but were unable to reliably distinguish shifts in association with specific endosomal compartments. Many available antibodies label cell-surface proteins expressed in antennal lobe cells beyond projection neurons (such as olfactory receptor neurons, glia, or local interneurons) which complicates analyses. Further, although we have produced multiple ‘flp-on’ tags for PN cell-surface proteins, they cannot be used with the MARCM system. This prevents us from simultaneously perturbing individual Rabs and tracking corresponding changes in surface-protein localization with single cell resolution. Moreover, for proteins that are not highly endocytosed, it is difficult to separate plasma-membrane from endosomal localization, and we currently do not know which cell-surface proteins are most robustly endocytosed. Thus, while we share the reviewer’s interest in identifying candidate cargos, technological limitations make it difficult to achieve this goal within the scope of the current study.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary

      This work provides important new evidence of the cognitive and neural mechanisms that give rise to feelings of shame and guilt, as well as their transformation into compensatory behavior. The authors use a well-designed interpersonal task to manipulate responsibility and harm, eliciting varying levels of shame and guilt in participants. The study combines behavioral, computational, and neuroimaging approaches to offer a comprehensive account of how these emotions are experienced and acted upon. Notably, the findings reveal distinct patterns in how harm and responsibility contribute to guilt and shame and how these factors are integrated into compensatory decision-making.

      Strengths

      (1) Investigating both guilt and shame in a single experimental framework allows for a direct comparison of their behavioral and neural effects while minimizing confounds.

      (2) The study provides a novel contribution to the literature by exploring the neural bases underlying the conversion of shame into behavior.

      (3) The task is creative and ecologically valid, simulating a realistic social situation while retaining experimental control.

      (4) Computational modeling and fMRI analysis yield converging evidence for a quotient-based integration of harm and responsibility in guiding compensatory behavior.

      We are grateful for your thoughtful summary of our work’s strengths and greatly appreciate these positive words.

      We would like to note that, in accordance with the journal’s requirements, we have uploaded both a clean version of the revised manuscript and a version with all modifications highlighted in blue.

      Weakness

      (1) Post-experimental self-reports rely both on memory and on the understanding of the conceptual difference between the two emotions. Additionally, it is unclear whether the 16 scenarios were presented in random order; sequential presentation could have introduced contrast effects or demand characteristics.

      Thank you for pointing out the two limitations of the experimental paradigm. We fully agree with your point. Participants recalled and reported their feelings of guilt and shame immediately after completing the task, which likely ensured reasonably accurate state reports. We acknowledge, however, that in-task assessments might provide greater precision. We opted against them to examine altruistic decision-making in a more natural context, as in-task assessments could have heightened participants’ awareness of guilt and shame and biased their altruistic decisions. Post-task assessments also reduced fMRI scanning time, minimizing discomfort from prolonged immobility and thereby preserving data quality.

      In the present study, assessing guilt and shame required participants to distinguish conceptually between the two emotions. Most research with adult participants has adopted this approach, relying on direct self-reports of emotional intensity under the assumption that adults can differentiate between guilt and shame (Michl et al., 2014; Wagner et al., 2011; Zhu et al., 2019). However, we acknowledge that this approach may be less suitable for studies involving children, who may not yet have a clear understanding of the distinction between guilt and shame.

      The limitations have been added into the Discussion section (Page 47): “This research has several limitations. First, post-task assessments of guilt and shame, unlike in-task assessments, rely on memory and may thus be less precise, although in-task assessments could have heightened participants’ awareness of these emotions and biased their decisions. Second, our measures of guilt and shame depend on participants’ conceptual understanding of the two emotions. While this is common practice in studies with adult participants (Michl et al., 2014; Wagner et al., 2011; Zhu et al., 2019), it may be less appropriate for research involving children.”

      We apologize for the confusion. The 16 scenarios were presented in a random order. We have clarified this in the revised manuscript (Page 13): “After the interpersonal game, the outcomes of the experimental trials were re-presented in a random order.”

      (2) In the neural analysis of emotion sensitivity, the authors identify brain regions correlated with responsibility-driven shame sensitivity and then use those brain regions as masks to test whether they were more involved in the responsibility-driven shame sensitivity than the other types of emotion sensitivity. I wonder if this is biasing the results. Would it be better to use a cross-validation approach? A similar issue might arise in "Activation analysis (neural basis of compensatory sensitivity)." 

      Thank you for this valuable comment. We replaced the original analyses with a leave-one-subject-out (LOSO) cross-validation approach, which minimizes bias in secondary tests due to non-independence (Esterman et al., 2010). The findings were largely consistent with the original results, except that two previously significant effects became marginally significant (one effect changed from P = 0.012 to P = 0.053; the other from P = 0.044 to P = 0.062). Although we believe the new results do not alter our main conclusions, marginally significant findings should be interpreted with caution. We have noted this point in the Discussion section (Page 48): “… marginally significant results should be viewed cautiously and warrant further examination in future studies with larger sample sizes.”

      In the revised manuscript, we have described the cross-validation procedure in detail and reported the corresponding results. Please see the Method section, Page 23: “The results showed that the neural responses in the temporoparietal junction/superior temporal sulcus (TPJ/STS) and precentral cortex/postcentral cortex/supplementary motor area (PRC/POC/SMA) were negatively correlated with the responsibility-driven shame sensitivity. To test whether these regions were more involved in responsibility-driven shame sensitivity than in other types of emotion sensitivity, we implemented a leave-one-subject-out (LOSO) cross-validation procedure (e.g., Esterman et al., 2010). In each fold, clusters in the TPJ/STS and PRC/POC/SMA showing significant correlations with responsibility-driven shame sensitivity were identified at the group level based on N-1 participants. These clusters, defined as regions of interest (ROI), were then applied to the left-out participant, from whom we extracted the mean parameter estimates (i.e., neural response values). If, in a given fold, no suprathreshold cluster was detected within the TPJ/STS or PRC/POC/SMA after correction, or if the two regions merged into a single cluster that could not be separated, the corresponding value was coded as missing. Repeating this procedure across all folds yielded an independent set of ROI-based estimates for each participant. In the LOSO cross-validation procedure, the TPJ/STS and PRC/POC/SMA merged into a single inseparable cluster in two folds, and no suprathreshold cluster was detected within the TPJ/STS in one fold. These instances were coded as missing, resulting in valid data from 39 participants for the TPJ/STS and 40 participants for the PRC/POC/SMA.
      We then correlated these estimates with all four types of emotion sensitivities and compared the correlation with responsibility-driven shame sensitivity against those with the other sensitivities using Z tests (Pearson and Filon's Z).” and Page 24: “To directly test whether these regions were more involved in one of the two types of compensatory sensitivity, we applied the same LOSO cross-validation procedure described above. In this procedure, no suprathreshold cluster was detected within the LPFC in one fold and within the TP in 27 folds. These cases were coded as missing, resulting in valid data from 42 participants for the bilateral IPL, 41 participants for the LPFC, and 15 participants for the TP. The limited sample size for the TP likely reflects that its effect was only marginally above the correction threshold, such that the reduced power in cross-validation often rendered it nonsignificant. Because the sample size for the TP was too small and the results may therefore be unreliable, we did not pursue further analyses for this region. The independent ROI-based estimates were then correlated with both guilt-driven and shame-driven compensatory sensitivities, and the strength of the correlations was compared using Z tests (Pearson and Filon's Z).”
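      The logic of the LOSO procedure quoted above can be sketched in a heavily simplified form. This is not the analysis pipeline used in the study: the "ROI" here is just the set of voxels whose correlation with the sensitivity score passes a fixed threshold in the N-1 training participants, whereas the actual procedure used corrected cluster-level inference. The function names, the threshold, and the toy data are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def loso_roi_estimates(betas, scores, r_threshold=-0.5):
    """betas[s][v]: response of participant s at voxel v; scores[s]: sensitivity.

    For each fold, the ROI is defined from the other participants only and
    then applied to the left-out participant. Folds with no qualifying voxel
    are coded as missing (None).
    """
    n_subjects, n_voxels = len(betas), len(betas[0])
    estimates = []
    for left_out in range(n_subjects):
        train = [s for s in range(n_subjects) if s != left_out]
        train_scores = [scores[s] for s in train]
        roi = [v for v in range(n_voxels)
               if pearson([betas[s][v] for s in train], train_scores) <= r_threshold]
        if not roi:
            estimates.append(None)  # no suprathreshold voxel in this fold
        else:
            estimates.append(sum(betas[left_out][v] for v in roi) / len(roi))
    return estimates


# Toy data: voxel 0 tracks the score negatively, voxel 1 carries no signal.
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
betas = [[-s, 0.0] for s in scores]
estimates = loso_roi_estimates(betas, scores)

# With no signal anywhere, every fold is coded as missing.
missing = loso_roi_estimates([[0.0, 0.0] for _ in scores], scores)
```

      Because the left-out participant never contributes to the definition of the ROI, the extracted estimates are independent of the selection step, which is the property the cross-validation is meant to secure.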

      Please see the Results section, Pages 34 and 35: “To assess whether these brain regions were specifically involved in responsibility-driven shame sensitivity, we compared the Pearson correlations between their activity and all types of emotion sensitivities. The results demonstrated the domain specificity of these regions, by revealing that the TPJ/STS cluster had significantly stronger negative responses to responsibility-driven shame sensitivity than to responsibility-driven guilt sensitivity (Z = 2.44, P = 0.015) and harm-driven shame sensitivity (Z = 3.38, P < 0.001), and a marginally stronger negative response to harm-driven guilt sensitivity (Z = 1.87, P = 0.062) (Figure 4C; Supplementary Table 14). In addition, the sensorimotor areas (i.e., precentral cortex (PRC), postcentral cortex (POC), and supplementary motor area (SMA)) exhibited the similar activation pattern as the TPJ/STS (Figure 4B and 4C; Supplementary Tables 13 and 14).” and Page 35: “The results revealed that the left LPFC was more engaged in shame-driven compensatory sensitivity (Z = 1.93, P = 0.053), as its activity showed a marginally stronger positive correlation with shame-driven sensitivity than with guilt-driven sensitivity (Figure 5C). No significant difference was found in the Pearson correlations between the activity of the bilateral IPL and the two types of sensitivities (Supplementary Table 16). For the TP, the effective sample size was too small to yield reliable results (see Methods).”

      (1) Regarding the traits of guilt and shame, I appreciate using the scores from the subscales (evaluations and action tendencies) separately for the analyses (instead of a composite score). An issue with using the actions subscales when measuring guilt and shame proneness is that the behavioral tendencies for each emotion get conflated with their definitions, risking circularity. It is reassuring that the behavior evaluation subscale was significantly correlated with compensatory behavior (not only the action tendencies subscale). However, the absence of significant neural correlates for the behavior evaluation subscale raises questions: Do the authors have thoughts on why this might be the case, and any implications?

      We are grateful for this important comment. According to the Guilt and Shame Proneness Scale, trait guilt comprises two dimensions: negative behavior evaluations and repair action tendencies (Cohen et al., 2011). Behaviorally, both dimensions were significantly correlated with participants’ compensatory behavior (negative behavior evaluations: R = 0.39, P = 0.010; repair action tendencies: R = 0.33, P = 0.030). Neurally, while repair action tendencies were significantly associated with activity in the aMCC and other brain areas, negative behavior evaluations showed no significant neural correlates. The absence of significant neural correlates for negative behavior evaluations may be due to several factors. In addition to common explanations (e.g., limited sample size reducing the power to detect weak neural correlates or subtle effects obscured by fMRI noise), another possibility is that this dimension influences neural responses indirectly through intermediate processes not captured in our study (e.g., specific motivational states). We have added a discussion of the non-significant result to the revised manuscript (Page 47): “However, the neural correlates of negative behavior evaluations (another dimension of trait guilt) were absent. The reasons underlying the non-significant neural finding may be multifaceted. One possibility is that negative behavior evaluations influence neural responses indirectly through intermediate processes not captured in our study (e.g., specific motivational states).”

      In addition, to avoid misunderstanding, the revised manuscript specifies at the appropriate places that the neural findings pertain to repair action tendencies rather than to trait guilt in general. For instance, see Pages 46 and 47: “Furthermore, we found neural responses in the aMCC mediated the relationship between repair action tendencies (one dimension of trait guilt) and compensation… Accordingly, our fMRI findings suggest that individuals with stronger tendency to engage in compensation across various moral violation scenarios (indicated by their repair action tendencies) are more sensitive to the severity of the violation and therefore engage in greater compensatory behavior.”

      (2) Regarding the computational model finding that participants seem to disregard self-interest, do the authors believe it may reflect the relatively small endowment at stake? Do the authors believe this behavior would persist if the stakes were higher?

      Additionally, might the type of harm inflicted (e.g., electric shock vs. less stigmatized/less ethically charged harm like placing a hand in ice-cold water) influence the weight of self-interest in decision-making?

      Taken together, the conclusions of the paper are well supported by the data. It would be valuable for future studies to validate these findings using alternative tasks or paradigms to ensure the robustness and generalizability of the observed behavioral and neural mechanisms.

      Thank you for these important questions. As you suggested, we believe that the relatively small personal stakes in our task (a maximum loss of 5 Chinese yuan) likely explain why the computational model indicated that participants disregarded self-interest. We also agree that when the harm to others is less morally charged, people may be more inclined to consider self-interest in compensatory decision-making. Overall, the more stigmatized the harm and the smaller the personal stakes, the more likely individuals are to disregard self-interest and focus solely on making appropriate compensation.

      We have added the following passage to the Discussion section (Page 42): “Notably, in many computational models of social decision-making, self-interest plays a crucial role (e.g., Wu et al., 2024). However, our computational findings suggest that participants disregarded self-interest during compensatory decision-making. A possible explanation is that the personal stakes in our task were relatively small (a maximum loss of 5 Chinese yuan), whereas the harm inflicted on the receiver was highly stigmatized (i.e., an electric shock). Under conditions where the harm is highly salient and the cost of compensation is low, participants may be inclined to disregard self-interest and focus solely on making appropriate compensation.”

      Reviewer #2 (Public review):

      Summary

      The authors combined behavioral experiments, computational modeling, and functional magnetic resonance imaging (fMRI) to investigate the psychological and neural mechanisms underlying guilt, shame, and the altruistic behaviors driven by these emotions. The results revealed that guilt is more strongly associated with harm, whereas shame is more closely linked to responsibility. Compared to shame, guilt elicited a higher level of altruistic behavior. Computational modeling demonstrated how individuals integrate information about harm and responsibility. The fMRI findings identified a set of brain regions involved in representing harm and responsibility, transforming responsibility into feelings of shame, converting guilt and shame into altruistic actions, and mediating the effect of trait guilt on compensatory behavior.

      Strengths

      This study offers a significant contribution to the literature on social emotions by moving beyond prior research that typically focused on isolated aspects of guilt and shame. The study presents a comprehensive examination of these emotions, encompassing their cognitive antecedents, affective experiences, behavioral consequences, trait-level characteristics, and neural correlates. The authors have introduced a novel experimental task that enables such a systematic investigation and holds strong potential for future research applications. The computational modeling procedures were implemented in accordance with current field standards. The findings are rich and offer meaningful theoretical insights. The manuscript is well written, and the results are clearly and logically presented.

      We are thankful for your considerate acknowledgment of our work’s strengths and truly value your positive comments.

      We would like to note that, in accordance with the journal’s requirements, we have uploaded both a clean version of the revised manuscript and a version with all modifications highlighted in blue.

      Weakness

      In this study, participants' feelings of guilt and shame were assessed retrospectively, after they had completed all altruistic decision-making tasks. This reliance on memory-based self-reports may introduce recall bias, potentially compromising the accuracy of the emotion measurements.

      Thank you for this crucial comment. We fully agree that measuring guilt and shame after the task may affect accuracy to some extent. However, because participants reported their emotions immediately after completing the task, we believe their recollections were reasonably accurate. In designing the experiment, we considered in-task assessments, but this approach risked heightening participants’ awareness of guilt and shame and thereby interfering with compensatory decisions. After careful consideration, we ultimately chose post-task assessments of these emotions. A similar approach has been adopted in prior research on gratitude, where post-task assessments were also used (Yu et al., 2018).

      In the revised manuscript, we have specified the limitations of both post-task and in-task assessments of guilt and shame (Page 47): “… post-task assessments of guilt and shame, unlike in-task assessments, rely on memory and may thus be less precise, although in-task assessments could have heightened participants’ awareness of these emotions and biased their decisions.”

      In many behavioral economic models, self-interest plays a central role in shaping individual decision-making, including moral decisions. However, the model comparison results in this study suggest that models without a self-interest component (such as Model 1.3) outperform those that incorporate it (such as Model 1.1 and Model 1.2). The authors have not provided a satisfactory explanation for this counterintuitive finding. 

      Thank you for this important comment. In the revised manuscript, we have provided a possible explanation (Page 42): “Notably, in many computational models of social decision-making, self-interest plays a crucial role (e.g., Wu et al., 2024). However, our computational findings suggest that participants disregarded self-interest during compensatory decision-making. A possible explanation is that the personal stakes in our task were relatively small (a maximum loss of 5 Chinese yuan), whereas the harm inflicted on the receiver was highly stigmatized (i.e., an electric shock). Under conditions where the harm is highly salient and the cost of compensation is low, participants may be inclined to disregard self-interest and focus solely on making appropriate compensation.”

      The phrases "individuals integrate harm and responsibility in the form of a quotient" and "harm and responsibility are integrated in the form of a quotient" appear in the Abstract and Discussion sections. However, based on the results of the computational modeling, it is more accurate to state that "harm and the number of wrongdoers are integrated in the form of a quotient." The current phrasing misleadingly suggests that participants represent information as harm divided by responsibility, which does not align with the modeling results. This potentially confusing expression should be revised for clarity and accuracy.

      We sincerely thank you for this helpful suggestion and apologize for the confusion caused. We have removed expressions such as “harm and responsibility are integrated in the form of a quotient” from the manuscript. Instead, we now state more precisely that “harm and the number of wrongdoers are integrated in the form of a quotient.”

      However, in certain contexts we continue to discuss harm and responsibility. Introducing “the number of wrongdoers” in these places would appear abrupt, so we have opted for alternative phrasing. For example, on Page 3, we now write:

      “Computational modeling results indicated that the integration of harm and responsibility by individuals is consistent with the phenomenon of responsibility diffusion.” Similarly, on Page 49, we state: “Notably, harm and responsibility are integrated in a manner consistent with responsibility diffusion prior to influencing guilt-driven and shame-driven compensation.”

      In the Discussion, the authors state: "Since no brain region associated with social cognition showed significant responses to harm or responsibility, it appears that the human brain encodes a unified measure integrating harm and responsibility (i.e., the quotient) rather than processing them as separate entities when both are relevant to subsequent emotional experience and decision-making." However, this interpretation overstates the implications of the null fMRI findings. The absence of significant activation in response to harm or responsibility does not necessarily imply that the brain does not represent these dimensions separately. Null results can arise from various factors, including limitations in the sensitivity of fMRI. It is possible that more fine-grained techniques, such as intracranial electrophysiological recordings, could reveal distinct neural representations of harm and responsibility. The interpretation of these null findings should be made with greater caution.

      Thank you for this reminder. In the revised manuscript, we have provided a more cautious interpretation of the results (Page 43): “Although the fMRI findings revealed that no brain region associated with social cognition showed significant responses to harm or responsibility, this does not suggest that the human brain encodes only a unified measure integrating harm and responsibility and does not process them as separate entities. Using more fine-grained techniques, such as intracranial electrophysiological recordings, it may still be possible to observe independent neural representations of harm and responsibility.”

      Reviewer #3 (Public review):

      Summary

      Zhu et al. set out to elucidate how the moral emotions of guilt and shame emerge from specific cognitive antecedents - harm and responsibility - and how these emotions subsequently drive compensatory behavior. Consistent with their prediction derived from functionalist theories of emotion, their behavioral findings indicate that guilt is more influenced by harm, whereas shame is more influenced by responsibility. In line with previous research, their results also demonstrate that guilt has a stronger facilitating effect on compensatory behavior than shame. Furthermore, computational modeling and neuroimaging results suggest that individuals integrate harm and responsibility information into a composite representation of the individual's share of the harm caused. Brain areas such as the striatum, insula, temporoparietal junction, lateral prefrontal cortex, and cingulate cortex were implicated in distinct stages of the processing of guilt and/or shame. In general, this work makes an important contribution to the field of moral emotions. Its impact could be further enhanced by clarifying methodological details, offering a more nuanced interpretation of the findings, and discussing their potential practical implications in greater depth.

      Strengths

      First, this work conceptualizes guilt and shame as processes unfolding across distinct stages (cognitive appraisal, emotional experience, and behavioral response) and investigates the psychological and neural characteristics associated with their transitions from one stage to the next.

      Second, the well-designed experiment effectively manipulates harm and responsibility - two critical antecedents of guilt and shame.

      Third, the findings deepen our understanding of the mechanisms underlying guilt and shame beyond what has been established in previous research.

      We truly appreciate your acknowledgment of our work’s strengths and your encouraging feedback.

      We would like to note that, in accordance with the journal’s requirements, we have uploaded both a clean version of the revised manuscript and a version with all modifications highlighted in blue.

      Weakness

      Over the course of the task, participants may gradually become aware of their high error rate in the dot estimation task. This could lead them to discount their own judgments and become inclined to rely on the choices of other deciders. It is unclear whether participants in the experiment had the opportunity to observe or inquire about others' choices. This point is important, as the compensatory decision-making process may differ depending on whether choices are made independently or influenced by external input.

      Thank you for pointing this out. We apologize for not making the experimental procedure sufficiently clear. Participants (as deciders) were informed that each decider performed the dot estimation independently and was unaware of the estimations made by the other deciders. We have now clarified this point in the revised manuscript (Pages 10 and 11): “Each decider indicated whether the number of dots was more than or less than 20 based on their own estimation by pressing a corresponding button (dots estimation period, < 2.5 s) and was unaware of the estimations made by other deciders”.

      Given the inherent complexity of human decision-making, it is crucial to acknowledge that, although the authors compared eight candidate models, other plausible alternatives may exist. As such, caution is warranted when interpreting the computational modeling results.

      Thank you for this comment. We fully agree with your opinion. Although we tried to build a conceptually comprehensive model space based on prior research and our own understanding, we did not include all plausible models, nor would it be feasible to do so. We acknowledge it as a limitation in the revised manuscript (Page 47): “... although we aimed to construct a conceptually comprehensive computational model space informed by prior research and our own understanding, it does not encompass all plausible models. Future research is encouraged to explore additional possibilities.”

      I do not agree with the authors' claim that "computational modeling results indicated that individuals integrate harm and responsibility in the form of a quotient" (i.e., harm/responsibility). Rather, the findings appear to suggest that individuals may form a composite representation of the harm attributable to each individual (i.e., harm/the number of people involved). The explanation of the modeling results ought to be precise.

      We appreciate your comment and apologize for the imprecise description. In the revised manuscript, we now use the expressions “… integrate harm and the number of wrongdoers in the form of a quotient.” and “… the integration of harm and responsibility by individuals is consistent with the phenomenon of responsibility diffusion.” For example, on Page 19, we state: “It assumes that individuals neglect their self-interest, have a compensatory baseline, and integrate harm and the number of wrongdoers in the form of a quotient.” On Page 3, we state: “Computational modeling results indicated that the integration of harm and responsibility by individuals is consistent with the phenomenon of responsibility diffusion.”
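      To make the distinction between "harm/responsibility" and "harm/the number of wrongdoers" concrete, the following is a minimal sketch of a quotient-style integration. The linear form, the function names, and the parameter names `kappa` and `baseline` are assumptions chosen for illustration; they are not the fitted model reported in the manuscript.

```python
# Illustrative sketch only: the functional form and all names here are
# assumptions for exposition, not the authors' fitted model.


def harm_share(harm: float, n_wrongdoers: int) -> float:
    """Harm attributable to each wrongdoer: total harm divided by head count."""
    if n_wrongdoers < 1:
        raise ValueError("n_wrongdoers must be at least 1")
    return harm / n_wrongdoers


def predicted_compensation(harm: float, n_wrongdoers: int,
                           kappa: float, baseline: float) -> float:
    """Hypothetical linear rule: baseline plus kappa times the per-person harm.

    No self-interest term appears, mirroring the 'neglect their
    self-interest' assumption described in the text.
    """
    return baseline + kappa * harm_share(harm, n_wrongdoers)


if __name__ == "__main__":
    # Same harm, more wrongdoers: the per-person share shrinks, and so does
    # the predicted compensation (the pattern called responsibility diffusion).
    for n in (1, 2, 4):
        print(n, harm_share(4.0, n),
              predicted_compensation(4.0, n, kappa=0.5, baseline=1.0))
```

      In this sketch the divisor is a head count rather than a responsibility rating, which is the point both reviewers raised: each additional wrongdoer reduces the share of harm attributed to the individual.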

      Many studies have reported positive associations between trait gratitude, social value orientation, and altruistic behavior. It would be helpful if the authors could provide an explanation about why this study failed to replicate these associations.

      Thanks a lot for this important comment. We have now added an explanation into the revised manuscript (Page 47): “Although previous research has found that trait gratitude and SVO are significantly associated with altruistic behavior in contexts such as donation (Van Lange et al., 2007; Yost-Dubrow & Dunham, 2018) and reciprocity (Ma et al., 2017; Yost-Dubrow & Dunham, 2018), their associations with compensatory decisions in the present study were not significant. This suggests that the effects of trait gratitude and SVO on altruistic behavior are context-dependent and may not predict all forms of altruistic behavior.”

      As the authors noted, guilt and shame are closely linked to various psychiatric disorders. It would be valuable to discuss whether this study has any implications for understanding or even informing the treatment of these disorders.

      We are grateful for this advice. Although our study did not directly examine patients with psychological disorders, the findings offer insights into the regulation of guilt and shame. As these emotions are closely linked to various disorders, improving their regulation may help alleviate related symptoms. Accordingly, we have added a paragraph highlighting the potential clinical relevance (Pages 48 and 49): “Our study has potential practical implications. The behavioral findings may help counselors understand how cognitive interventions targeting perceptions of harm and responsibility could influence experiences of guilt and shame. The neural findings highlight specific brain regions (e.g., TPJ) as potential intervention targets for regulating these emotions. Given the close links between guilt, shame, and various psychological disorders (e.g., Kim et al., 2011; Lee et al., 2001; Schuster et al., 2021), strategies to regulate these emotions may contribute to symptom alleviation. Nevertheless, because this study was conducted with healthy adults, caution is warranted when considering applications to other populations.”

      Reviewer #1 (Recommendations for the authors):

      (1) Would it be interesting to explore other categories of behavior apart from compensatory behavior?

      Thanks a lot for this insightful question. We focused on a classic form of altruistic behavior, compensation. Future studies are encouraged to adapt our paradigm to examine other behaviors associated with guilt and/or shame, such as donation (Xu, 2022), avoidance (Shen et al., 2023), or aggression (Velotti et al., 2014). Please see Page 48: “Future research could combine this paradigm with other cognitive neuroscience methods, such as electroencephalography (EEG) or magnetoencephalography (MEG), and adapt it to investigate additional behaviors linked to guilt and shame, including donation (Xu, 2022), avoidance (Shen et al., 2023), and aggression (Velotti et al., 2014).”

      (2) Did the computational model account for the position of the block (slider) at the start of each decision-making response (when participants had to decide how to divide the endowment)? Or are anchoring effects not relevant/not a concern?

      Thank you for this interesting question. In our task, the initial position of the slider was randomized across trials, and participants were explicitly informed of this in the instructions. This design minimized stable anchoring effects across trials, as participants could not rely on a consistent starting point. Although anchoring might still have influenced individual trial responses, we believe it is unlikely that such effects systematically biased our results, since randomization would tend to cancel them out across trials. Additionally, prior research has shown that when multiple anchors are presented, anchoring effects are reduced if the anchors contradict each other (Switzer III & Sniezek, 1991). Therefore, we did not attempt to model potential anchoring effects. Nevertheless, future research could systematically manipulate slider starting positions to directly examine possible anchoring influences. In the revised manuscript, we have added a brief clarification (Page 11): “The initial position of the block was randomized across trials, which helped minimize stable anchoring effects across trials.”

      (3) Was there a real receiver who experienced the shocks and received compensation? I think it is not completely clear in the paper.

      We are sorry for not making this clear enough. The receiver was fictitious and did not actually exist. We have supplemented the Methods section with the following description (Page 12): “We told the participant a cover story that the receiver was played by another college student who was not present in the laboratory at the time. … In fact, the receiver did not actually exist.”

      (4) What was the rationale behind not having participants meet the receiver?

      Thank you for this question. Having participants meet the receiver (i.e., the victim), played by a confederate, might have intensified their guilt and shame and produced a ceiling effect. In addition, the current approach simplified the experimental procedure and removed the need to recruit an additional confederate. These reasons have been added to the Methods section (Page 12): “Not having participants meet the receiver helped prevent excessive guilt and shame that might produce a ceiling effect, while also eliminating the need to recruit an additional confederate.”

      Minor edits:

      (1) Line 49: "the cognitive assessment triggers them", I think a word is missing.

      (2) Line 227: says 'Slide' instead of 'Slider'.

      (3) Lines 867/868: "No brain response showed significant correlation with responsibility-driven guilt sensitivity, harm-driven shame sensitivity, or responsibility-driven shame sensitivity." I think it should be harm-driven guilt sensitivity, responsibility-driven guilt sensitivity, and harm-driven shame sensitivity.

      (4) Supplementary Information Line 12: I think there is a typo ('severs' instead of 'serves').

      We sincerely thank you for patiently pointing out these typos. We have corrected them accordingly. 

      (1) “the cognitive assessment triggers them” has been revised to “the cognitive antecedents that trigger them” (Page 2).

      (2) “SVO Slide Measure” has been revised to “SVO Slider Measure” (Page 8).

      (3) “No brain response showed significant correlation with responsibility-driven guilt sensitivity, harm-driven shame sensitivity, or responsibility-driven shame sensitivity.” has been revised to “No brain response showed significant correlation with harm-driven guilt sensitivity, responsibility-driven guilt sensitivity, and harm-driven shame sensitivity.” (Page 35).

      (4) “severs” has been revised to “serves” (see Supplementary Information). In addition, we have carefully checked the entire manuscript to correct any remaining typographical errors.

      Reviewer #2 (Recommendations for the authors):

      The statement that trait gratitude and SVO were measured "for exploratory purposes" would benefit from further clarification regarding the specific questions being explored.

      Thank you for this valuable suggestion. In the revised manuscript, we have clarified the exploratory purposes (Page 9): “We measured trait gratitude and SVO for exploratory purposes. Previous research has shown that both are linked to altruistic behavior, particularly in donation contexts (Van Lange et al., 2007; Yost-Dubrow & Dunham, 2018) and reciprocity contexts (Ma et al., 2017; Yost-Dubrow & Dunham, 2018). Here, we explored whether they also exert significant effects in a compensatory context.”

      In the Methods section, the authors state: "To confirm the relationships between κ and guilt-driven and shame-driven compensatory sensitivities, we calculated the Pearson correlations between them." However, the Results section reports linear regression results rather than Pearson correlation coefficients, suggesting a possible inconsistency. The authors are advised to carefully check and clarify the analysis approach used.

      We thank you for the careful review and apologize for this mistake. We used a linear mixed-effects regression instead of Pearson correlations for the analysis. The mistake has been corrected (Page 25): “To confirm the relationships between κ and guilt-driven and shame-driven compensatory sensitivities, we conducted a linear mixed-effects regression. κ was regressed onto guilt-driven and shame-driven compensatory sensitivities, with participant-specific random intercepts and random slopes for each fixed effect included as random effects.”
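      For readers who prefer the regression in symbols, one generic way to write a model with participant-specific random intercepts and random slopes for both predictors is sketched below. The notation is assumed for illustration and is not taken from the manuscript: $i$ indexes observations within participant $j$ (what counts as an observation is not stated here), $G$ and $S$ denote guilt-driven and shame-driven compensatory sensitivity, $\beta$ are fixed effects, $u_{0j}$ is the random intercept, and $u_{1j}$, $u_{2j}$ are the random slopes.

```latex
\kappa_{ij} = (\beta_0 + u_{0j})
            + (\beta_1 + u_{1j})\, G_{ij}
            + (\beta_2 + u_{2j})\, S_{ij}
            + \varepsilon_{ij},
\qquad
(u_{0j}, u_{1j}, u_{2j})^{\top} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}),
\quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^{2})
```

      Under this form, the fixed effects $\beta_1$ and $\beta_2$ are the quantities that would be tested to confirm the relationships with κ, which is why regression coefficients rather than Pearson correlation coefficients would appear in the Results.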

      A more detailed discussion of how the current findings inform the regulation of guilt and shame would further strengthen the contribution of this study.

      Thank you for this suggestion. We have added a paragraph discussing the implications for the regulation of guilt and shame (Pages 48 and 49): “Our study has potential practical implications. The behavioral findings may help counselors understand how cognitive interventions targeting perceptions of harm and responsibility could influence experiences of guilt and shame. The neural findings highlight specific brain regions (e.g., TPJ) as potential intervention targets for regulating these emotions. Given the close links between guilt, shame, and various psychological disorders (e.g., Kim et al., 2011; Lee et al., 2001; Schuster et al., 2021), strategies to regulate these emotions may contribute to symptom alleviation. Nevertheless, because this study was conducted with healthy adults, caution is warranted when considering applications to other populations.”

      As fMRI provides only correlational evidence, establishing a causal link between neural activity and guilt- or shame-related cognition and behavior would require brain stimulation or other intervention-based methods. This may represent a promising direction for future research.

      Thank you for this advice. We also agree that it is important for future research to establish the causal relationships between the observed brain activity, psychological processes, and behavior. We have added a corresponding discussion in the revised manuscript (Pages 47 and 48): “… fMRI cannot establish causality. Future studies using brain stimulation techniques (e.g., transcranial magnetic stimulation) are needed to clarify the causal role of brain regions in guilt-driven and shame-driven altruistic behavior.”

      Reviewer #3 (Recommendations for the authors):

      It was mentioned that emotions beyond guilt and shame, such as indebtedness, may also drive compensation. Were any additional types of emotion measured in the study?

      Thank you for this question. We did not explicitly measure emotions other than guilt and shame. However, the parameter κ from our winning computational model captures the combined influence of various psychological processes on compensation, which may reflect the impact of emotions beyond guilt and shame (e.g., indebtedness). We acknowledge that measuring other emotions similar to guilt and shame may help to better understand their distinct contributions. This point has been added into the revised manuscript (Page 48): “… we did not explicitly measure emotions similar to guilt and shame (e.g., indebtedness), which would have been helpful for understanding their distinct contributions.”

      The experimental task is complicated, raising the question of whether participants fully understood the instructions. For instance, one participant's compensation amount was zero. Could this reflect a misunderstanding of the task instructions?

      Thanks a lot for this question. In our study, after reading the instructions, participants were required to complete a comprehension test on the experimental rules. If they made any mistakes, the experimenter provided additional explanations. Only after participants fully understood the rules and correctly answered all comprehension questions did they proceed to the main experimental task. We have clarified this procedure in the revised manuscript (Page 13): “Participants did not proceed to the interpersonal game until they had fully understood the experimental rules and passed a comprehension test.”

      Making identical choices across different trials does not necessarily indicate that participants misunderstood the rules. Similar patterns, where participants made the same choices across trials, have also been observed in previous studies (Zhong et al., 2016; Zhu et al., 2021).

      References

      Cohen, T. R., Wolf, S. T., Panter, A. T., & Insko, C. A. (2011). Introducing the GASP scale: a new measure of guilt and shame proneness. Journal of Personality and Social Psychology, 100(5), 947–966. https://doi.org/10.1037/a0022641

      Esterman, M., Tamber-Rosenau, B. J., Chiu, Y. C., & Yantis, S. (2010). Avoiding nonindependence in fMRI data analysis: Leave one subject out. NeuroImage, 50(2), 572–576. https://doi.org/10.1016/j.neuroimage.2009.10.092

      Kim, S., Thibodeau, R., & Jorgensen, R. S. (2011). Shame, guilt, and depressive symptoms: A meta-analytic review. Psychological Bulletin, 137(1), 68. https://doi.org/10.1037/a0021466

      Lee, D. A., Scragg, P., & Turner, S. (2001). The role of shame and guilt in traumatic events: A clinical model of shame-based and guilt-based PTSD. British Journal of Medical Psychology, 74(4), 451–466. https://doi.org/10.1348/000711201161109

      Ma, L. K., Tunney, R. J., & Ferguson, E. (2017). Does gratitude enhance prosociality?: A meta-analytic review. Psychological Bulletin, 143(6), 601–635. https://doi.org/10.1037/bul0000103

      Michl, P., Meindl, T., Meister, F., Born, C., Engel, R. R., Reiser, M., & Hennig-Fast, K. (2014). Neurobiological underpinnings of shame and guilt: A pilot fMRI study. Social Cognitive and Affective Neuroscience, 9(2), 150–157.

      Schuster, P., Beutel, M. E., Hoyer, J., Leibing, E., Nolting, B., Salzer, S., Strauss, B., Wiltink, J., Steinert, C., & Leichsenring, F. (2021). The role of shame and guilt in social anxiety disorder. Journal of Affective Disorders Reports, 6, 100208. https://doi.org/10.1016/j.jadr.2021.100208

      Shen, B., Chen, Y., He, Z., Li, W., Yu, H., & Zhou, X. (2023). The competition dynamics of approach and avoidance motivations following interpersonal transgression. Proceedings of the National Academy of Sciences, 120(40), e2302484120. https://doi.org/10.1073/pnas.2302484120

      Switzer III, F. S., & Sniezek, J. A. (1991). Judgment processes in motivation: Anchoring and adjustment effects on judgment and behavior. Organizational Behavior and Human Decision Processes, 49(2), 208–229. https://doi.org/10.1016/0749-5978(91)90049-Y

      Van Lange, P. A. M., Bekkers, R., Schuyt, T. N. M., & Van Vugt, M. (2007). From games to giving: Social value orientation predicts donations to noble causes. Basic and Applied Social Psychology, 29(4), 375–384. https://doi.org/10.1080/01973530701665223

      Velotti, P., Elison, J., & Garofalo, C. (2014). Shame and aggression: Different trajectories and implications. Aggression and Violent Behavior, 19(4), 454–461. https://doi.org/10.1016/j.avb.2014.04.011

      Wagner, U., N’Diaye, K., Ethofer, T., & Vuilleumier, P. (2011). Guilt-specific processing in the prefrontal cortex. Cerebral Cortex, 21(11), 2461–2470. https://doi.org/10.1093/cercor/bhr016

      Wu, X., Ren, X., Liu, C., & Zhang, H. (2024). The motive cocktail in altruistic behaviors. Nature Computational Science, 4, 659–676. https://doi.org/10.1038/s43588-024-00685-6

      Xu, J. (2022). The impact of guilt and shame in charity advertising: The role of self-construal. Journal of Philanthropy and Marketing, 27(1). https://doi.org/10.1002/nvsm.1709

      Yost-Dubrow, R., & Dunham, Y. (2018). Evidence for a relationship between trait gratitude and prosocial behaviour. Cognition and Emotion, 32(2), 397–403. https://doi.org/10.1080/02699931.2017.1289153

      Yu, H., Gao, X., Zhou, Y., & Zhou, X. (2018). Decomposing gratitude: Representation and integration of cognitive antecedents of gratitude in the brain. Journal of Neuroscience, 38(21), 4886–4898. https://doi.org/10.1523/JNEUROSCI.2944-17.2018

      Zhong, S., Chark, R., Hsu, M., & Chew, S. H. (2016). Computational substrates of social norm enforcement by unaffected third parties. NeuroImage, 129, 95–104. https://doi.org/10.1016/j.neuroimage.2016.01.040

      Zhu, R., Feng, C., Zhang, S., Mai, X., & Liu, C. (2019). Differentiating guilt and shame in an interpersonal context with univariate activation and multivariate pattern analyses. NeuroImage, 186, 476–486. https://doi.org/10.1016/j.neuroimage.2018.11.012

      Zhu, R., Xu, Z., Su, S., Feng, C., Luo, Y., Tang, H., Zhang, S., Wu, X., Mai, X., & Liu, C. (2021). From gratitude to injustice: Neurocomputational mechanisms of gratitude-induced injustice. NeuroImage, 245, 118730. https://doi.org/10.1016/j.neuroimage.2021.118730

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this review, the author covered several aspects of the inflammation response, mainly focusing on the mechanisms controlling leukocyte extravasation and inflammation resolution.

      Strengths:

      This review is based on an impressive number of sources, trying to comprehensively present a very broad and complex topic.

      Weaknesses:

      (1) This reviewer feels that, despite the title, this review is quite broad and not centred on the role of the extracellular matrix.

      Since this review focuses on the whole extravasation journey of leukocytes, the topic is necessarily broad and covers several related fields. The article highlights the involvement of extracellular matrices (ECM), which are important regulators in multiple phases of the process, as a common theme threading these related topics together. In the revised manuscript, we have further emphasized the roles of specific ECM components where appropriate (see point 2 below) and reorganized the last section to fit this theme (see point 3 below).

      (2) The review will benefit from a stronger focus on the specific roles of matrix components and dynamics, with more informative subheadings.

      ECM may exert their roles either as a collective structure or as individual components. In the latter case, although the ECM components concerned are named throughout the manuscript, they may not have been sufficiently obvious because they were often not mentioned in the subheadings. For sections discussing the functions of a specific ECM protein, or at least a specific class of ECM proteins, we have now included their names in the subheadings for clarity (sections 5 and 8). For other sections discussing functions that involve the ECM as a macrostructure, either in the form of the vascular basement membrane enabling force generation or by contributing to overall tissue stiffness to provide biophysical cues (sections 7 and 9-10), we have included the specific processes regulated in the subheadings, as in section 4.

      In the newly added discussion of the effects of matrikines on lymphocytes, we have also focused on the roles of specific ECM components (PGP and versican; lines 396-408). We hope these measures have made the subheadings more informative and clarified the roles of specific ECM components.

      (3) The macrophage phenotype section doesn't seem well integrated with the rest of the review (and is not linked to the ECM).

      Section 10-11 concerns how macrophage phenotypes affect the tissue fate following inflammation, that is, either to resolve inflammation and regenerate damages incurred or to sustain inflammation. This fate decision is an important aspect of this review: By furthering our understanding on the processes and mechanisms involved, we hope to gain the capability to properly control tissue outcomes in inflammatory diseases.

      In section 10, an emphasis is put on macrophage efferocytosis, for its documented efficiency to resolve tissue inflammation. Specific ECM components (type-V collagens and 𝑎2-laminins) could directly promote macrophage efferocytosis (line 494-499). On the other hand, changes in tissue stiffness, as a result of ECM turnover regulated by activities of leukocytes or other cell types like fibroblasts as described in section 9, also affects efferocytosis (line 504-507).

      We acknowledge that section 11 did not integrate well with the rest of the review; this section is now restructured. First, we describe how the ECM-regulated efferocytosis may be leveraged in disease modulation (line 522-529) and the need for a unified system to describe macrophage states for disease modulation (line 527-533), such that the cell states responsible for producing ECM regulators / effectors can be clarified (line 533-535). Given means to control macrophage cell states, this clarification will be useful for modulating pathologies involving ECM malfunctioning, which might be hinted at by the emergence or expansion of those responsible macrophage states in pathology (line 577-579, 581-585). Next, we provide the historical background of efforts to establish such a unified descriptive platform for macrophage states (line 538-548) and describe the recent solution offered by MIKA. MIKA is a pan-tissue archive of tissue macrophage cell states based on meta-analysis of published single-macrophage transcriptomes. We have described its establishment, its latest development (Supplementary Data 1-4) and how the complex tissue macrophage states are segmented into core and tissue-specific identities under this framework (line 548-560, Figure 5A). Under this identity framework, the expression of the different ECM regulators discussed in this review (the ECM per se, fibroblastic growth factors, or proteases and protease inhibitors that regulate ECM turnover or matrikine production) is examined and linked to specific macrophage identities to offer insights into their potential relevance in pathologies (line 561-586, Figure 5B).

      (4) Table 1 is difficult to follow. It could be reformatted to facilitate reading and understanding.

      We apologize for the complex setup. Table 1 is now reformatted to horizontal orientation to have enough space for the columns and reorganized for much easier comprehension.

      (5) Figure 2 appears very complex and broad.

      The original Figure 2 is now split into two separate figures (Figure 3-4). Since many processes of diverse natures influence the tissue decision of resolution/inflammation, Figure 3 serves to outline and summarise these processes. Figure 4 now focuses on the regulation and tissue-resolving roles of macrophage efferocytosis, and on how specific ECM components (type-V collagens and α2-laminins) or tissue stiffness contribute to acquisition of this cell state. We hope this split can better focus the messages and ease understanding.

      (6) Spelling and grammar should be thoroughly checked to improve the readability.

      The manuscript is now proofread again, with corrections made throughout the text.

      Reviewer #2 (Public review):

      Summary:

      The manuscript is a timely and comprehensive review of how the extracellular matrix (ECM), particularly the vascular basement membrane, regulates leukocyte extravasation, migration, and downstream immune function. It integrates molecular, mechanical, and spatial aspects of ECM biology in the context of inflammation, drawing from recent advances. The framing of ECM as an active instructor of immune cell fate is a conceptual strength.

      Strengths:

      (1) Comprehensive synthesis of ECM functions across leukocyte extravasation and post-transmigration activity.

      (2) Incorporation of recent high-impact findings alongside classical literature.

      (3) Conceptually novel framing of ECM as an active regulator of immune function.

      (4) Effective integration of molecular, mechanical, and spatial perspectives.

      Weaknesses:

      (1) Insufficient narrative linkage between the vascular phase (Sections 2-6) and the in-tissue phase (Sections 7-10).

      A transition paragraph between these two phases is now added between Section 6 and Section 7 to provide a narrative that ECM interaction events during extravasation affect downstream leukocyte functions (line 300-307).

      (2) Underrepresentation of lymphocyte biology despite mention in early sections.

      Although lymphocytes follow a similar extravasation principle as described in earlier sections, their in-tissue activities differ much from innate leukocytes. Discussion of crosstalk amongst T cells, innate leukocytes and matrikines is now incorporated into section 8 (line 396-408). Functional effects of tissue stiffness on different T cell subsets are now discussed in section 9 (line 456-469).

      (3) The MIKA macrophage identity framework is only loosely tied to ECM mechanisms.

      Section 11 is now restructured to integrate better with the ECM topics, with the associated Figure 3 changed to Figure 5. Specifically, under the MIKA framework, we have now linked specific macrophage identities to expression / production of ECM functional effectors or regulators discussed in this review to highlight their regulatory roles and potential relevance in pathologies. Reviewers #1 and #3 have also raised this issue; please refer to the response to point (3) of reviewer #1 for a detailed description.

      (4) Limited discussion of translational implications and therapeutic strategies.

      Besides translational implications or therapeutic strategies included in the original manuscript (line 291-298, 375-377, 421-424, 427-429, 508-511, 512-516 of the current manuscript), we have now included additional discussion to enrich these aspects (line 356-358, line 396-398, 402-403, 428, 436-439, 467-469, 523-536, 579-586).

      (5) Overly dense figure insets and underdeveloped links between ECM carryover and downstream immune phenotypes.

      The original Figure 1 containing the insets is now split into Figure 1-2 to avoid fitting overly dense information into a single figure and to better focus the message in each figure. To resolve the issue of overly dense insets, insets in Figure 1 are redrawn / reorganized. The original Figure 1C is moved to Figure 2A. The inset showing platelet plugging, together with the issue of diapedesis overloading described in the original Figure 1B, is reorganized to Figure 2B. In this way, Figure 1 focuses on the vascular barrier organization, overview of extravasation, and the force related events during endothelial junctional remodelling. Figure 2 focuses on the low expression regions, and junctional sealing processes after diapedesis.

      We have now expanded discussion on ECM carryovers and their reported or implicated effects on downstream leukocyte functions (line 329-335).

      (6) Acronyms and some mechanistic details may limit accessibility for a broader readership.

      A glossary explaining specialized terms that may be confusing to readers of different fields is now included as Appendix 1 to broaden accessibility (line 977).

      Reviewer #3 (Public review):

      Summary & Strengths:

      This review by Yu-Tung Li sheds new light on the processes involved in leukocyte extravasation, with a focus on the interaction between leukocytes and the extracellular matrix. In doing so, it presents a fresh perspective on the topic of leukocyte extravasation, which has been extensively covered in numerous excellent reviews. Notably, the role of the extracellular matrix in leukocyte extravasation has received relatively little attention until recently, with a few exceptions, such as a study focusing on the central nervous system (J Inflamm 21, 53 (2024) doi.org/10.1186/s12950-024-00426-6) and another on transmigration hotspots (J Cell Sci (2025) 138 (11): jcs263862 doi.org/10.1242/jcs.263862). This review synthesizes the substantial knowledge accumulated over the past two decades in a novel and compelling manner.

      The author dedicates two sections to discussing the relevant barriers, namely, endothelial cell-cell junctions and the basement membrane. The following three paragraphs address how leukocytes interact with and transmigrate through endothelial junctions, the mechanisms supporting extravasation, and how minimal plasma leakage is achieved during this process. The subsequent question of whether the extravasation process affects leukocyte differentiation and properties is original and thought-provoking, having received limited consideration thus far. The consequences of the interaction between leukocytes and the extracellular matrix, particularly regarding efferocytosis, macrophage polarization, and the outcome of inflammation, are explored in the subsequent three chapters. The review concludes by examining tissue-specific states of macrophage identity.

      Weaknesses:

      Firstly, the first ten sections provide a comprehensive overview of the topic, presenting logical and well-formulated arguments that are easily accessible to a general audience. In stark contrast, the final section (Chapter 11) fails to connect coherently with the preceding review and is nearly incomprehensible without prior knowledge of the author's recent publication in Cell. Mol. Life Sci. CMLS 772 82, 14 (2024). This chapter requires significantly more background information for the general reader, including an introduction to the Macrophage Identity Kinetics Archive (MIKA), which is not even introduced in this review, its basis (meta-analysis of published scRNA-seq data), its significance (identification of major populations), and the reasons behind the revision of the proposed macrophage states and their further development.

      The issue of section 11 not being well integrated with the rest of the review has also been pointed out by other reviewers. In response, this section and the associated Figure 3 are now restructured for better integration with the theme of ECM. In brief, we have now discussed the regulatory roles of specific macrophage identities under the MIKA framework on the ECM regulators described in this review. Please refer to the response to point (3) of reviewer #1 for further details.

      Regarding the difficulties in understanding the MIKA framework without prior knowledge of our previous work, we thank the reviewer for pointing out this issue and for suggesting how to introduce the framework in a way that is easy to comprehend. Accordingly, in the current structure of section 11, we have described the rationale behind the need for a common descriptive platform for tissue macrophage states (line 523-536) and the previous historical efforts (line 538-548), introduced MIKA with mention of its establishment and significance (line 548-555), and explained the rationale behind its further development (line 555-560).

      Secondly, while the attempt to integrate a vast amount of information into fewer figures is commendable, it results in figures that resemble a complex puzzle. The author may consider increasing the number of figures and providing additional, larger "zoom-in" panels, particularly for the topics of clot formation at transmigration hotspots and the interaction between ECM/ECM fragments and integrins. Specifically, the color coding (purple for leukocyte α6-integrins, blue for interacting laminins, also blue for EC α6 integrins, and red for interacting 5-1-1 laminins) is confusing, and the structures are small and difficult to recognize.

      We apologize for the figures being too dense. Other reviewers have also raised this issue (see response to point (5) of reviewer #2 and response to point (5) of reviewer #1). The original Figure 1 and 2 are now reorganized to Figure 1-2 and 3-4 respectively, with insets also redrawn / expanded. Figure 1 now focuses on the vascular barrier organization, overview of extravasation, and the force related events during endothelial junctional remodelling. Figure 2 focuses on the low expression regions, and junctional sealing processes after diapedesis. Figure 3 serves to outline and summarise the diverse processes influencing tissue decision of resolution/inflammation. Figure 4 focuses on the regulation and tissue-resolving roles of macrophage efferocytosis. The original Figure 3, mainly concerning the methodological aspects of update of MIKA, is now integrated to Supplementary Data 1. This figure is now replaced as Figure 5 concerning the specific macrophage identities producing ECM effectors / regulators discussed in this review.

      The colour-coding issue concerned is now in Figure 2A. All integrins are now in sky blue and all laminins in red. VE-Cad is also in red but has a different size and shape than laminins. We hope these modifications have improved the figures and avoided confusion.

      Recommendations for the authors:

      As you will see, the reviewers thought your manuscript was interesting and timely. However, as part 11 and its corresponding Figure 3 seem somewhat detached from the rest of the manuscript, one recommendation would be to remove this part for improved clarity. Other recommendations can be found in the comments below.

      Reviewer #2 (Recommendations for the authors):

      (1) Improve narrative linkage between vascular extravasation (Sections 2-6) and in-tissue leukocyte activities (Sections 7-10) by adding explicit transition text that connects ECM changes during transmigration to downstream immune cell phenotypes.

      A transition paragraph is now added between section 6 and 7 (line 300-307).

      (2) Expand discussion of lymphocyte-ECM interactions, either within existing sections or as a dedicated subsection.

      We have now added discussion of the effects of matrikine on in vivo T cell traffic (line 396-409) and how T cell functions are regulated by tissue stiffness (line 457-466).

      (3) Strengthen integration of the MIKA macrophage identity framework with ECM-specific drivers (e.g., stiffness, matrikines) and reduce methodological detail in Fig. 3 to focus on biological relevance.

      We thank the reviewer for this recommendation and have adopted it accordingly. First, the methodological details in the original Fig. 3 are now integrated into Supplementary Data 1. This figure is now replaced by Fig. 5, which serves to examine the contribution of different macrophage identities to the ECM effectors / regulators (specifically, ECM per se, growth factors for ECM-producing fibroblasts, proteases and protease inhibitors) discussed in earlier sections. The relevant text is on line 561-586.

      (4) Consider adding a glossary of key terms (e.g., matrikines, efferocytosis) to aid accessibility.

      A glossary explaining selected terms that may be confusing to the general readership is now added as Appendix 1 (line 977).

      Reviewer #3 (Recommendations for the authors):

      The discussion of fibrosis as a significant consequence of inflammatory activity is currently limited to skin keloids and bleomycin-induced lung fibrosis. Considering the substantial clinical relevance, it would be beneficial to include a mention of the various forms of liver fibrosis resulting from chronic inflammation.

      Liver cirrhosis is now mentioned as a further example of stiffening tissues on line 428, 436-439.

      While the manuscript is generally well-written, there are several minor language issues that could be easily addressed by a native speaker during revisions. Some examples are listed below:

      We thank the reviewer for these very helpful suggestions. They are adopted with the relevant line number in the revised manuscript indicated below. In addition, the manuscript is proofread again, with other grammatical mistakes corrected throughout the text.

      (1) Line 40: ... proliferative pathogen, can be timely eliminated.

      line 40

      (2) Line 79: It may be worthwhile pointing out that while Claudin 5 expression is highest in the BBB, it is also relevant in the BRB and expressed at lower levels in peripheral ECs. Similarly, ZO-1 is widely found to be expressed in peripheral endothelial cells.

      Thanks for indicating this caution; it is now mentioned on line 79-82.

      (3) Line 82: affects leukocyte traffic and...

      line 84

      (4) Line 125: ..., both neutrophil and lymphocyte extravasation were reduced by ~60%

      line 125-126

      (5) Line 128: The term "paracellular endothelial junction" is odd, as junctions are per se paracellular, i.e., between cells.

      line 129

      (6) Line 147: ... VE-Cadherin, in which the FRET signal vanishes.

      line 148

      (7) Line 186: "activation by direct leukocyte pressing" might be rephrased to be clearer, e.g. "it might as well be activated by mechanical force exerted by leukocytes like it is the case for Piezo-1."

      line 185-186

      (8) Line 216: The phrasing "knockout analogy" is somewhat unfortunate. I would suggest "...a4 ko mice consequently largely lack a5 low expression regions and the resulting reduction in leukocyte extravasation confirms the facilitating role of the low a5 expression regions."

      line 217-218

      (9) Line 219: ...how the low expression regions form / are formed in the first place... The term construction implies active planning.

      line 220

      (10) Line 278: ... thrombocytopenic mice ...

      line 279

      (11) Line 294: ... use platelets as a drug delivery vehicle ...

      line 295

      (12) Line 304: instead of "could have changed", use "might change"

      line 315

      (13) Line 320: at the level of the monocyte

      line 336-337

      (14) Line 324: ... consistent with ...

      line 340

      (15) Line 335: ... progenitors

      line 351

      (16) Line 432: ... a considerable number of apoptotic neutrophils has (been) accumulated

      line 480

      (17) Line 442: ..., which promote killing responses, cross activate other leukocytes ..., or reduce tissue availability...

      line 490-491

      (18) Line 453: ...This macrophage is responsive to BMP...

      This sentence is now rephrased on line 500-501.

      (19) Line 454: ...involved in forming S1 macrophages.

      line 502

      (20) Line 476: ...numerous pathologies...

      Points (20-22) concern Section 11, which is now restructured (line 523-586).

      (21) Line 492: ...macrophages acquiring phenotypes specific to their residence tissue.

      (22) Line 498: ...either - the tissue macrophage is of heterogeneous nature... or - tissue macrophages are of heterogeneous nature...

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The work used open peer reviews and followed them through a succession of reviews and author revisions. It assessed whether a reviewer had requested the author include additional citations and references to the reviewer's work. It then assessed whether the author had followed these suggestions and what the probability of acceptance was based on the author's decision.

      Strengths and weaknesses:

      The work's strengths are the in-depth and thorough statistical analysis it contains and the very large dataset it uses. The methods are robust and reported in detail. However, this is also a weakness of the work. Such thorough analysis makes it very hard to read! It's a very interesting paper with some excellent and thought-provoking references, but it needs to be careful not to overstate the results and should improve the readability so it can be disseminated widely. It should also discuss more alternative explanations for the findings and, where possible, dismiss them.

      I have toned down the language including a more neutral title. To help focus on the main results, I have moved four paragraphs from the methods to the supplement. These are the sample size, the two sensitivity analyses on including co-reviewers and confounding by reviewers’ characteristics, and the analysis examining potential bias for the reviewers with no OpenAlex record.

      Reviewer #2 (Public review):

      Summary:

      This article examines reviewer coercion in the form of requesting citations to the reviewer's own work as a possible trade for acceptance and shows that, under certain conditions, this happens.

      Strengths:

      The methods are well done and the results support the conclusions that some reviewers "request" self-citations and may be making acceptance decisions based on whether an author fulfills that request.

      Weaknesses:

      The author needs to be clearer that, in some instances, requests for self-citations by reviewers are important and valuable.

      This is a key point. I have included a new text analysis to examine this issue and have addressed this in the updated discussion.

      Reviewer #3 (Public review):

      Summary:

      In this article, Barnett examines a pressing question regarding citing behavior of authors during the peer review process. In particular, the author studies the interaction between reviewers and authors, focusing on the odds of acceptance, and how this may be affected by whether or not the authors cited the reviewers' prior work, whether the reviewer requested such citations be added, and whether the authors complied/how that affected the reviewer decision-making.

      Strengths:

      The author uses a clever analytical design, examining four journals that use the same open peer review system, in which the identities of the authors and reviewers are both available and linkable to structured data. Categorical information about the approval is also available as structured data. This design allows a large scale investigation of this question.

      Weaknesses:

      My concerns pertain to the interpretability of the data as presented and the overly terse writing style.

      Regarding interpretability, it is often unclear what subset of the data are being used both in the prose and figures. For example, the descriptive statistics show many more Version 1 articles than Version 2+. How are the data subset among the different possible methods?

      I have now included the number of articles and reviews in the legends of each plot. There are more version 1 articles because some are “approved” at this stage and hence a second version is never submitted (I’ve now specifically mentioned this in the discussion).

      Likewise, the methods indicate that a matching procedure was used comparing two reviewers for the same manuscript in order to control for potential confounds. However, the number of reviews is less than double the number of Version 1 articles, making it unclear which data were used in the final analysis. The methods also state that data were stratified by version. This raises a question about which articles/reviews were included in each of the analyses. I suggest spending more space describing how the data are subset and stratified. This should include any conditional subsetting as in the analysis on the 441 reviews where the reviewer was not cited in Version 1 but requested a citation for Version 2. Each of the figures and tables, as well as statistics provided in the text should provide this information, which would make this paper much more accessible to the reader.

      [Note from editor: Please see "Editorial feedback" for more on this]

      The numbers are now given in every figure legend, and show the larger sample size for the first versions.

      The analysis of the 441 reviews was an unplanned analysis that is separate to the planned models. The sample size is much smaller than the main models due to the multiple conditions applied to the reviewers: i) reviewed both versions, ii) not cited in first version, iii) requested a self-citation in their first review.
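
      The three conditions listed above can be read as a simple filter over reviewer records. The following is a minimal sketch only: the record layout, the field names and the toy data are hypothetical and are not taken from the article's data or code.

```python
from dataclasses import dataclass


@dataclass
class ReviewerArticle:
    """One reviewer-article pairing; all field names are hypothetical."""
    reviewed_v1: bool
    reviewed_v2: bool
    cited_in_v1: bool
    requested_own_citation_v1: bool


def in_conditional_subset(r: ReviewerArticle) -> bool:
    """Apply the three conditions: (i) reviewed both versions,
    (ii) not cited in the first version, (iii) requested a citation
    to their own work in their first review."""
    reviewed_both = r.reviewed_v1 and r.reviewed_v2
    return reviewed_both and not r.cited_in_v1 and r.requested_own_citation_v1


# Toy records for illustration only (not real data).
toy_records = [
    ReviewerArticle(True, True, False, True),    # meets all three conditions
    ReviewerArticle(True, False, False, True),   # did not review version 2
    ReviewerArticle(True, True, True, True),     # already cited in version 1
    ReviewerArticle(True, True, False, False),   # made no such request
]

subset = [r for r in toy_records if in_conditional_subset(r)]
print(len(subset), "of", len(toy_records), "toy records remain")
```

      Because every condition removes records, the subset is necessarily much smaller than the sample used in the planned models.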

      Finally, I would caution against imputing motivations to the reviewers, despite the important findings provided here. This is because the data as presented suggest a more nuanced interpretation is warranted. First, the author observes similar patterns of accept/reject decisions whether the suggested citation is a citation to the reviewer or not (Figs 3 and 4). Second, much of the observed reviewer behavior disappears or has much lower effect sizes depending on whether "Accept with Reservations" is considered an Accept or a Reject. This is acknowledged in the results text, but largely left out of the discussion. The conditional analysis on the 441 reviews mentioned above does support a more cautious version of the conclusion drawn here, especially when considered alongside the specific comments left by reviewers that were mentioned in the results and information in Table S.3. However, I recommend toning the language down to match the strength of the data.

      I have used more cautious language throughout, including a new title. The new text analysis presented in the updated version also supports a more cautious approach.

      Reviewer #4 (Public review):

      Summary:

      This work investigates whether a citation to a referee made by a paper is associated with a more positive evaluation by that referee for that paper. It provides evidence supporting this hypothesis. The work also investigates the role of self citations by referees where the referee would ask authors to cite the referee's paper.

      Strengths:

      This is an important problem: referees for scientific papers must provide their impartial opinions rooted in core scientific principles. Any undue influence due to the role of citations breaks this requirement. This work studies the possible presence and extent of this.

      Barring a few issues discussed below, the methods are solid and well done. The work uses a matched pair design which controls for article-level confounding and further investigates robustness to other potential confounds.

      It is surprising that even in these investigated journals where referee names are public, there is prevalence of such citation-related behaviors.

      Weaknesses:

      Some overall claims are questionable:

      "Reviewers who were cited were more likely to approve the article, but only after version 1" It also appears that referees who were cited were less likely to approve the article in version 1. This null or slightly negative effect undermines the broad claim of citations swaying referees. The paper highlights only the positive results while not including the absence (and even reversal) of the effect in version 1 in its narrative.

      The reversed effect for version 1 is interesting, but the adjusted 99.4% confidence interval includes 1 and hence it’s hard to be confident that this is genuinely in the reverse direction. However, it is certainly far from the strongly positive association for versions 2+.
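
      The reasoning here rests on whether each interval contains the null value of 1 for an odds ratio. As a small illustrative sketch, the check can be written as follows; the odds ratios and adjusted 99.4% intervals are the values quoted in the editorial feedback, while the helper function name is an assumption made for illustration.

```python
def ci_includes_null(lower: float, upper: float, null_value: float = 1.0) -> bool:
    """Return True when the interval [lower, upper] contains the null value."""
    return lower <= null_value <= upper


# (odds ratio, lower bound, upper bound) of the adjusted 99.4% intervals
estimates = {
    "version 1": (0.84, 0.69, 1.03),
    "versions 2+": (1.61, 1.16, 2.23),
}

for label, (odds_ratio, lower, upper) in estimates.items():
    verdict = "includes 1" if ci_includes_null(lower, upper) else "excludes 1"
    print(f"{label}: OR = {odds_ratio}, interval {lower}-{upper} {verdict}")
```

      An interval that includes 1 is compatible with no association in either direction, which is why the version 1 estimate is treated more cautiously than the estimate for later versions.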

      "To the best of our knowledge, this is the first analysis to use a matched design when examining reviewer citations" Does not appear to be a valid claim based on the literature reference [18]

      This previous paper used a matched design but then did not use a matched analysis. Hence, I’ve changed the text in my paper to “first analysis to use a matched design and analysis”. This may seem a minor claim of novelty, but not using a matched analysis for matched data could discard much of the benefit of the matching.
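
      To illustrate why the analysis should respect the matching, here is a simplified sketch for the special case of 1:1 matched pairs with a binary exposure and a binary outcome. In that case the conditional (matched) odds ratio depends only on the discordant pairs, whereas a pooled analysis collapses the pairs into a single 2x2 table. The counts are hypothetical, and this simplified estimator is not necessarily the model used in the article.

```python
def matched_pair_odds_ratio(b: int, c: int) -> float:
    """Conditional odds ratio for 1:1 matched pairs.

    b: pairs where only the exposed member has the outcome
    c: pairs where only the unexposed member has the outcome
    Concordant pairs carry no information and are ignored.
    """
    if c == 0:
        raise ValueError("odds ratio is undefined when c is zero")
    return b / c


def pooled_odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds ratio from the same data when the pairing is ignored.

    a: pairs where both members have the outcome
    d: pairs where neither member has the outcome
    """
    exposed_yes, exposed_no = a + b, c + d
    unexposed_yes, unexposed_no = a + c, b + d
    return (exposed_yes * unexposed_no) / (exposed_no * unexposed_yes)


# Hypothetical pair counts, for illustration only.
a, b, c, d = 40, 30, 15, 15
matched = matched_pair_odds_ratio(b, c)
pooled = pooled_odds_ratio(a, b, c, d)
print(f"matched OR = {matched:.2f}, pooled OR = {pooled:.2f}")
```

      With these toy counts the two estimates differ, which shows how ignoring the pairing can change the answer even though the underlying data are identical.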

      It will be useful to have a control group in the analysis associated to Figure 5 where the control group comprises matched reviews that did not ask for a self citation. This will help demarcate words associated with approval under self citation (as compared to when there is no self citation). The current narrative appears to suggest an association of the use of these words with self citations but without any control.

      Thanks for this useful suggestion. I have added a control group of reviewers who requested citations to articles other than their own. The words requested were very similar to the previous analysis, hence I’ve needed to reinterpret the results from the text analysis as “please” and “need” are not exclusively used by those requesting self-citations. I also fixed a minor error in the text analysis concerning the exclusion of abstracts shorter than 100 characters.

      More discussion on the recommendations will help:

      For the suggestion that "the reviewers initially see a version of the article with all references blinded and no reference list" the paper says "this involves more administrative work and demands more from peer reviewers". I am afraid this can also degrade the quality of peer review, given that the research cannot be contextualized properly by referees. Referees may not revert back to all their thoughts and evaluations when references are released afterwards.

      This is an interesting point, but I don’t think it’s certain that this would happen. For example, revisiting the review may provide a fresh perspective and new ideas; this sometimes happens for me when I review the second version of an article. Ideally an experiment is needed to test this approach, as it is difficult to predict how authors and reviewers will react.

      Recommendations for the Authors:

      Editorial feedback:

      I wonder if the article would benefit from a shorter title, such as the one suggested below. However, please feel free to not change the title if you prefer.

      [i] Are peer reviewers influenced by their work being cited (or not)?

      I like the slightly simpler: “Are peer reviewers influenced by their work being cited?”

      [ii] To better reflect the findings in the article, please revise the abstract along the following lines:

      Peer reviewers for journals sometimes write that one or more of their own articles should have been cited in the article under review. In some cases such comments are justified, but in other cases they are not. Here, using a sample of more than 37000 peer reviews for four journals that use open peer review and make all article versions available, we use a matched study design to explore this and other phenomena related to citations in the peer review process. We find that reviewers who were cited in the article under review were less likely to approve the original version of an article compared with reviewers who were not cited (odds ratio = 0.84; adjusted 99.4% CI: 0.69-1.03), but were more likely to approve a revised article in which they were cited (odds ratio = 1.61; adjusted 99.4% CI: 1.16-2.23). Moreover, for all versions of an article, reviewers who asked for their own articles to be cited were much less likely to approve the article compared with reviewers who did not do this (odds ratio = 0.15; adjusted 99.4% CI: 0.08-0.30). However, reviewers who had asked for their own articles to be cited were much more likely to approve a revised article that cited their own articles compared to a revised article that did not (odds ratio = 3.5; 95% CI: 2.0-6.1).

      I have re-written the abstract along the lines suggested. I have not included the finding that cited reviewers were less likely to approve the article due to the adjusted 99.4% interval including 1.

      [iii] The use of the phrase "self-citation" to describe an author citing an article by one of the reviewers is potentially confusing, and I suggest you avoid this phrase if possible.

      I have removed “self-citation” everywhere and instead used “citations to their own articles”.

      [iv] I think the captions for figures 2, 3 and 4 would benefit from rewording to more clearly describe what is being shown in the figure. Please consider revising the caption for figure 2 as follows, and revising the captions for figures 3 and 4 along similar lines. Please also consider replotting some of the panels so that the values on the horizontal axes of the top panel align with the values on the bottom panel.

      I have aligned the odds and probability axes as suggested which better highlights the important differences. I have updated the figure captions as outlined.

      Figure 2: Odds ratios and probabilities for reviewers giving a more or less favourable recommendation depending on whether they were cited in the article.

      Top left: Odds ratios for reviewers giving a more favourable (Approved) or less favourable (Reservations or Not approved) recommendation depending on whether they were cited in the article. Reviewers who were cited in version 1 of the article (green) were less likely to make a favourable recommendation (odds ratio = 0.84; adjusted 99.4% CI: 0.69-1.03), but they were more likely to make a favourable recommendation (odds ratio = 1.61; adjusted 99.4% CI: 1.16-2.23) if they were cited in a subsequent version (blue). Top right: Same data as top left displayed in terms of probabilities. From the top, the lines show the probability of a reviewer approving: a version 1 article in which they are not cited (please give mean value and CI); a version 1 article in which they are cited (mean value and CI); a version 2 (or higher) article in which they are not cited (mean value and CI); and a version 2 (or higher) article in which they are cited (mean value and CI).

      Bottom left: Same data as top left except that more favourable is now defined as Approved or Reservations, and less favourable is defined as Not approved. Again, reviewers who were cited in version 1 were less likely to make a favourable recommendation (odds ratio = 0.84; adjusted 99.4% CI: 0.57-1.23),and reviewers who were cited in subsequent versions were more likely to make a favourable recommendation (odds ratio = 1.12; adjusted 99.4% CI: 0.59-2.13).

      Bottom right: Same data as bottom left displayed in terms of probabilities. From the top, the lines show the probability of a reviewer approving: a version 1 article in which they are not cited (please give mean value and CI); a version 1 article in which they are cited (mean value and CI); a version 2 (or higher) article in which they are not cited (mean value and CI); and a version 2 (or higher) article in which they are cited (mean value and CI).

      This figure is based on an analysis of [Please state how many articles, reviewers, reviews etc are included in this analysis].

      In all the panels a dot represents a mean, and a horizontal line represents an adjusted 99.4% confidence interval.
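
      The relationship between the odds-ratio panels and the probability panels can be illustrated with a short calculation. This is a minimal sketch, not the analysis code behind the figure: the baseline probability of approval (0.5) is a hypothetical value chosen only for illustration, and the function name is an assumption. Only the odds ratios (0.84 and 1.61) are taken from the caption above.

```python
def odds_ratio_to_probability(baseline_prob: float, odds_ratio: float) -> float:
    """Probability implied by applying an odds ratio to a baseline probability."""
    if not 0.0 < baseline_prob < 1.0:
        raise ValueError("baseline_prob must lie strictly between 0 and 1")
    if odds_ratio <= 0.0:
        raise ValueError("odds_ratio must be positive")
    baseline_odds = baseline_prob / (1.0 - baseline_prob)  # p / (1 - p)
    new_odds = baseline_odds * odds_ratio                  # scale the odds
    return new_odds / (1.0 + new_odds)                     # back to a probability


# Hypothetical baseline: probability of "Approved" when the reviewer is not cited.
HYPOTHETICAL_BASELINE = 0.5

# Odds ratios quoted in the Figure 2 caption (top left panel).
for label, odds_ratio in [("cited in version 1", 0.84),
                          ("cited in version 2 or higher", 1.61)]:
    prob = odds_ratio_to_probability(HYPOTHETICAL_BASELINE, odds_ratio)
    print(f"{label}: odds ratio {odds_ratio:.2f} -> probability {prob:.3f}")
```

      Because the mapping from odds to probability is monotonic, the same function can be applied to the lower and upper confidence limits, but the resulting probabilities depend on the baseline that is assumed.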

      Reviewer #1 (Recommendations for the Authors):

      A big recommendation to the author would be to consider putting a lot of the statistical analysis in an appendix and describing the methods and results in more accessible terms in the main text. This would help more readers see the baby through the bath water

      I have moved four paragraphs from the methods to the supplement. These are the sample size, the two sensitivity analyses on including co-reviewers and confounding by reviewers’ characteristics, and the analysis examining potential bias for the reviewers with no OpenAlex record.

      One possibility, that may have been accounted for, but it is hard to say given the density of the analysis, is the possibility that an author who follows the recommendations to cite the reviewer has also followed all the other reviewer requests. This could account for the much higher likelihood of acceptance. Conversely an author who has rejected the request to cite the reviewer may be more likely to have rejected many of the other suggestions leading to a rejection. I couldn't discern whether the analysis had accounted for this possibility. If it has, it needs to be said more prominently; if it hasn't, this possibility at least needs to be discussed. It would be good to see other alternative explanations for the results discussed (and if possible dismissed) in the discussion section too.

      This is an interesting idea. It’s also possible that authors more often accept and include any citation requests as it gives them more license to push back on other more involved changes that they would prefer not to make, e.g., running a new analysis. To examine this would require an analysis of the authors’ responses to the reviewers, and I have now added this as a limitation.

      I hope this paper will have an impact on scientific publishing but I fear that it won't. This is no reflection on the paper but more a reflection on the science publishing system.

      I do not have any additional references (written by myself or others!) I would like the author to include

      Thanks. I appreciate that extra thought is needed when peer reviewing papers on peer review. I do not know the reviewers’ names! I have added one additional reference suggested by the reviewers which had relevant results on previous surveys of coercive citations for the section on “Related research”.

      Reviewer #2 (Recommendations for the Authors):

      (1) Would it be possible for the author to control for academic discipline? Some disciplines cite at different rates and have different citation sub-cultures; for example, Wilhite and Fong (2012) show that editorial coercive citation differs among the social science and business disciplines. Is it possible that reviewers from different disciplines just take a totally different view of requesting self-citations?

      Wilhite, A.W., & Fong, E.A. 2012. Coercive citation in academic publishing. Science, 335: 542-543.

      This is an interesting idea, but the number of disciplines would need to be relatively broad to keep a sufficient sample size. The Catch-22 is then whether broad disciplines are different enough to show cultural differences. Overall, this is an idea for future work.

      (2) I would like the author to be much more clear about their results in the discussion section. In line 214, they state that "Reviewers who requested a self-citation were much less likely to approve the article for all versions." Maybe in the discussion some language along the lines of "Although reviewers who requested self-citation were actually much less likely to approve an article, my more detailed analyses show that this was not the case when reviewers requested a self-citation without reason or with the inclusion of coercive language such as 'need' or 'please'." Again, word it as you like, but I think it should be made clear that requests for self-citation alone are not a problem. In fact, I would argue that what the author says in lines 250 to 255 in the discussion reflects that reviewers who request self-citations (maybe for good reasons) are more likely to be the real experts in the area and why those who did not request a self-cite did not notice the omission. It is my understanding that editors are trying to get warm bodies to review and thus reviewers are not all equally qualified. Could it be that requesting self-citations for a good reason is a proxy for someone who actually knows the literature better? I'm not saying this is a fact, but it is a possibility. I get this is said in the abstract, but worth fleshing out in the discussion.

      I have updated the discussion after a new text analysis and have addressed this important question of whether self-citations are different from citations to other articles. The idea that some self-citers are more aware of the relevant literature is interesting, although this is very hard to test because they could also just be more aware of their own work. The question of whether self-citations are justified is a key question and one that I’ve tried to address in an updated discussion.

      Reviewer #3 (Recommendations for the Authors):

      Data and code availability are in good shape. At a high level, I recommend:

      Toning down the interpretation of reviewers' motivation, especially since some of this is mitigated by findings presented in the paper.

      I have reworded the discussion and included a warning on the observational study design.

      Devote more time detailing exactly what data are being presented in each figure/table and results section as described in more detail in the main review (n, selection criteria, conditional subsetting, etc.).

      I agree and have provided more details in each figure legend.

      Reviewer #4 (Recommendations for the Authors):

      A few aspects of the paper are not clear:

      I did not follow Figure 4. Are the "self citation" labels supposed to be "citation to other research"?

      Thanks for picking up this error which has now been fixed.

      I did not understand how to parse the left column of Figure 2

      As per the editor’s suggestion, the figure legend has been updated.

      Table 3: Please use different markers for the different curves so that it is clearly demarcated even in grayscale print

      I presume you meant Figure 3 not Table 3. I’ve varied the symbols in all three odds ratio plots.

      Supplementary S3: Typo "Approvep"

      Fixed, thanks.

      OTHER CHANGES: As well as the four reviews, my paper was reviewed by an AI-reviewer which provided some useful suggestions. I have mentioned this review in the acknowledgements. I have reversed the order of figure 5 to show the probability of “Approved” as this is simpler to interpret.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #3 (Public review):

      The authors have satisfactorily addressed my inquiries. However, I had to look quite hard to find where they responded to my final comment regarding the potential role of Arpc2 post-fusion during myofiber growth and/or maintenance, which I eventually located on page 7. I would appreciate it if the authors could state this point more explicitly, perhaps by adding a sentence such as "However, we cannot rule out the possibility that Arpc2 may also play a role in....." to improve clarity of communication. 

      While I understood from the original version that this issue falls beyond the immediate scope of the study, I believe it is important to adopt a more cautious and rigorous interpretative framework, especially given the widespread use of this experimental approach. In particular, when a gene could potentially have additional roles in myofibers, it may be helpful to explicitly acknowledge that possibility. Even if Arpc2 may not necessarily be one of them, such roles cannot be fully excluded without direct testing.  

      We appreciate the reviewer’s comments and have included several sentences at the end of the “Branched actin polymerization is required for SCM fusion” section to address this question:

      “The severe myoblast fusion defects observed in early stages of regeneration (e.g. dpi 4.5) provide a good explanation for the presence of thin muscle fibers in ArpC2 cKO mice at dpi 14 (Fig. 2B and 2C) and dpi 28 (Fig. S4A and S4B). These thin muscle fibers could be either elongated mononucleated muscle cells or multinucleated myofibers each containing a small number of nuclei due to occasional fusion events (comparable to those in Myomixer cKO muscles) (Fig. 2B and 2C; Fig. S4A and S4B). Whether Arp2/3 and branched actin polymerization play a role in the growth and/or maintenance of post-fusion multinucleated myofibers requires future loss-of-function studies in which ArpC2 cKO is generated using a myofiber-specific cre driver.”

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      The revised manuscript addresses several reviewer concerns, and the study continues to provide useful insights into how ZIP10 regulates zinc homeostasis and zinc sparks during fertilization in mice. The authors have improved the clarity of the figures, shifted emphasis in the abstract more clearly to ZIP10, and added brief discussion of ZIP6/ZIP10 interactions and ZIP10's role in zinc spark-calcium oscillation decoupling. However, some critical issues remain only partially addressed. 

      Thank you for your valuable inputs. We plan to address the issues that could not be clarified in this report going forward.

      (1) Oocyte health confound: The use of Gdf9-Cre deletes ZIP10 during oocyte growth, meaning observed defects could result from earlier disruptions in zinc signaling rather than solely from the absence of zinc sparks at fertilization. The authors acknowledge this and propose transcriptome profiling as a future direction. However, since mRNA levels often do not accurately reflect protein levels and activity in oocytes, transcriptomics may not be particularly informative in this context. Proteomic approaches that directly assess the molecular effects of ZIP10 loss seem more promising. Although current sensitivity limitations make proteomics from small oocyte samples challenging, ongoing improvements in this area may soon allow for more detailed mechanistic insights.

      Thank you for your suggestions. We will keep that in mind for the future.

      (2) ZIP6 context and focus: The authors clarified the abstract to emphasize ZIP10, enhancing narrative clarity. This revision is appropriate and appreciated. 

      Thanks to your feedback, my paper has improved. Thank you for your evaluation.

      (3) Follicular development effects: The biological consequences of ZIP6 and ZIP10 knockout during folliculogenesis are still unknown. The authors now say these effects will be studied in the future, but this still leaves a major mechanistic gap unaddressed in the current version. 

      As you mentioned, we have not been able to clarify the effects of ZIP6 and ZIP10 knockout on follicle formation. The effects of ZIP6 and ZIP10 knockout on follicle formation will be discussed in the future.

      (4) Zinc spark imaging and probe limitations: The addition of calcium imaging enhances the clarity of Figure 3. However, zinc fluorescence remains inadequate, and the authors depend solely on FluoZin-3AM, a dye known for artifacts and limited ability to detect subcellular labile zinc. The suggestion that C57BL/6J mice may differ from CD1 in vesicle appearance is plausible but does not fully address concerns about probe specificity and resolution. As the authors acknowledge, future studies with more selective probes would increase confidence in both the spatial and quantitative analysis of zinc dynamics. 

      Thank you for your comment. Moving forward, we plan to conduct spatial and quantitative analyses of zinc dynamics using various other zinc probes.

      (5) Mechanistic insight remains limited: The revised discussion now recognizes the lack of detailed mechanistic understanding but does not significantly expand on potential signaling pathways or downstream targets of ZIP10. The descriptive data are useful, but the inability to pinpoint how ZIP10 mediates zinc spark regulation remains a key limitation. Again, proteomic profiling would probably be more informative than transcriptomic analysis for identifying ZIP10-dependent pathways once technical barriers to low-input proteomics are overcome. 

      Thank you for your helpful advice. I'll use it as a reference for future analysis.

      Future studies should assess the transcriptomic or proteomic profile of Zip10<sup>d/d</sup> mouse oocytes (P.11 Line 349-350).

      Overall, the authors have reasonably revised and clarified key points raised by reviewers, and the manuscript now reads more clearly. However, the main limitation, lack of mechanistic insight and the inability to distinguish between developmental and fertilization-stage roles of ZIP10, remains unresolved. These should be explicitly acknowledged when framing the conclusions.

      We have added the two limitations you pointed out to the conclusion section of the main text.

      However, the role of ZIP6 remained uncertain. Additionally, the absence of mechanistic insight for zinc spark and the inability to distinguish between the developmental and fertilization stage roles of ZIP10 remain unresolved. These challenges necessitate further investigation (P.11-12 Line 354-357).

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Recommendations for the authors):

      I think the authors did a fantastic job investigating the annotation issues I brought up in the first round. I am somewhat assured that the size of the dataset has prevented any real systematic issues from impacting their results. However, there are many clear underlying biases in the data, as the authors show, which could have a number of unexpected impacts on the results. For example, the consistently lower gene numbers could be biased towards certain types of genes or in certain lineages, making the CAZyme analysis unreliable. I do not agree with the authors' choice to put these results in as a supplement with little or no other references to it in the main manuscript. Many of the conclusions that are drawn should be hedged by these findings. There should at least be a rationale given for why the authors took the approach they did, such as mentioning the points they brought up in the response.

      We thank the reviewer for the positive assessment of our revision. We added text in the Discussion acknowledging limitations of the gene annotation approach. 

      “Because of the uniform yet simplified gene annotation approach, the total number of genes may be underestimated in some assemblies in our dataset, as observed when comparing the same species in JGI Mycocosm. Although this pattern is not biased toward any particular group of species, access to high-quality, well-annotated genomes could provide a clearer picture of the relative contributions of specific gene families.”

      We also added more text in the Methods (section "Sordariomycetes genomes") mentioning in more detail the investigation of potential biases related to assembly quality and annotation (with reference to Supplementary Results).

      A couple minor corrections:

      Figure 1C, both axes say PC1?

      Fixed.

      Figure S12, scales don't match so it's hard to compare, axis labels are inconsistent.

      Fixed.

      Reviewer #2 (Recommendations for the authors):

      I congratulate the authors on the revision work. Their manuscript is very interesting and reads very well.

      I found several occurrences of « saprophyte ». Note that « saprotroph » is much better since fungi are not « phytes ».

      We thank the reviewer for the positive feedback. The occurrences of “saprophytes” were corrected.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      Sha K et al aimed at identifying the mechanism of response and resistance to castration in the Pten knockout GEM model. They found elevated levels of TNF overexpressed in castrated tumors associated with an expansion of basal-like stem cells during recurrence, which they show occurring in prostate cancer cells in culture upon enzalutamide treatment. Further, the authors carry out a time-dependent analysis of the role of TNF in regression and recurrence to show that TNF regulates both processes. Similarly, CCL2, which the authors had proposed as a chemokine secreted upon TNF induction following enzalutamide treatment, is also shown to be elevated during recurrence and associated with the remodeling of an immunosuppressive microenvironment through depletion of T cells and recruitment of TAMs.

      Strengths:

      The paper exploits a well-established GEM model to interrogate mechanisms of response to standard-of-care treatment. This is of utmost importance since prostate cancer recurrence after ADT or ARSi marks the onset of an incurable disease stage for which limited treatments exist. The work is relevant in the confirmation that recurrent prostate cancer is mostly an immunologically "cold" tumor with an immunosuppressive immune microenvironment

      Weaknesses:

      While the data is consistent and the conclusions are mostly supported and justified, the findings overall are incremental and of limited novelty. The role of TNF and NF-kB signaling in tumor progression and the role of the CCL2-CCR2 in shaping the immunosuppressive microenvironment are well established.

      We contend there is novelty in: the experimental design; our finding of a TNF signaling ‘switch’ and the role of androgen-deprivation induced immunosuppression.    

      On the other hand, it is unclear why the authors decided to focus on the basal compartment when there is a wealth of literature suggesting that luminal cells are, if not exclusively, surely one of the cells of origin of prostate cancer and responsible for recurrence upon antiandrogen treatment. As a result, most of the later shown data has to be taken with caution as it is not known if the same phenomena occur in the luminal compartment.

      While we appreciate the reviewer’s interest in the cancer stem cell biology occurring in the tumor in response to androgen deprivation, our focus in this report is identifying mechanisms that account for a switch in TNF signaling.  Specifically, our previous studies showed a rapid increase in TNF mRNA following castration (in the normal murine prostate) but in the current report we also observe an increase in TNF at late times post-castration (in a murine prostate cancer model).  We propose that the increase in TNF at late times is due to plasticity (increased stemness) in the tumor cell population, rather than - for example - a change in signal-driven TNF mRNA transcription.  While a possible mechanism is expansion of a recurrent tumor stem-cell population, a careful investigation is beyond the scope of this report.  Therefore, in the revised manuscript, we have altered the text in multiple places to indicate a suggestive, rather than definitive, role for tumor stem cells.  Indeed, we did include caveats regarding the role of tumor stem cells in the original discussion (lines 425-429 in the revised manuscript), and this is now made more explicit in the revised manuscript.   

      Reviewer #2 (Public Review):

      Summary:

      In this study, Sha and Zhang et al. reported that androgen deprivation therapy (ADT) induces a switch to a basal-stemness status, driven by the TNF-CCL2-CCR2 axis. Their results also reveal that enhanced CCL2 coincides with increased macrophages and decreased CD8 T cells, suggesting that ADT resistance may be related to the TNF/CCL2/CCR2-dependent immunosuppressive tumor microenvironment (TME). Overall, this is a very interesting study with a significant amount of data.

      Strengths:

      The strengths of the study include various clinically relevant models, cutting-edge technology (such as single-cell RNA-seq), translational potential (TNF and CCR2 inhibitors), and novel insights connecting stemness lineage switch to an immunosuppressive TME. Thus, I believe this work would be of significant interest to the field of prostate cancer and journal readership.

      Weaknesses:

      (1) One of the key conclusions/findings of this study is the ADT-induced basal-stemness lineage switch driving ADT resistance. However, most of the presented evidence supporting this conclusion only selects a couple of marker genes. What exacerbates this issue is that different basal-stemness markers were often selected with different results. For example, Figure S1A uses CD166/EZH2 as markers, while Figure S1B uses ITGb1/EZH2. In contrast, Figure 1D uses Sca1/CD49, and Figure 2B-C uses CD49/CD166. Since many basal-stemness lineage gene signatures have been previously established, the study should examine various basal-stemness gene signatures rather than a couple of selected markers. Moreover, why were none of the stemness/basal-gene signatures significantly changed in the GO enrichment analysis in Figure 6A/B?

      Mice and human cells express similar but also partially distinct prostate stem cell markers.  For example, Sca1 is predominantly used as a stem cell marker in mice but not in human prostate epithelial cells.  CD166 and CD49f are expressed in both human and murine prostate epithelium and therefore we used these in both sets of studies.  Also see the response to R1-2.

      (2) A related weakness is the lack of functional results supporting the stemness lineage switch. Although the authors present colony formation assay results, these could be influenced simply by promoted cell proliferation, which is not a convincing indicator of stemness. To support this key conclusion, widely accepted stemness assays, such as the prostasphere formation assay (in vitro) and Extreme Limiting Dilution Analysis (ELDA) xenograft assay (in vivo), should be carried out.

      See the response to R1-2 and R2-1, above.

      (3) Another significant concern is that this study uses concurrency to demonstrate a causal relationship in many key results, which is entirely different. For example, Figure S4A and S4B only show increased CCL2 and TNF secretion simultaneously, which cannot support that CCL2 is dependent on TNF. Similarly, Figure 5A only shows that CCL2 increased coincidently with a rise in TNF, which cannot support a causal relationship. To support the causal relationship of this conclusion, it is necessary to show that TNF-KO/KD would abolish the increased CCL2 secretion.

      Regarding Fig. S4A and S4B: We previously demonstrated (Sha et al, 2015; reference 10) that CCL2 secretion is dependent on TNF, in the same cell lines.  We have added additional data (new Fig. S4B) in this report to confirm this dependency.  

      Regarding Fig 5: In Fig 5B we demonstrated that the increase in CCL2-staining cells in recurrent tumors from castrated animals (the equivalent of human CRPC in our model) was significantly inhibited in animals receiving etanercept, demonstrating TNF dependency for CCL2 in this context.  

      While the use of TNF KO cell lines and animals could provide additional insights, the creation of such cell lines and tumor models is arduous.  Moreover, we previously demonstrated that administration of anti-TNF drugs such as etanercept is as effective as the KO phenotypes (Davis et al 2011; ref. 11).  

      (4) Some of the selective data presentations are not explained and are difficult to understand. For example, why does CD49 staining in Figure S3A have data for all four time points, while CD166 in Figure S3D only has data for the last time point (day 21)? Similarly, although several TNF_UP gene signatures were highlighted in Figure 4B, several TNF_DN signatures were also enriched in the same table, such as RUAN_RESPONSE_TO_TNF_DN. What is the explanation for these contrasting results?

      Regarding Fig. S3A and S3D: The cell-staining studies in Fig. S3 are confirmatory of the FACS studies in Figs. 2 and 3.  We were not able to stain all of the CD166 time-points for technical reasons (difficulty optimizing the automated staining protocol) but we were able to successfully stain key late time-points, so we have included this data in the supplementary figure.  There was no attempt to selectively present data; this was just a practical limitation of the time and funds that we could devote to confirmatory studies.   

      Regarding Fig 4B: The highlighting identifies a common (i.e., identical) group of gene sets in the two GSEA analyses, demonstrating that these very same gene sets are all up-regulated in one instance, and down-regulated in the other.  The ‘TNF DN’ genes were not identical in the two GSEA analyses and so we cannot draw any conclusions about these.  Note that we are scoring the TNF-related gene sets with the 10 largest (positive or negative) normalized enrichment scores (NES), and are not relying on DN or UP designations in the gene set name (identifier).  In this analysis up- and down-regulation refers to the sign and magnitude of the NES, not the gene set names.  
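
      The selection logic described above can be sketched as follows. This is an illustration under stated assumptions, not the code used for Fig 4B: the function names, the gene set names, and the NES values are all hypothetical, and the two dictionaries simply stand in for the two GSEA analyses.

```python
def top_by_abs_nes(nes_by_set: dict[str, float], n: int = 10) -> set[str]:
    """Names of the n gene sets with the largest absolute NES."""
    ranked = sorted(nes_by_set, key=lambda name: abs(nes_by_set[name]), reverse=True)
    return set(ranked[:n])


def shared_sets_with_direction(first: dict[str, float],
                               second: dict[str, float],
                               n: int = 10) -> dict[str, tuple[str, str]]:
    """Gene sets in the top n of both analyses, with direction taken from the NES sign."""
    def direction(nes: float) -> str:
        return "up" if nes > 0 else "down"

    shared = top_by_abs_nes(first, n) & top_by_abs_nes(second, n)
    return {name: (direction(first[name]), direction(second[name]))
            for name in sorted(shared)}


# Hypothetical NES values for two analyses (names and numbers are made up).
analysis_one = {"SET_A_UP": -2.1, "SET_B_UP": -1.8, "SET_C_DN": 1.2, "SET_D": 0.4}
analysis_two = {"SET_A_UP": 2.3, "SET_B_UP": 1.9, "SET_E_DN": -1.5, "SET_D": 0.3}

# n=3 here only because the toy input is small; the response above uses the top 10.
print(shared_sets_with_direction(analysis_one, analysis_two, n=3))
```

      In this toy input the two "_UP" sets are shared and change sign between the analyses, whereas the "_DN" sets differ between the analyses and therefore drop out of the comparison; the direction is read from the NES sign and never from the suffix in the gene set name.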

      Reviewer #3 (Public Review):

      Summary:

      The current manuscript evaluates the role of TNF in promoting AR targeted therapy regression and subsequent resistance through CCL2 and TAMs. The current evidence supports a correlative role for TNF in promoting cancer cell progression following AR inhibition. Weaknesses include a lack of descriptive methodology of the pre-clinical GEM model experiments and it is not well defined which cell types are impacted in this pre-clinical model which will be quite heterogenous with regards to cancer, normal, and microenvironment cells.

      Strengths:

      (1) Appropriate use of pre-clinical models and GEM models to address the scientific questions.

      (2) Novel finding of TNF and interplay of TAMs in promoting cancer cell progression following AR inhibition.

      (3) Potential for developing novel therapeutic strategies to overcome resistance to AR blockade.

      Weaknesses:

      (1) There is a lack of description regarding the GEM model experiments - the age at which mice experiments are started.

      Table S1 in the supplementary data summarizes the salient characteristics of the GEM models.  Note that as described in the M&M, we selected animals for experimental groups based on the tumor volume (determined by HFUS) and not based on the age of the mouse, since there is some variability in the kinetics of tumor growth in genetically identical mice, as shown by our HFUS observations of hundreds of mice harboring the genetic changes (PTEN loss, MYC gain) in the models we have studied most extensively.  Although admittedly an imperfect criterion, we reasoned that tumor volume would be the best surrogate criterion for tumor biology.  

      (2) Tumor volume measurements are provided but in this context, there is no discussion on how the mixed cancer and normal epithelial and microenvironment is impacted by AR therapy which could lead to the subtle changes in tumor volume.

      The reviewer’s criticism is well-founded - most of our studies involved bulk analysis, which makes it difficult to probe the cellular interactions within the TME.  Future studies - beyond the scope of this report - using single cell technical approaches - are needed to investigate these subtle changes.  We have added a statement to this effect to the manuscript (lines 464-468).

      (3) There are no readouts for target inhibition across the therapeutic pre-clinical trials or dosing time courses.

      The reviewer’s criticism is well-founded, since we cannot be 100% certain of drug delivery in the TNF and CCL2 blockade experiments.  Two points in this regard.  First, with the assistance of institutional veterinarian staff, we have had good success in training multiple scientists (PhD student, technicians) to deliver both biological and small molecule drugs i.p.  Second, the observation that the drugs did ‘work’ in most animals in well-defined experimental protocols strongly suggests that the delivery methodology is reliable.  If sporadic delivery failures do occur, this would tend to underestimate the magnitude of the ‘positive’ (i.e., blocking) effects rather than leading to false negatives.   

      (4) The terminology of regression and resistance appears arbitrary. The data seems to demonstrate a persistence of significant disease that progresses, rather than a robust response with minimal residual disease that recurs within the primary tumor.

      We explain our rationale for the criteria defining regression and recurrence in the M&M and in the legend to Table S2.  In the revised version of the manuscript, we now explicitly reference these descriptions in the relevant RESULTS section (lines 222-223).  Note that we use the term ‘recurrence’ rather than ‘resistance’ as the former does not necessarily imply a particular biological mechanism.  

      (5) It is unclear if the increase in basal-like stem cells is from normal basal cells or cancer cells with a basal stem-like property.

      See the response to R1-2 and R2-1.

      (6) In the Hi-MYC model, MYC expression is regulated by AR inhibition and is profoundly ARi responsive at early time points.

      We agree that this is the likely mechanism of castration-induced regression (so-called ‘MYC addiction’) but it is unclear what the reviewer’s concern is vis-a-vis our manuscript.  

      Reviewer #4 (Public Review):

      In this manuscript by Sha et al. the authors test the role of TNFα in modulating tumor regression/recurrence under therapeutic pressure from castration (or enzalutamide) in both in vitro and in vivo models of prostate cancer. Using the PTEN-null genetic mouse model, they compare the effect of a TNFα ligand trap, etanercept, at various points pre- and post-castration. Their most interesting findings from this experiment were that etanercept given 3 days prior to castration prevented tumor regression, which is a common phenotype seen in these models after castration, but etanercept given 1 day prior to castration prevented prostate cancer recurrence after castration. They go on to perform RNA sequencing on tumors isolated from either sham or castrate mice from two time points post-castration to study acute and delayed transcriptional responses to androgen deprivation. They found enrichment of gene sets containing TNF-targets which initially decrease post-castration but are elevated by 35 days, the time at which tumors recur. The authors conduct a similar set of experiments using human prostate cancer cell lines treated with the androgen receptor inhibitor enzalutamide and observe that drug treatment leads to cells with basal stem-like features that express high levels of TNF. They noticed that CCL2 levels correlate with changes in TNF levels raising the possibility that CCL2 might be a critical downstream effector for disease recurrence. To this end, they treated PTEN-null and hi-MYC castrated mice with a CCR2-antagonist (CCR2a) because CCR2 is one receptor of CCL2 and monitored tumor growth dynamics. Interestingly, upon treatment with CCR2a, tumors did not recur according to their measurements. They go on to demonstrate that the tumors pre-treated with CCR2a had reduced levels of putative TAMs and increased CTLs in the context of TNF or CCR2 inhibition providing a cellular context associated with disease regression. Lastly, they perform single-cell RNA sequencing to further characterize the tumor microenvironment post-castration and report that the ratio of CTLs to TAMs is lower in a recurrent tumor.

      While the concepts behind the study have merit, the data are incomplete and do not fully support the authors' conclusions. The authors' definition of recurrence is subjective given that the amount of disease regression after castration is both variable (Figure 8) and relatively limited

      See the response to R3-4, above.

      particularly in the PTEN loss model. Critical controls are missing. For example, both drug experiments were completed without treating non-castrate plus drug controls

      In these experiments, we are investigating the effect of anti-TNF or anti-CCL2 therapy on the response to castration. The appropriate controls are castrated mice that received vehicle or no treatment. Measuring the response of intact animals (with tumors still increasing in size) is not only irrelevant to the question we are asking but also impractical, as the tumors would become too large for mouse viability.

      which raises the question of how specific these findings are to castration resistance. No validation was performed to ensure that either the TNF ligand trap or the CCR2 antagonist was acting on target.

      See the response to R3-3, above.

      The single-cell sequencing experiments were done without replicates which raises concern about its interpretation. 

      The goal in these experiments is to address a relatively narrow question concerning changes in a few key TAM-associated transcripts versus changes in a few CTL-associated transcripts. This is not meant to provide the rigorous single-cell transcriptomic analysis that would be required, for example, to definitively assess the levels of various cell populations. As noted in R3-2 (and in the Discussion, lines 467-468), further single-cell analysis is ongoing but beyond the scope of this manuscript.

      At a conceptual level, the authors say that a major cause of disease recurrence is the immunosuppressive TME, but provide little functional data that macrophages and T cells are directly responsible for this phenotype.

      The requirement for CCL2-CCR2 signaling for recurrence suggests that TAMs drive recurrence, presumably due to immunosuppression in the TME.  However, CCR2 is expressed by other cell types.  Therefore, in future studies we will need to examine the response to additional inhibitors and also employ single cell ‘omics to more thoroughly characterize the changes in the cellular components of the tumor immune microenvironment.  Functional analysis of T-cell subsets is an even more formidable experimental challenge.  

      Statistical analyses were performed on only select experiments. 

      See the response to R1-3, below.

      In summary, further work is recommended to support the conclusions of this story.

      Reviewer #1 (Recommendations For The Authors):

      I suggest the authors address the following:

      (1) Throughout the figures, statistical analysis needs to be made clear, including n numbers, replicates, and whether or not the differences shown are statistically significant. These include Figure 1C and D; Figure 2A and B; Figure 3A; Figure 4A; Figure 5A, C and D; Figure 7B.

      We thank the reviewer for identifying these issues and we have inserted statistical analyses into the text as follows: 

      Figure 1C-D: Statistical analysis added to the legend of Fig. 1.  

      Figure 2A: Statistical analysis added to the legend of Fig. 2.

      Figure 2B: These are representative FACS scatter plots; the corresponding statistical analysis is shown in Fig. 2C (left panel).

      Figure 3A: Statistical comparisons are not relevant to this figure – the data is presented to document the cell sorting enrichment process.

      Figure 4A and Figure 5C-D:  For the small n, categorical data sets related to the studies using GEM prostate cancer models shown in Figures 4A, 5C and 5D, we employed the exact binomial test to determine the Clopper-Pearson confidence interval for the proportion and Fisher’s exact test to determine the p-values and now present these analyses in a new Supplementary Table 3.  We have included this information in the M&M section and edited the Figure legends to direct the reader to the new Supplementary Table.  

      We would like to emphasize that the reported p-values are exact probabilities from Fisher’s exact test. Given the small sample sizes and the discrete nature of the distribution, these values should not be interpreted as if they strictly conform to conventional thresholds such as p<0.05. Instead, they represent the exact probability of observing data as extreme as (or more extreme than) what we obtained under the null hypothesis.
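      As a rough illustration of the two procedures named above, the following sketch computes a two-sided Fisher's exact p-value for a 2x2 table and a Clopper-Pearson interval for a single proportion, using only the Python standard library. The counts (4 of 5 versus 0 of 6) are illustrative placeholders and are not the values reported in Supplementary Table 3; the function names are hypothetical, and the two-sided rule used here (summing all tables no more probable than the observed one) is one common convention that may differ from the software the authors used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Convention (an assumption): sum the hypergeometric probabilities of all
    tables with the same margins that are no more probable than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    x_min, x_max = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(x_min, x_max + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for k successes in n trials."""

    def solve(f, target):
        # f is monotonically decreasing in p on [0, 1]; bisect for f(p) = target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= k | p) = alpha/2, i.e. CDF(k - 1) = 1 - alpha/2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: P(X <= k | p) = alpha/2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


# Illustrative counts only: 4 of 5 events in one group, 0 of 6 in the other.
p_value = fisher_exact_two_sided(4, 1, 0, 6)
ci_group_b = clopper_pearson(0, 6)
print(p_value, ci_group_b)
```

      With these placeholder counts, only the observed table is as improbable as itself, so the p-value is exactly 5/330 (about 0.015). This shows the discreteness noted above: with such small groups only a handful of p-values are attainable at all.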

      Figure 5A: The legend of Fig. 5A was edited to clarify the statistical analysis.  

      Figure 7B: The differences in CD8+ T cells and F4/80 macrophages due to CCR2a-35d treatment were not statistically significant (p>0.05); we have now stated this explicitly in the figure legend.

      (2) Several experiments either lack appropriate controls or the choice of data presentation is confusing. In Figure 4A vehicle controls should 

      We have not observed any effect of IP administration of vehicle in any experiments across multiple published studies employing these GEMMs, and so we conclude that the injection of vehicle is very unlikely to modify the outcome of these experiments.

      be included in the graphs and, for ease of interpretation, perhaps average tumor growth should be shown, with individual tumor growth shown in the supplement. In Figure 5 the vehicle control is missing, and in Figure 5D 4 out of 5 CX+vehicle tumors are said to have recurred but the trend line in the graph shows otherwise.

      We thank the reviewer for noting this issue - the color designations were inadvertently reversed in the legend text.  This error has been corrected in the revised version of the manuscript.  

      In Figure 8B flow cytometry would actually be more convincing than scRNAseq. If scRNAseq is chosen, a higher quality UMAP or t-SNE plot is needed with a broader color palette.

      We did consider the FACS approach suggested by the reviewer, but decided against it as we could not readily identify and validate a TAM-specific antibody to allow such measurements. 

      Reviewer #3 (Recommendations For The Authors):

      (1)  A clear description of the GEM model experiments will be helpful in interpreting the data as it is unclear what age the PTEN or MYC mice were when therapy was started. PTEN are generally intrinsically resistant to ARi whereas MYC are robustly sensitive.

      (2) Prostate organoid technology of the GEM prostate cell, and normal prostate cells may allow for a better evaluation of which basal stem-like cells are expressing TNF - dissecting out normal basal from cancer with basal-like properties.

      (3) Experiments to demonstrate targeting inhibition should be performed for AR and TNF inhibition. Especially across the spectrum of TNF blockade timing given the differences in proposed responsiveness over an acute change in dosing schedule.

      (4) Detailed histology and pathologic evaluation should be provided to characterize the impact on cancer and TME as well as normal prostate mixed in these tumors.

      (5) Prostate organoid development with genetic manipulation (PTEN ko) and transplant back into immunocompetent mice may provide experiments to prove causality and address the impact on the immune microenvironment.

      (6) The descriptions of regression and recurrence need to be defined: based on the kinetics and the presented data, these seem to be associated with minimal responsiveness and progression from a substantial volume of persistent cells.

      (7) The authors should also explore the impact of TNF inhibition on the cancer cell directly and evaluate downstream PI3K signaling.

      Responding to this set of recommendations:  A number of these recommendations (R3-7, -9, -12) are similar or identical to those already noted in Reviewer 3’s public review and have been addressed above.  The remaining recommendations (R3-8, -10, -11; organoids, histological approaches to the TME, etc.) are potentially interesting experimental approaches but beyond the scope of the current manuscript.  

      Reviewer #4 (Recommendations For The Authors):

      Major comments:

      (1) Figure 1A-B: While the decrease in tumor growth post-castration is apparent, the increase in tumor growth that has been designated as the point of androgen-independence is a mild increase from the 28 measurements and would benefit from statistical support. Further time points demonstrating that the tumors continue to increase in size would better support the claim that these tumors appropriately model disease recurrence.

      This data meets our criteria for recurrence (outlined in the M&M and in the legend to Table S2).

      (2) Figure 2A: Statistical analysis should be performed and why is this figure shown twice (also in the S2A right panel)?

      We added statistical analysis to the legend of Fig. 2A.  The data from Fig 2 (C4-2 cell line) is replicated in Supplementary Fig S2 to allow the reader to directly compare the response of the C4-2 cell line with the response of the LNCaP cell line.   

      (3) Figure 4A: Non-castrate + etan control is needed here. Also, the data should be statistically assessed.

      Regarding non-castrate controls, see our response to R4-2.  Statistical analysis has been added - see Supplementary Table S3.   

      (4) It appears that at least two of the mice shown in Figure 5C have the same level of disease recurrence as was demonstrated in Figure 1B, yet the analysis defines recurrence in 0/6 mice.

      Again, similar to R4-7, none of the mice in Figure 5C meet our criteria for recurrence (outlined in the M&M and in the legend to Table S2).

      (5) The text for Figure 5D states that vehicle-treated tumors (red) regress then recur while mice pre-treated with a CCR2 antagonist (blue) don't recur, but in the figure, these groups appear to be reversed. In addition, it would be good to have noncastrate + CCR2a control for Figure 5C and 5D.

      We corrected the labeling error in the legend to Figure 5.

      (6) It would be good to validate major RNAseq findings using orthogonal approaches.

      We agree that it is valuable to validate our findings, but these experiments are beyond the scope of the manuscript.

      (7) Figure 7B is quite puzzling. It appears to show the opposite of what was written.

      We thank the reviewer for bringing this error to our attention. Our internal review of previous versions of the manuscript showed that the corresponding author (JJK) inadvertently mis-edited this figure when preparing the bioRxiv submission. Figure 7B has been corrected and now aligns with the Results text. We have also appended a PDF documenting the editing error.

      (8) Figure 8: This experiment appears to have been done without replicates making the current interpretation questionable.

      A more detailed scRNAseq analysis of the GEMM response to castration (with replicates) is already underway. The analysis in Fig. 8 includes thousands of cells, capturing the variation in mRNA levels. However, it does not capture animal-to-animal variation. Given the supporting role of this data in this manuscript, we believe that the single-animal approach is adequate in this case.

      (9) The level of detail included in the mechanism described in Figure S8 is not supported by the work shown.

      Fig. S8 is not presented as a summary of our findings but as a model that is consistent with our data - since it is by definition somewhat speculative, we present it in the supplementary data.   

      Minor Comments:

      (1) Figure 6S title is written incorrectly.

      We thank the reviewer for noticing this - we have corrected this in the revised manuscript.

      (2) Images shown in Figure S7C need scale bars.

      These images are at 40X magnification - this has been added to the legend.

    1. Author response:

      Reviewing Editor Comments:

      Based on the feedback from the reviewers, a focus on the following major points has the potential to improve the overall assessment of the significance of the findings and the strength of the evidence:

      (1) It would be helpful to clearly articulate how these findings advance the field beyond what has already been demonstrated or suggested in other systems.

      We will revise the Introduction and Discussion to better contextualize our findings. We will provide a careful comparison of the Ciona atrial siphon invagination with the other established systems to elucidate the unique aspects of our model. In doing so, we will highlight our discovery of a novel bidirectional "lateral-apical-lateral" contractility as a distinct mechanical paradigm for sequential morphogenesis.

      (2) It would be helpful to clarify the meaning of "translocation" and more explicitly describe the temporal and spatial patterns of active myosin localization during the two steps of invagination.

      We will replace “translocation” with the more accurate and conservative term “redistribution” throughout the manuscript, including in the title. We will also revise the text in the Results and Discussion sections to avoid overinterpretation. To provide a more explicit description of the spatiotemporal patterns, we will add new quantitative analyses of active myosin intensity from earlier time points (13-14 hpf) to rigorously support the initial lateral-to-apical redistribution phase. Then, we will add high-resolution top-view images to unambiguously show the ring-like localization of myosin at the apical cell-cell junctions during the initial stage. Finally, we will correct the schematic in Figure 2C to accurately reflect the predominant localization of active myosin at the apical cell-cell borders.

      (3) It would be helpful to explain how the optogenetic data support the conclusion that "redistribution of myosin contractility from the apical to lateral regions is essential for the development of invagination".

      We acknowledge the limitation of the original global inhibition experiment. We will perform additional experiments that combine optogenetic inhibition with subsequent immunostaining of the active myosin. By quantitatively comparing the distribution of actomyosin in light-stimulated versus dark-control embryos, we will be able to demonstrate whether the inhibition prevents the establishment of the lateral contractility domain. This will allow us to refine our conclusion.

      (4) It would be helpful to describe how the modeling work fits within the existing literature on modeling epithelial folding and to address discrepancies between the model and the actual biological observations, such as tissue curvature, limited invagination depth in the model, and the "puckering" surrounding the invagination. In addition, certain descriptions of the modeling results should be clarified, as suggested by Reviewer #3.

      We fully agree that we should discuss the existing theoretical work on epithelial folding more clearly. Clarifying how physical forces contribute to invagination is central to interpreting the underlying mechanisms, and we appreciate the opportunity to better connect our framework to existing studies. In the revision, we will expand the Introduction and Discussion to place our model in the appropriate theoretical context and highlight how it relates to and differs from previous approaches. At the same time, we will extend the model to a curved geometric framework to more accurately reproduce the experimental observations, which will improve its predictive value. We will also revise the descriptions and schematic representations of the modeling results to enhance clarity and better align them with the biological data.

      (5) It would be helpful to elaborate on the methods for quantitative image analysis and statistical tests.

      We will thoroughly expand the Methods section to provide a detailed step-by-step description of image quantification procedures, including precise definitions of the apical, lateral, and basal domains used for intensity measurements and the measurement of cell surface areas and invagination depths.

      Reviewer #1 (Public review):

      Summary:

      This paper investigates the physical basis of epithelial invagination in the morphogenesis of the ascidian siphon tube. The authors observe changes in actin and myosin distribution during siphon tube morphogenesis using fixed specimens and immunohistochemistry. They discover that there is a biphasic change in the actomyosin localization that correlates with changes in cell shapes. Initially, there is the well-known relocation of actomyosin from the lateral sides to the apical surface of cells that will invaginate, accompanied by a concomitant lengthening of the central cells within the invagination, but not a lot of invagination. Coincident with a second, more rapid, phase of invagination, the authors see a relocalization of actomyosin back to the lateral sides of the cells. This 2nd "bidirectional" relocation of actin appears to be important because optogenetic inhibition of myosin in the lateral domain after the initial invagination phase resulted in a block of further invagination. Although not noted in the paper, the finding that the second phase of siphon invagination is dependent on actomyosin is interesting and important because it has been shown that, during Drosophila mesoderm invagination, a second "folding" phase of invagination is independent of actomyosin contraction (Guo et al. eLife 2022), so there appear to be important differences between the Drosophila mesoderm system and the ascidian siphon tube system.

      Using the experimental data, the authors create a vertex model of the invagination, and simulations reveal a coupled mechanism of apicobasal tension imbalance and lateral contraction that creates the invagination. The resultant model appears to recapitulate many aspects of the observed cell behaviors, although there are some caveats to consider (described below).

      We sincerely thank you for this insightful comment and for bringing the important study by Guo et al. (2022) to our attention. We fully agree that a direct comparison between these two mechanisms is important for placing our findings in context. As you astutely point out, the fundamental difference lies in the autonomy and driving force of the second, rapid invagination phase. To highlight this important conceptual advance, we will add a dedicated paragraph in the Discussion section to explicitly discuss this point.

      Strengths:

      The studies and presented results are well done and provide important insights into the physical forces of epithelial invagination, which is important because invaginations are how a large fraction of organs in multicellular organisms are formed.

      Thank you for this positive assessment and for recognizing the significance of our work in elucidating the physical mechanisms underlying fundamental morphogenetic processes. We have striven to provide a comprehensive and rigorous analysis, and are grateful for this encouraging feedback.

      Weaknesses:

      (1) This reviewer has concerns about two aspects of the computational model. First, the model in Figure 5D shows a simulation of a flat epithelial sheet creating an invagination. However, the actual invagination is occurring in a small embryo that has significant curvature, such that nine or so cells occupy a 90-degree arc of the 360-degree circle that defines the embryo's cross-section (e.g., see Figure 1A). This curvature could have important effects on cell behavior.

      Thank you for bringing up the issue of tissue curvature. In this initial version of the model, we treated the tissue as flat because, although the anterior epidermis indeed has significant curvature, the region that actually undergoes invagination occupies only a small arc of the embryo's cross-section (roughly a 30-degree arc of the 360-degree circle). In addition, the embryo elongates anisotropically, and by 16.5 hpf the curvature has largely diminished (Fig. 1A), leaving this local region effectively flattened. We agree that this simplification may overlook contributions from early curvature, and we will examine curvature changes more carefully in the data and incorporate curved geometry into the model to evaluate their impact.
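      The two arc estimates in this exchange (about 90 degrees in the reviewer's comment, roughly 30 degrees in the response) can be compared with a small geometric calculation. The sketch below assumes an idealized circular cross-section, which is a simplification introduced here rather than a measurement from the manuscript, and the function name is hypothetical. It reports how far an arc bulges away from its straight chord, as a fraction of the chord length.

```python
import math


def sagitta_to_chord(arc_degrees):
    """Bulge height of a circular arc divided by its chord length.

    For an arc subtending angle theta on a circle of radius R:
      sagitta = R * (1 - cos(theta / 2)),  chord = 2 * R * sin(theta / 2).
    The ratio is independent of R.
    """
    half = math.radians(arc_degrees) / 2
    return (1 - math.cos(half)) / (2 * math.sin(half))


# Compare the two arc sizes discussed in the text (idealized circle assumed).
for arc in (30, 90):
    print(f"{arc}-degree arc: sagitta/chord = {sagitta_to_chord(arc):.3f}")
```

      Under this idealization the bulge of a 30-degree arc is under 7% of its chord, against roughly 21% for a 90-degree arc. This calculation does not capture the mechanical effects of curvature.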

      (2) The second concern about the model is that Figure 5 D shows the vertex model developing significant "puckering" (bulging) surrounding the invagination. Such "puckering" is not seen in the in vivo invagination (Figure 1A, 2A). This issue is not discussed in the text, so it is unclear how big an issue this is for the developed model, but the model does not recapitulate all aspects of the siphon invagination system.

      Thank you for pointing out the issue regarding the accuracy of the deformation pattern in our simulations. We do observe a mild puckering in vivo around 17 hpf (Fig. 1A), but it is clearly less pronounced than in the current model. The presence of such deformation suggests that bending stiffness of the epithelial sheet contributes to the mechanics of the invagination, which is included in our current model. The discrepancy reflects limitations in our mechanical assumptions and geometric simplifications, including oversimplified interactions between the apical cell layer and the underlying basal cells, as well as the omission of tissue curvature. We will refine these aspects in the revised model to better reproduce the deformation patterns observed in vivo.

      (3) In Figure 2A, Top View, and the schematic in Figure 2C, the developing invagination is surrounded by a ring of aligned cell edges characteristic of a "purse string" type actomyosin cable that would create pressure on the invaginating cells, which has been documented in multiple systems. Notably, the schematic in Figure 2C shows myosin II localizing to aligned "purse string" edges, suggesting the purse string is actively compressing the more central cells. If the purse string consistently appears during siphon invagination, a complete understanding of siphon invagination will require understanding the contributions of the purse string to the invagination process.

      Thank you for this excellent observation. We agree that the ring-like actomyosin structure is a prominent feature during the initial stages of invagination, and its potential role warrants discussion. We carefully re-examined our data. Our analysis confirms that this myosin ring is most pronounced during the early initial invagination stage (approximately 13-14 hpf). This inward compression from the periphery would work in concert with apical constriction to help shape the initial invagination. However, this ring-like myosin pattern significantly diminishes in the accelerated invagination stage. We feel that the purse string may play a collaborative role in the early phase; however, its dissolution at the accelerated invagination stage indicates that Ciona atrial siphon invagination does not entirely rely on sustained compression from the purse string of surrounding cells. These data will be included in the supplementary materials.

      (4) The introduction and discussion put the work in the context of work on physical forces in invagination, but there is not much discussion of how the modeling fits into the literature.

      We apologize for not providing sufficient context on how our theoretical framework relates to prior work on the mechanics of invagination. You are absolutely right that the Introduction and Discussion sections should more clearly situate our model within the existing literature, including the classical formulations it builds upon and the more recent models that address similar morphogenetic processes. In the revision, we will expand these sections to acknowledge relevant work, clarify how our approach connects to and differs from previous models, and explicitly discuss the strengths and limitations of our framework. We appreciate this helpful suggestion and will make these connections much clearer.

      Reviewer #2 (Public review):

      Summary:

      The authors propose that bidirectional translocation of actomyosin drives tissue invagination in Ciona siphon tube formation. They suggest a two-stage model where actomyosin first accumulates apically to drive a slow initial invagination, followed by translocation to lateral domains to accelerate the invagination process through cell shortening. They have shown that actomyosin activity is important for invagination - modulation of myosin activity through expression of myosin mutants altered the timing and speed of invagination; furthermore, optogenetic inhibition of myosin during the transition of the slow and fast stages disrupted invagination. The authors further developed a vertex model to validate the relationship between contractile force distribution and epithelial invagination.

      Thank you for your thoughtful and accurate summary of our work and for your constructive critique.

      Strengths:

      (1) The authors employed various techniques to address the research question, including optogenetics, the use of MRLC mutants, and vertex modelling.

      (2) The authors provide quantitative analyses for a substantial portion of their imaging data, including cell and tissue geometry parameters as well as actin and myosin distributions. The sample sizes used in these analyses appear appropriate.

      (3) The authors combined experimental measurements with computer modeling to test the proposed mechanical models, which represents a strength of the study. It provides a framework to explore the mechanical principles underlying the observed morphogenesis.

      We are grateful for your positive assessment of the multidisciplinary approaches, quantitative analyses, and the integration of modeling with experiments.

      Weaknesses:

      (1) The concept of coordinated and sequential action of apical and lateral actomyosin in support of epithelial folding has been documented through a combination of experimental and modeling approaches in other contexts, such as ascidian endoderm invagination (PMID: 20691592) and gastrulation in Drosophila (PMIDs: 21127270, 22511944, 31273212). While the manuscript addresses an important question, related findings have been reported in these previous studies. This overlap reduces the degree of novelty, and it remains to be clarified how their work advances beyond these prior contributions.

      We thank you for raising this important point regarding the novelty of our work and for directing us to the key literature on ascidian endoderm invagination (PMID: 20691592) and Drosophila gastrulation (PMIDs: 21127270, 22511944, 31273212). We agree with the reviewer that the sequential activation of contractility in different cellular domains is a fundamental mechanism driving epithelial morphogenesis, as elegantly demonstrated in these prior studies. Our work builds upon this foundational concept. However, we believe it reveals a novel and distinct mechanical model. Both the ascidian endoderm and the atrial siphon involve a sequential shift of actomyosin contractility, but the spatial pattern and functional outcomes are fundamentally different. In the ascidian endoderm (PMID: 20691592), the transition is from apical constriction to basolateral contraction. Basolateral contraction works in concert with a persistent circumferential to overcome tissue resistance and drive invagination. In contrast, our study of the atrial siphon reveals a bidirectional actomyosin redistribution between the apical and lateral domains. The basal domain in our system appears to play a more passive, structural role. While Drosophila gastrulation also involves apical and lateral myosin, the mechanisms and dependencies differ. As supported by recent work (Guo et al. eLife 2022), ventral furrow invagination can proceed even when lateral contractility is compromised, indicating that it is not an absolute requirement. In our system, however, optogenetic inhibition and our vertex model strongly suggest that the acquisition of lateral contractility is essential for the accelerated invagination stage. We will revise the text to better articulate these points of distinction and novelty in the Introduction and Discussion sections.

      (2) One of the central statements made by the authors is that the translocation of actomyosin between the apical and lateral domains mediates invagination. The use of the term "translocation" implies that the same actomyosin structures physically move from one location to another, which is not demonstrated by the data. Given the time scale of the process (several hours), it is also possible that the observed spatiotemporal patterns of actomyosin intensity result from sequential activation/assembly and inactivation/disassembly at specific locations on the cell cortex, rather than from the physical translocation of actomyosin structures over time.

      Your critique regarding the term "translocation" was well-founded. We will replace “translocation” with the more accurate and conservative term “redistribution” throughout the manuscript, including in the title. We will also revise the text in the Results and Discussion sections to avoid overinterpretation.

      (3) Some aspects of the data on actomyosin localization require further clarification. (1) The authors state that actomyosin translocation is bidirectional, first moving from the lateral domain to the apical domain; however, the reduction of the lateral actomyosin at this step was not rigorously tested. (2) During the slow invagination stage, it is unclear whether myosin consistently localizes to the apical cell-cell borders or instead relocalizes to the medioapical domain, as suggested by the schematic illustration presented in Figure 2C. (3) It is unclear how many cells along the axis orthogonal to the furrow accumulate apical and lateral myosin.

      Thank you for your insightful comments, which will help us significantly improve the clarity and rigor of our actomyosin localization analysis. To address the points raised, we will undertake several key revisions: First, we will add new quantitative analyses of active myosin intensity from earlier time points (13-14 hpf) to rigorously support the initial lateral-to-apical redistribution phase. Second, we will correct the schematic in Figure 2C to accurately reflect the predominant localization of active myosin at the apical cell-cell borders. Finally, we will clarify that the actomyosin redistribution occurs within a broader domain of approximately 15-20 cells in the invagination primordium, not being restricted to the single central cell on which our quantitative measurements were focused.

      (4) The overexpression of MRLC mutants appears to be rather patchy in some cases (e.g., in Figure 3A, 17.0 hpf, only cells located at the right side of the furrow appeared to express MRLC T18ES19E). It is unclear how such patchy expression would impact the phenotype.

      Thank you for your observation. We acknowledge that mosaic expression is common in Ciona electroporation. For all quantitative analyses, we only selected embryos in which the central cell, along with more than half of the surrounding cells in the primordium, showed clear expression of the plasmid.

      (5) In the optogenetic experiment, it appears that after one hour of light stimulation, the apical side of the tissue underwent relaxation (comparing 17 hpf and 16 hpf in Figure 4B). It is therefore unclear whether the observed defect in invagination is due to apical relaxation or lack of lateral contractility, or both. Therefore, the phenotype is not sufficient to support the authors' statement that "redistribution of myosin contractility from the apical to lateral regions is essential for the development of invagination".

      We agree that our optogenetic inhibition experiment does not distinguish between apical and lateral roles. To directly address this point, we will perform additional experiments in which we conduct the optogenetic inhibition and subsequently fix and stain the embryos for active myosin and F-actin. This will allow us to quantitatively compare the distribution of actomyosin in the light-stimulated experimental group versus the dark control group. We expect that light activation will have a more pronounced inhibitory effect on the lateral domains than on the apical domain, as the latter is naturally undergoing a reduction in contractility at this stage.

      (6) The vertex model is designed to explore how apical and lateral tensions contribute to distinct morphological outcomes. While the authors raise several interesting predictions, these are not further tested, making it unclear to what extent the model provides new insights that can be validated experimentally. In addition, modeling the epithelium as a flat sheet and not accounting for cell curvature is a simplification that may limit the model's accuracy. Finally, the model does not fully recapitulate the deeply invaginated furrow configuration as observed in a real embryo (comparing 18 hpf in Figure 5D and 18 hpf in Figure 1A) and does not fully capture certain mutant phenotypes (comparing 18 hpf in Figure 5F and 18 hpf in Figure 3B right panel).

      Thank you for raising these important points. We agree that several model predictions require stronger experimental grounding, and that the flat-sheet assumption is an oversimplification that likely contributes to the model not fully capturing certain morphological features. Our current simulations of myosin perturbation are largely consistent with the optogenetic experiments and the behavior of the myosin mutant. However, the predictions obtained by theoretically decoupling apical and lateral tension are difficult to validate experimentally, given the challenges of selectively manipulating these two components in vivo. Based on your helpful suggestions, we will extend the model to incorporate tissue curvature and examine how initial bending influences the mechanics of invagination, which we expect will improve the accuracy of the model’s morphological predictions.

      Reviewer #3 (Public review):

      Summary:

      In this manuscript by Qiao et al., the authors seek to uncover force and contractility dynamics that drive tissue morphogenesis, using the Ciona atrial siphon primordium as a model. Specifically, the authors perform a detailed examination of epithelial folding dynamics. Generally, the authors' claims were supported by their data, and the conceptual advances may have broader implications for other epithelial morphogenesis processes in other systems.

      Thank you for your positive summary and for recognizing the broader implications of our work.

      Strengths:

      The strengths of this manuscript include the variety of experimental and theoretical methods, including generally rigorous imaging and quantitative analyses of actomyosin dynamics during this epithelial folding process, and the derivation of a mathematical model based on their empirical data, which they perturb in order to gain novel insights into the process of epithelial morphogenesis.

      Thank you for highlighting the strengths of our multidisciplinary methodology.

      Weaknesses:

      There are concerns related to wording and interpretations of results, as well as some missing descriptions and details regarding experimental methods.

      We will revise the manuscript to address your concerns regarding wording and methodological details. Your feedback will help us improve clarity, precision, and the depth of methodological description throughout the text.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review)

      Summary: 

      In the paper, the authors investigate how the availability of genomic information and the timing of vaccine strain selection influence the accuracy of influenza A/H3N2 forecasting. The manuscript presents three key findings: 

      (1) Using real and simulated data, the authors demonstrate that shortening the forecasting horizon and reducing submission delays for sharing genomic data improve the accuracy of virus forecasting. 

      (2) Reducing submission delays also enhances estimates of current clade frequencies. 

      (3) Shorter forecasting horizons, for example those allowed by the proposed use of "faster" vaccine platforms such as mRNA, result in the most significant improvements in forecasting accuracy.

      Strengths: 

      The authors present a robust analysis, using statistical methods based on previously published genetic-based techniques to forecast influenza evolution. Optimizing prediction methods is crucial from both scientific and public health perspectives. The use of simulated as well as real genetic data (collected between April 1, 2005, and October 1, 2019) to assess the effects of shorter forecasting horizons and reduced submission delays is valuable and provides a comprehensive dataset. Moreover, the accompanying code is openly available on GitHub and is well-documented. 

      Thank you for this summary! We worked hard to make this analysis robust, reproducible, and open source.

      Weaknesses: 

      While the study addresses a critical public health issue related to vaccine strain selection and explores potential improvements, its impact is somewhat constrained by its exclusive reliance on predictive methods using genomic information, without incorporating phenotypic data. The analysis remains at a high level, lacking a detailed exploration of factors such as the genetic distance of antigenic sites.

      We are glad to see this acknowledgment of the critical public health issue we've addressed in this project. The goal for this study was to test effects of counterfactual scenarios with realistic public health interventions and not to introduce methodological improvements to forecasting methods. The final forecasting model we analyzed in this study (lines 301-330 and Figure 6) was effectively an "oracle" model that produced the optimal forecast for each given current and future timepoint. We expect any methodological improvements to forecasting models to converge toward the patterns we observed in this final section of the results.

      We've addressed the reviewer's concerns in more detail in response to their numbered comments 4 and 5 below.

      Another limitation is the subsampling of the available dataset, which reduces several tens of thousands of sequences to just 90 sequences per month with even sampling across regions. This approach, possibly due to computational constraints, might overlook potential effects of regional biases in clade distribution that could be significant. The effect of dataset sampling on presented findings remains unexplored. Although the authors acknowledge limitations in their discussion section, the depth of the analysis could be improved to provide a more comprehensive understanding of the underlying dynamics and their effects.

      We have addressed this comment in the numbered comment 1 below.

      Suggestions to enhance the depth of the manuscript: 

      Thank you again for these thoughtful suggestions. They have encouraged us to revisit aspects of this project that we had overlooked by being too close to it and have helped us improve the paper's quality.

      (1) Subsampling and Sampling Strategies: It would be valuable to comment on the rationale behind the strong subsampling of the available GISAID data. A discussion of the potential effects of different sampling strategies is necessary. Additionally, assessing the stability of the results under alternative sequence sampling strategies would strengthen the robustness of the conclusions. 

      We agree with the reviewer's point that our subsampled sequences only represent a fraction of those available in the GISAID EpiFlu database and that a more complete representation would be ideal. We designed the subsampling approach we used in this study for two primary reasons.

      (1) First, we sought to minimize known regional and temporal biases in sequence availability. For example, North America and Europe are strongly overrepresented in the GISAID EpiFlu database, while Africa and Asia are underrepresented (Figure 1A). Additionally, the number of sequences in the database has increased every year since 2010, causing later years in this study period to be overrepresented compared to earlier years. A major limitation of our original forecasting model from Huddleston et al. 2020 is its inability to explicitly estimate geographic-specific clade fitnesses. Because of this limitation, we trained that original model on evenly subsampled sequences across space and time. We used the same approach in this study to allow us to reuse that previously trained forecasting model. Despite this strong subsampling approach, we still selected an average of 50% of all available sequences across all 10 regions and the entire study period (Figure 1B). Europe and North America were most strongly downsampled with only 7% and 8% of their total sequences selected for the study, respectively. In contrast, we selected 91% of all sequences from Southeast Asia.

      (2) Second, our forecasting model relies on the inference of time-scaled phylogenetic trees which are computationally intensive to infer. While new methods like CMAPLE (Ly-Trong et al. 2024) would allow us to rapidly infer divergence trees, methods to infer time trees still do not scale well to more than ~20,000 samples. The subsampling approach we used in this study allowed us to build the 35 six-year H3N2 HA trees we needed to test our forecasting model in a reasonable amount of time.
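
      The even sampling across regions and time described above can be sketched as follows. This is a hypothetical illustration and not the pipeline used in the study: the function name `subsample_evenly`, the record keys (`strain`, `region`, `month`), and the round-robin rule for filling each monthly quota are all assumptions made for the example.

```python
import random
from collections import defaultdict


def subsample_evenly(records, per_month=90, seed=0):
    """Select up to `per_month` records per month, spread evenly across regions.

    `records` is a list of dicts with the hypothetical keys
    "strain", "region", and "month" (for example "2015-01").
    """
    rng = random.Random(seed)

    # Group records by month, then by region within each month.
    by_month = defaultdict(lambda: defaultdict(list))
    for rec in records:
        by_month[rec["month"]][rec["region"]].append(rec)

    selected = []
    for month in sorted(by_month):
        # Shuffle each region's records so the draw within a region is random.
        pools = {
            region: rng.sample(recs, len(recs))
            for region, recs in sorted(by_month[month].items())
        }
        picked = 0
        # Take one record per region in turn, so sparsely sampled regions
        # contribute everything they have before dense regions fill the quota.
        while picked < per_month and any(pools.values()):
            for region in pools:
                if picked >= per_month:
                    break
                if pools[region]:
                    selected.append(pools[region].pop())
                    picked += 1
    return selected
```

      Under this rule, a region with few sequences in a given month is sampled almost completely while a heavily sequenced region is strongly downsampled, which is the qualitative pattern reported above for Southeast Asia versus Europe and North America.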

      We have expanded our description of this rationale for our subsampling approach in the discussion and described the potential effects of geographic and temporal biases on forecasting model predictions (lines 360-376). Our original discussion read:

      "Another immediate improvement would be to develop models that can use all available data in a way that properly accounts for geographic and temporal biases. Current models based on phylogenetic trees need to evenly sample the diversity of currently circulating viruses to produce unbiased trees in a reasonable amount of time. Models that could estimate sample fitness and compare predicted and future populations without trees could use more available sequence data and reduce the uncertainty in current and future clade frequencies."

      The section now reads:

      "Another immediate improvement would be to develop models that can use all available data in a way that properly accounts for geographic and temporal biases. For example, virus samples from North America and Europe are overrepresented in the GISAID EpiFlu database, while samples from Africa and Asia are underrepresented (McCarron et al. 2022). As new H3N2 epidemics often originate from East and Southeast Asia and burn out in North America and Europe (Bedford et al. 2015), models that do not account for this geographic bias are more likely to incorrectly predict the success of lower fitness variants circulating in overrepresented regions and miss higher fitness variants emerging from underrepresented regions. Additionally, the number of H3N2 HA sequences per year in the GISAID EpiFlu database has increased consistently since 2010, creating a temporal bias where any given season a model forecasts to will have more sequences available than the season from which forecasts occur. The model we used in this study does not explicitly account for geographic variability of viral fitness and relies on time-scaled phylogenetic trees which can be computationally costly to infer for large sample sizes. As a result, we needed to evenly sample the diversity of currently circulating viruses to produce unbiased trees in a reasonable amount of time. Models that could estimate viral fitness per geographic region without inferring trees could use more available sequence data and reduce the uncertainty in current and future clade frequencies."

      We also added a brief explanation of our subsampling method to the corresponding section of the methods (lines 411-415). These lines read:

      "This sampling approach accounts for known regional biases in sequence availability through time (McCarron et al. 2022) and makes inference of divergence and time trees computationally tractable. This approach also exactly matches our previous study where we first trained the forecast models used in this study (Huddleston et al. 2020), allowing us to reuse those previously trained models."

      Although our forecast model is limited to a small proportion of sequences that we evenly sample across regions and time, we agree that we could improve the robustness of our conclusions by repeating our analysis for different subsets of the available data. To assess the stability of the results under alternative sequence sampling strategies, we ran a second replicate of our entire analysis of natural H3N2 populations with three times as many sequences per month (270) as our original replicate. With this approach, we selected between 17% (Europe) and 97% (Southeast Asia) of all sequences per region with an average of 72% and median of 83% (Figure 1C). We compared the effects of realistic interventions for this high-density subsampling analysis with the effects from the original subsampling analysis (Figure 6). We have added the results from this analysis to the main text (lines 313-321), which now reads:

      "For natural A/H3N2 populations, the average improvement of the vaccine intervention was 1.1 AAs and the improvement of the surveillance intervention was 0.27 AAs or approximately 25% of the vaccine intervention. The average improvement of both interventions was only slightly less than additive at 1.28 AAs. To verify the robustness of these results, we replicated our entire analysis of A/H3N2 populations using a subsampling scheme that tripled the number of viruses selected per month from 90 to 270 (Figure 1—figure supplement 4C). We found the same pattern with this replication analysis, with average improvements of 0.93 AAs for the vaccine intervention, 0.21 AAs for the surveillance intervention, and 1.14 AAs for both interventions (Figure 6—figure supplement 2)."
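
      The relationships among the three averages quoted above can be checked with a few lines of arithmetic. The sketch below only restates the reported values; the helper name `summarize` and its output keys are assumptions made for illustration.

```python
def summarize(vaccine, surveillance, combined):
    """Compare intervention effects (in amino acids, AAs)."""
    additive_sum = vaccine + surveillance
    return {
        # Surveillance effect as a fraction of the vaccine effect.
        "ratio": surveillance / vaccine,
        # Effect expected if the two interventions were perfectly additive.
        "additive_sum": additive_sum,
        # How far the observed combined effect falls short of additivity.
        "shortfall": additive_sum - combined,
    }


# Reported averages for the original subsampling (90 viruses per month).
original = summarize(vaccine=1.10, surveillance=0.27, combined=1.28)

# Reported averages for the high-density replicate (270 viruses per month).
replicate = summarize(vaccine=0.93, surveillance=0.21, combined=1.14)
```

      With the original subsampling, the surveillance effect is about a quarter of the vaccine effect and the combined effect falls 0.09 AAs short of the additive sum of 1.37 AAs. In the high-density replicate, the combined effect matches the additive sum to the reported precision.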

      We updated our revised manuscript to include the summary of sequences available and subsampled as Figure 1—figure supplement 4 and the effects of interventions with the high-density analysis as Figure 6—figure supplement 2. For reference, we have included Figure 2 showing both the original Figure 6 (original subsampling) and Figure 6—figure supplement 2 (high-density subsampling).

      (2) Time-Dependent Effects: Are there time-dependent patterns in the findings? For example, do the effects of submission lag or forecasting horizon differ across time periods, such as [2005-2010, 2010-2015, 2015-2018]? This analysis could be particularly interesting given the emergence of co-circulation of clades 3c.2 and 3c.3 around 2012, which marked a shift to less "linear" evolutionary patterns over many years in influenza A/H3N2.

      This is an interesting question that we overlooked by focusing on the broader trends in the predictability of A/H3N2 evolution. The effects of realistic interventions that we report in Figure 6 span future timepoints of 2012-04-01 to 2019-10-01. Since H1N1pdm emerged in 2009 and 3c3 started cocirculating with 3c2 in 2012, we can't inspect effects for the specific epochs mentioned above. However, there have been many periods during this time span where the number of cocirculating clades varied in ways that could affect forecast accuracy. The streamgraph, Author response image 1, shows the variation in clade frequencies from the "full tree" that we used to define clades for A/H3N2 populations.

      Author response image 1.

      Streamgraph of clade frequencies for A/H3N2 populations demonstrating variability of clade cocirculation through time.

      We might expect that forecasting models would struggle to accurately predict future timepoints with higher clade diversity, since much of that diversity would not have existed at the time of the forecast. We might also expect faster surveillance to improve our ability to detect that future variation by detecting those variants at low frequency instead of missing them completely.

      To test this hypothesis, we calculated the Shannon entropy of clade frequencies per future timepoint represented in Figure 6 (under no submission lag) and plotted the change in optimal distance to the predicted future by the entropy per timepoint. If there was an effect of future clade complexity on forecast accuracy, we expected greater improvements from interventions to be associated with higher future entropy.
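
      For readers unfamiliar with the measure, a minimal sketch of Shannon entropy over clade frequencies at one timepoint follows. The function name and the choice of logarithm base are assumptions; the response does not state which base was used, and the base only rescales the values.

```python
import math


def shannon_entropy(frequencies, base=math.e):
    """Shannon entropy of a set of clade frequencies at a single timepoint.

    Frequencies are normalized to sum to 1; zero-frequency clades
    contribute nothing to the entropy.
    """
    total = sum(frequencies)
    if total <= 0:
        raise ValueError("frequencies must have a positive sum")

    entropy = 0.0
    for freq in frequencies:
        p = freq / total
        if p > 0:
            entropy -= p * math.log(p, base)
    return entropy
```

      A timepoint dominated by a single clade has entropy 0, while a timepoint with several clades at similar frequencies has higher entropy, so higher values correspond to greater clade cocirculation.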

      There was a trend for some of the greatest improvements per intervention to occur at higher future clade entropy timepoints, but we didn’t find a strong relationship between clade entropy and improvement in forecast accuracy by any intervention (Figure 4). The highest correlation was for improved surveillance (Pearson r=0.24).

      We have added this figure to the revised manuscript as Figure 6—figure supplement 3 and updated the results (lines 321-323) to reflect the patterns we described above. The updated results (which partially includes our response to the next reviewer comment) read:

      "These effects of realistic interventions appeared consistent across the range of genetic diversity at future timepoints (Figure 6—figure supplement 3) and for future seasons occurring in both Northern and Southern Hemispheres (Figure 6—figure supplement 4)."

      (3) Hemisphere-Specific Forecasting: Do submission lags or forecasting horizons show different performance when predicting Northern versus Southern Hemisphere viral populations? Exploring this distinction could add significant value to the analysis, given the seasonal differences in influenza circulation.

      Similar to the question above, we can replot the improvements in optimal distances to the future for the realistic interventions, grouping values by the hemisphere that has an active season in each future timepoint. Much like we expected forecasts to be less accurate when predicting into a highly diverse season, we might also expect forecasts to be less accurate when predicting into a season for a more densely populated hemisphere. Specifically, we expected that realistic interventions would improve forecast accuracy more for Northern Hemisphere seasons than Southern Hemisphere seasons. For this analysis, we labeled future timepoints that occurred in October or January as "Northern" and those that occurred in April or July as "Southern". We plotted effects of interventions on optimal distances to the future by intervention and hemisphere.
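
      The labeling rule described above can be written as a small helper. This is a sketch of the stated rule only; the function and constant names are hypothetical.

```python
from datetime import date

# Months of future timepoints assigned to each hemisphere's season,
# following the rule stated in the text.
NORTHERN_MONTHS = {10, 1}  # October, January
SOUTHERN_MONTHS = {4, 7}   # April, July


def label_hemisphere(timepoint: date) -> str:
    """Label a future timepoint by the hemisphere with an active season."""
    if timepoint.month in NORTHERN_MONTHS:
        return "Northern"
    if timepoint.month in SOUTHERN_MONTHS:
        return "Southern"
    raise ValueError(f"unexpected timepoint month: {timepoint.month}")
```

      For example, the first and last future timepoints in the analysis, 2012-04-01 and 2019-10-01, would be labeled "Southern" and "Northern", respectively.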

      In contrast to our original expectation, we found a slightly higher median improvement for the Southern Hemisphere seasons under both of the interventions that improved the vaccine timeline (Figure 5). The median improvement for the combined intervention was 1.42 AAs in the Southern Hemisphere and 0.93 AAs in the Northern Hemisphere. Similarly, the improvement with the "improved vaccine" intervention was 1.03 AAs in the South and 0.74 AAs in the North. However, the range of improvements per intervention was greater for the Northern Hemisphere across all interventions. The median increase in forecast accuracy was similar for both hemispheres in the improved surveillance intervention, with a single Northern Hemisphere season showing an unusually greater improvement that was also associated with higher clade entropy (Figure 4). These results suggest that an improved vaccine development timeline, alone or combined with more timely sequence submissions, would improve forecast accuracy more for Southern Hemisphere seasons than for Northern Hemisphere seasons.

      We have added this figure to the revised manuscript as Figure 6—figure supplement 4 and updated the results (lines 321-326) to reflect the patterns we described above. The new lines in the results read:

      "These effects of realistic interventions appeared consistent across the range of genetic diversity at future timepoints (Figure 6—figure supplement 3) and for future seasons occurring in both Northern and Southern Hemispheres (Figure 6—figure supplement 4). We noted a slightly greater median improvement in forecast accuracy associated with both improved vaccine interventions for the Southern Hemisphere seasons (1.03 and 1.42 AAs) compared to the Northern Hemisphere seasons (0.74 and 0.93 AAs)."

      (4) Antigenic Sites and Submission Delays: It would be interesting to investigate whether incorporating antigenic site information in the distance metric amplifies or diminishes the observed effects of submission delays. Such an analysis could provide a first glance at how antigenic evolution interacts with forecasting timelines. 

      This would be an interesting area to explore. One hypothesis along these lines would be that if 1) viruses with more substitutions at antigenic sites are more likely to represent the future population and 2) viruses with more antigenic substitutions originate in specific geographic locations and 3) submissions of sequences for those viruses are more likely to be lagged due to their geographic origin, then 4) decreasing submission lags should improve our forecasting accuracy by detecting antigenically-important sequences earlier. If there is not a direct link between viruses that are more likely to represent the future and higher submission lags, we would not expect to see any additional effect of reducing submission lags for antigenic sites. Based on our work in Huddleston et al. 2020, it is also not clear that assumption 1 above is consistently true, since the specific antigenic sites associated with high fitness change over time. In that earlier work, we found that models based on these antigenic (or "epitope") sites could only accurately predict the future when the relevant sites for viral success were known in advance. This result was shown by our "oracle" model which accurately predicted the future during the model validation period when it knew which sites were associated with success and failed to predict the future in the test period when the relevant sites for success had changed (Figure 6).

      To test the hypothesis above, we would need sequences to have submission lags that reflect their geographic origin. For this current study, we intentionally decoupled submission lags from geographic origin to allow inclusion of historical A/H3N2 HA sequences that were originally submitted as part of scientific publications and not as part of modern routine surveillance. As a result, the original submission dates for many sequences are unrealistically lagged compared to surveillance sequences.

      (5) Incorporation of Phenotypic Data: The authors should provide a rationale for their choice of a genetic-information-only approach, rather than a model that integrates phenotypic data. Previous studies, such as Huddleston et al. (2020, eLife), demonstrate that models combining genetic and phenotypic data improve forecasts of seasonal influenza A/H3N2 evolution. It would be interesting to probe the here observed effects in a more recent model.

      The primary goal of this study was not to test methodological improvements to forecasting models but to test the effects of realistic public health policy changes that could alter forecast horizons and sequence availability. Most influenza collaborating centers use a "sequence-first" approach where they sequence viral isolates first and use those sequences to prioritize viruses for phenotypic characterization (Hampson et al. 2017). The additional lag in availability of phenotypic data means that a forecasting model based on genetic and phenotypic data will necessarily have a greater lag in data availability than a model based on genetic data only. Since the policy changes we're testing in this study only affect the availability of sequence data and not phenotypic data, we chose to test the relative effects of policy changes on sequence-based forecasting models.

      We have updated the abstract (lines 18-26 and 30-32), introduction (lines 87-88), and discussion (lines 332-334) to emphasize the focus of this study on effects of policy changes. The updated abstract lines read as follows with new content in bold:

      "Despite continued methodological improvements to long-term forecasting models, these constraints of a 12-month forecast horizon and 3-month average submission lags impose an upper bound on any model's accuracy. The global response to the SARS-CoV-2 pandemic revealed that the adoption of modern vaccine technology like mRNA vaccines can reduce how far we need to forecast into the future to 6 months or less and that expanded support for sequencing can reduce submission lags to GISAID to 1 month on average. To determine whether these public health policy changes could improve long-term forecasts for seasonal influenza, we quantified the effects of reducing forecast horizons and submission lags on the accuracy of forecasts for A/H3N2 populations. We found that reducing forecast horizons from 12 months to 6 or 3 months reduced average absolute forecasting errors to 25% and 50% of the 12-month average, respectively. Reducing submission lags provided little improvement to forecasting accuracy but decreased the uncertainty in current clade frequencies by 50%. These results show the potential to substantially improve the accuracy of existing influenza forecasting models through the public health policy changes of modernizing influenza vaccine development and increasing global sequencing capacity."

      The updated introduction now reads:

      "These technological and public health policy changes in response to SARS-CoV-2 suggest that we could realistically expect the same outcomes for seasonal influenza."

      The updated discussion now reads:

      "In this work, we showed that realistic public health policy changes that decrease the time to develop new vaccines for seasonal influenza A/H3N2 and decrease submission lags of HA sequences to public databases could improve our estimates of future and current populations, respectively."

      We have also updated the introduction (lines 57-65) and the discussion (lines 345-348) to specifically address the use of sequence-based models instead of sequence-and-phenotype models. The updated introduction now reads:

      "For this reason, the decision process is partially informed by computational models that attempt to predict the genetic composition of seasonal influenza populations 12 months in the future (Morris et al. 2018). The earliest of these models predicted future influenza populations from HA sequences alone (Luksza and Lassig 2014, Neher et al. 2014, Steinbruck et al. 2014). Recent models include phenotypic data from serological experiments (Morris et al. 2018, Huddleston et al. 2020, Meijers et al. 2023, Meijers et al. 2025). Since most serological experiments occur after genetic sequencing (Hampson et al. 2017) and all forecasting models depend on HA sequences to determine the viruses circulating at the time of a forecast, sequence availability is the initial limiting factor for any influenza forecasts."

      The updated discussion now reads:

      "Since all models to date rely on currently available HA sequences to determine the clades to be forecasted, we expect that decreasing forecast horizons and submission lags will have similar relative effect sizes across all forecasting models including those that integrate phenotypic and genetic data."

      Reviewer #2 (Public review): 

      Summary: 

      The authors have examined the effects of two parameters that could improve their clade forecasting predictions for A(H3N2) seasonal influenza viruses based solely on analysis of haemagglutinin gene sequences deposited on the GISAID Epiflu database. Sequences were analysed from viruses collected between April 1, 2005 and October 1, 2019. The first parameter they investigated was the lag period (0, 1, or 3 months) for sequences to be deposited in GISAID from the time the viruses were sequenced. The second parameter was the forecast horizon, that is, how far forward the forecast projected (3, 6, 9, or 12 months). Their conclusion (not surprisingly) was that "the single most valuable intervention we could make to improve forecast accuracy would be to reduce the forecast horizon to 6 months or less through more rapid vaccine development". This is not practical using conventional influenza vaccine production and regulatory procedures. Nevertheless, this study does identify some practical steps that could improve the accuracy and utility of forecasting, including a few modifications suggested by the authors, such as "... changing the start and end times of our long-term forecasts. We could change our forecasting target from the middle of the next season to the beginning of the season, reducing the forecast horizon from 12 to 9 months."

      Strengths: 

      The authors are very familiar with the type of forecasting tools used in this analysis (LBI and mutational load models) and the processes currently used for influenza vaccine virus selection by the WHO committees, having participated in a number of WHO Influenza Vaccine Consultation meetings for both the Southern and Northern Hemispheres.

      Weaknesses: 

      The conclusion of limiting the forecasting to 6 months would only be achievable from the current influenza vaccine production platforms with mRNA. However, there are no currently approved mRNA influenza vaccines, and mRNA influenza vaccines have also yet to demonstrate their real-world efficacy, longevity, and cost-effectiveness and therefore are only a potential platform for a future influenza vaccine. Hence other avenues to improve the forecasting should be investigated. 

      We recognize that there are no approved mRNA influenza vaccines right now. However, multiple mRNA vaccines have completed phase 3 trials, indicating that these vaccines could realistically become available in the next few years. A primary goal of our study was to quantify the effects of switching to a vaccine platform with a shorter timeline than the status quo. Our results should further motivate the adoption of any modern vaccine platform that can produce safe and effective vaccines more quickly than the egg-passaged standard. We have updated the introduction (lines 88-91) to note the mRNA vaccines that have completed phase 3 trials. The new sentence in the introduction reads:

      "Work on mRNA vaccines for influenza viruses dates back over a decade (Petsch et al. 2012, Brazzoli et al. 2016, Pardi et al. 2018, Feldman et al. 2019), and multiple vaccines have completed phase 3 trials by early 2025 (Soens et al. 2025, Pfizer 2022)."

      While it is inevitable that more influenza HA sequences will become available over time, a better understanding of where new influenza variants emerge would enable a higher weighting to be used for those countries rather than giving an equal weighting to all HA sequences.

      This is definitely an important point to consider. The best estimates to date (Russell et al. 2008, Bedford et al. 2015) suggest that most successful variants emerge from East or Southeast Asia. In contrast, most available HA sequence data comes from Europe and North America (Figure 1A). Our subsampling method explicitly tries to address this regional bias in data availability by evenly sampling sequences from 10 different regions, including four distinct Asian regions (China, Japan/Korea, South Asia, and Southeast Asia). Instead of weighting all HA sequences equally, this sampling approach ensures that HA sequences from important distinct regions appear in our analysis.

      We have updated our methods (lines 411-423) to better describe the motivation of our subsampling approach and proportions of regions sampled with our original approach (90 viruses per month) and a second high-density sampling approach (270 viruses per month). These new lines read:

      "This sampling approach accounts for known regional biases in sequence availability through time (McCarron et al. 2022) and makes inference of divergence and time trees computationally tractable. This approach also exactly matches our previous study where we first trained the forecast models used in this study (Huddleston et al. 2020), allowing us to reuse those previously trained models. With this subsampling approach, we selected between 7% (Europe) and 91% (Southeast Asia) of all available sequences per region across the entire study period with an average of 50% and median of 52% across all 10 regions (Figure 1—figure Supplement 4). To verify the reproducibility and robustness of our results, we reran the full forecasting analysis with a high-density subsampling scheme that selected 270 sequences per month with the same even sampling across regions and time as the original scheme. With this approach, we selected between 17% (Europe) and 97% (Southeast Asia) of all available sequences per region with an average of 72% sampled and a median of 83% (Figure 1—figure Supplement 4C)."

      We added Figure 1—figure Supplement 4 to document the regional biases in sequence availability and the proportions of sequences we selected per region and year.
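
      The even sampling across regions and months described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each sequence record is a dictionary with hypothetical `region` and `month` keys and splits a monthly quota (90 in the original scheme) evenly across regions, so 10 regions would get 9 sequences each. How groups with fewer sequences than the quota were handled is not stated in the response; here they are simply kept whole.

```python
import random
from collections import defaultdict


def subsample_evenly(records, per_month=90, regions=None, seed=0):
    """Pick up to per_month // len(regions) records per (region, month) group.

    `records` is an iterable of dicts with hypothetical "region" and "month"
    keys; these field names are assumptions made for this illustration.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[(record["region"], record["month"])].append(record)

    if regions is None:
        regions = sorted({region for region, _ in groups})
    if not regions:
        return []
    quota = per_month // len(regions)  # e.g. 90 // 10 = 9 per region per month

    selected = []
    for key in sorted(groups):
        members = groups[key]
        # Groups smaller than the quota are kept whole (an assumption).
        k = min(quota, len(members))
        selected.extend(rng.sample(members, k))
    return selected


def fraction_sampled(records, selected):
    """Return the proportion of available records selected per region."""
    available = defaultdict(int)
    chosen = defaultdict(int)
    for record in records:
        available[record["region"]] += 1
    for record in selected:
        chosen[record["region"]] += 1
    return {region: chosen[region] / available[region] for region in available}
```

      With a fixed quota per region, densely sequenced regions contribute a small fraction of their available sequences while sparsely sequenced regions contribute most of theirs, which is the pattern the quoted proportions describe.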

      Also, other groups are considering neuraminidase sequences and how these contribute to the emergence of new or potentially predominant clades.

      We agree that accounting for antigenic evolution of neuraminidase is a promising path to improving forecasting models. We chose to focus on hemagglutinin sequences for several reasons, though. First, hemagglutinin is the only protein whose content is standardized in the influenza vaccine (Yamayoshi and Kawaoka 2019), so vaccine strain selection does not account for a specific neuraminidase. Additionally, as we noted in response to Reviewer 1 above, the goal of this study was to test effects of counterfactual scenarios with realistic public health interventions and not to introduce methodological improvements to forecasting models like the inclusion of neuraminidase sequences.

      We have updated the introduction to provide the additional context about hemagglutinin's outsized role in the current vaccine development process (lines 40-44):

      "The dominant influenza vaccine platform is an inactivated whole virus vaccine grown in chicken eggs (Wong and Webby, 2013) which takes 6 to 8 months to develop, contains a single representative vaccine virus per seasonal influenza subtype including A/H1N1pdm, A/H3N2, and B/Victoria (Morris et al., 2018), and for which only the HA protein content is standardized (Yamayoshi and Kawaoka, 2019)."

      We have updated the abstract (lines 18-26 and 30-32), introduction (lines 87-88), and discussion (lines 332-334) to emphasize our goal of testing effects of public health policy changes on forecasting accuracy rather than methodological changes. The updated abstract lines read as follows with new content in bold:

      "Despite continued methodological improvements to long-term forecasting models, these constraints of a 12-month forecast horizon and 3-month average submission lags impose an upper bound on any model's accuracy. The global response to the SARS-CoV-2 pandemic revealed that the adoption of modern vaccine technology like mRNA vaccines can reduce how far we need to forecast into the future to 6 months or less and that expanded support for sequencing can reduce submission lags to GISAID to 1 month on average. To determine whether these public health policy changes could improve long-term forecasts for seasonal influenza, we quantified the effects of reducing forecast horizons and submission lags on the accuracy of forecasts for A/H3N2 populations. We found that reducing forecast horizons from 12 months to 6 or 3 months reduced average absolute forecasting errors to 25% and 50% of the 12-month average, respectively. Reducing submission lags provided little improvement to forecasting accuracy but decreased the uncertainty in current clade frequencies by 50%. These results show the potential to substantially improve the accuracy of existing influenza forecasting models through the public health policy changes of modernizing influenza vaccine development and increasing global sequencing capacity."

      The updated introduction now reads:

      "These technological and public health policy changes in response to SARS-CoV-2 suggest that we could realistically expect the same outcomes for seasonal influenza."

      The updated discussion now reads:

      "In this work, we showed that realistic public health policy changes that decrease the time to develop new vaccines for seasonal influenza A/H3N2 and decrease submission lags of HA sequences to public databases could improve our estimates of future and current populations, respectively."

      Figure 1a. I don't understand why the orange dot 1-month lag appears to be on the same scale as the 3-month/ideal timeline. 

      We apologize for the confusion with this figure. Our original goal was to show how the two factors in our study design (forecast horizons and sequence submission lags) interact with each other by showing an example of 3-month forecasts made with no lag (blue), ideal lag (orange), and realistic lag (green). To clarify these two factors, we have removed the two lines at the 3-month forecast horizon for the ideal and realistic lags and have updated the caption to reflect this simplification. The new figure looks like this:

      The authors should expand on the line "The finding of even a few sequences with a potentially important antigenic substitution could be enough to inform choices of vaccine candidate viruses." While people familiar with the VCM process will understand the implications of this statement the average reader will not fully understand the implications of this statement. Not only will it inform but it will allow the early production of vaccine seeds and reassortants that can be used in conventional vaccine production platforms if these early predictions were consolidated by the time of the VCM. This is because of the time it takes to isolate viruses, make reassortants and test them - usually a month or more is needed at a minimum. 

      Thank you for pointing out this unclear section of the discussion. We have rewritten this section, dropping the mention of prospective measurements of antigenic escape which now feels off-topic and moving the point about early detection of important antigenic substitutions to immediately follow the description of the candidate vaccine development timeline. This new placement should clarify the direct causal relationship between early detection and better choices of vaccine candidates. The original discussion section read:

      "For example, virologists must choose potential vaccine candidates from the diversity of circulating clades well in advance of vaccine composition meetings to have time to grow virus in cells and eggs and measure antigenic drift with serological assays (Morris et al., 2018; Loes et al., 2024). Similarly, prospective measurements of antigenic escape from human sera allow researchers to predict substitutions that could escape global immunity (Lee et al., 2019; Greaney et al., 2022; Welsh et al., 2023). The finding of even a few sequences with a potentially important antigenic substitution could be enough to inform choices of vaccine candidate viruses."

      The new section (lines 386-391) now reads:

      "For example, virologists must choose potential vaccine candidates from the diversity of circulating clades months in advance of vaccine composition meetings to have time to grow virus in cells and eggs and measure antigenic drift with serological assays (Morris et al. 2018; Loes et al. 2024). Earlier detection of viral sequences with important antigenic substitutions could determine whether corresponding vaccine candidates are available at the time of the vaccine selection meeting or not."

      A few lines in the discussion on current approaches being used to add to just the HA sequence analysis of H3N2 viruses (ferret/human sera reactivity) would be welcome.

      We have added the following sentences to the last paragraph (lines 391-397) to note recent methodological advances in estimating influenza fitness and the relationship these advances have to timely genomic surveillance.

      "Newer methods to estimate influenza fitness use experimental measurements of viral escape from human sera (Lee et al., 2019; Welsh et al., 2024; Meijers et al., 2025; Kikawa et al., 2025), measurements of viral stability and cell entry (Yu et al., 2025), or sequences from neuraminidase, the other primary surface protein associated with antigenic drift (Meijers et al., 2025). These methodological improvements all depend fundamentally on timely genomic surveillance efforts and the GISAID EpiFlu database to identify relevant influenza variants to include in their experiments."

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      van der Linden et al. report on the development of a new green-fluorescent sensor for calcium, following a novel rational design strategy based on the modification of the cyan-emissive sensor mTq2-CaFLITS. Through a mutational strategy similar to the one used to convert EGFP into EYFP, coupled with optimization of strategic amino acids located in proximity of the chromophore, they identify a novel sensor, G-CaFLITS. Through a careful characterization of the photophysical properties in vitro and the expression level in cell cultures, the authors demonstrate that G-CaFLITS combines a large lifetime response with a good brightness in both the bound and unbound states. This relative independence of the brightness on calcium binding, compared with existing sensors that often feature at least one very dim form, is an interesting feature of this new type of sensors, which allows for a more robust usage in fluorescence lifetime imaging. Furthermore, the authors evaluate the performance of G-CaFLITS in different subcellular compartments and under two-photon excitation in Drosophila. While the data appears robust and the characterization thorough, the interpretation of the results in some cases appears less solid, and alternative explanations cannot be excluded.

      Strengths:

      The approach is innovative and extends the excellent photophysical properties of the mTq2-based sensors to more red-shifted variants. While the spectral shift might appear relatively minor, as the authors correctly point out, it has interesting practical implications, such as the possibility to perform FLIM imaging of calcium using widely available laser wavelengths, or to reduce background autofluorescence, which can be a significant problem in FLIM.

      The screening was simple and rationally guided, demonstrating that, at least for this class of sensors, a careful choice of screening positions is an excellent strategy to obtain variants with large FLIM responses without the need of high-throughput screening.

      The description of the methodologies is very complete and accurate, greatly facilitating the reproduction of the results by others, or the adoption of similar methods. This is particularly true for the description of the experimental conditions for optimal screening of sensor variants in lysed bacterial cultures.

      The photophysical characterization is very thorough and complete, and the vast amount of data reported in the supporting information is a valuable reference for other researchers willing to attempt a similar sensor development strategy. Particularly well done is the characterization of the brightness in cells, and the comparison on multiple parameters with existing sensors.

      Overall, G-CaFLITS displays excellent properties for a FLIM sensor: very large lifetime change, bright emission in both forms and independence from pH in the physiological range.

      Weaknesses:

      The paper demonstrates the application of G-CaFLITS in various cellular subcompartments without providing direct evidence that the sensor's response is not affected by the targeting. Showing at least that the lifetime values in the saturated state are similar in all compartments would improve the robustness of the claims.

      In some cases, the interpretation of the results is not fully convincing, leaving alternative hypotheses as a possibility. This is particularly the case for the claim of the origin of the strongly reduced brightness of G-CaFLITS in Drosophila. The explanation of the intensity changes of G-CaFLITS also shows some inconsistency with the basic photophysical characterization.

      While the claims generally appear robust, in some cases they are conveyed with a lack of precision. Several sentences in the introduction and discussion could be improved in this regard. Furthermore, the use of the signal-to-noise ratio as a means of comparison between sensors appears to be imprecise, since it is dependent on experimental conditions.

      We thank the reviewer for a thorough evaluation and for suggestions to improve our manuscript. We are happy with the recognition of the strengths of this work. The list of weaknesses contains several valid points, which will be addressed in a point-by-point reply and a revision.

      Reviewer #2 (Public review):

      Summary:

      Van der Linden et al. describe the addition of the T203Y mutation to their previously described fluorescence lifetime calcium sensor Tq-Ca-FLITS to shift the fluorescence to green emission. This mutation was previously described to similarly red-shift the emission of green and cyan FPs. Tq-Ca-FLITS_T203Y behaves as a green calcium sensor with opposite polarity compared with the original (lifetime goes down upon calcium binding instead of up). They then screen a library of variants at two linker positions and identify a variant with slightly improved lifetime contrast (Tq-Ca-FLITS_T203Y_V27A_N271D, named G-Ca-FLITS). The authors then characterize the performance of G-Ca-FLITS relative to Tq-Ca-FLITS in purified protein samples, in cultured cells, and in the brains of fruit flies.

      Strengths:

      This work is interesting as it extends their prior work generating a calcium indicator scaffold for fluorescent protein-based lifetime sensors with large contrast at a single wavelength, which is already being adopted by the community for production of other FLIM biosensors. This work effectively extends that from cyan to green fluorescence. While the cyan and green sensors are not spectrally distinct enough (~20-30nm shift) to easily multiplex together, it at least shifts the spectra to wavelengths that are more commonly available on commercial microscopes.

      The observations of organellar calcium concentrations were interesting and could potentially lead to new biological insight if followed up.

      Weaknesses:

      (1) The new G-Ca-FLITS sensor doesn't appear to be significantly improved in performance over the original Tq-Ca-FLITS, no specific benefits are demonstrated.

      (2) Although it was admirable to attempt in vivo demonstration in Drosophila with these sensors, depolarizing the whole brain with high potassium is not a terribly interesting or physiological stimulus and doesn't really highlight any advantages of their sensors; G-Ca-FLITS appears to be quite dim in the flies.

      We thank the reviewer for a thorough evaluation and for suggestions to improve our manuscript. Although the spectral shift of the green variant is modest, we have added new data (figure 7) to the manuscript that demonstrates multiplex imaging of G-Ca-FLITS and Tq-Ca-FLITS.

      As for the listed weaknesses we respond here:

      (1) Although we agree that the performance in terms of dynamic range is not improved, the advantage of the green sensor over the cyan version is that the brightness is high in both states.

      (2) We agree that the performance of G-Ca-FLITS is disappointing in Drosophila. We feel that this is important data to report, and it makes it clear that Tq-Ca-FLITS is a better choice for this system. Depolarization of the entire brain was done to measure the maximal lifetime contrast.

      Reviewer #3 (Public review):

      Summary:

      The authors present a variant of a previously described fluorescence lifetime sensor for calcium. Much of the manuscript describes the process of developing appropriate assays for screening sensor variants, and thorough characterization of those variants (inherent fluorescence characteristics, response to calcium and pH, comparisons to other calcium sensors). The final two figures show how the sensor performs in cultured cells and in vivo Drosophila brains.

      Strengths:

      The work is presented clearly and the conclusion (this is a new calcium sensor that could be useful in some circumstances) is supported by the data.

      Weaknesses:

      There are probably few circumstances where this sensor would facilitate experiments (calcium measurements) for which other sensors would prove insufficient.

      We thank the reviewer for the evaluation of our manuscript. As for the indicated weakness, we agree that the main application of genetically encoded calcium biosensors is to measure qualitative changes in calcium. However, it can be argued that due to a lack of tools the absolute quantification has been very challenging. Now, thanks to large contrast lifetime biosensors the quantitative measurements are simplified, there are new opportunities, and the probe reported here is an improvement over existing probes as it remains bright in both states, further improving quantitative calcium measurements.

      Reviewer #1 (Recommendations for the authors):

      While the science in the paper appears solid, the methods well grounded and excellently documented, the manuscript would benefit from a revision to improve the clarity of the exposition. In particular:

      Part of the introduction appears like a patchwork of information with poor logical consequentiality. The authors rapidly pass from the impact of brightness on FLIM accuracy, to mitochondrial calcium in pathology, to the importance of the sensor's affinity, to a sentence on sensor's kinetics, to fluorescent dyes and bioluminescence, to conclude that sensors should be stable at mitochondrial pH. I highly recommend rewriting this part.

      We thank the referee for the comment, and we have adjusted the introduction to better connect its parts and improve the logical flow. The updated introduction addresses all the feedback by the reviewers on different aspects of the introductory text, and we have removed the section on dyes and bioluminescence. We feel that the introduction is better structured now.

      The reference to particular amino acid positions would greatly benefit from including images of the protein structure in which the positions are highlighted, similar to what the same authors do in their fluorescent protein development papers. While in the case of sensors a crystal structure might be lacking, highlighting the positions with respect to an AlphaFold-generated structure or the structure of mTq2 might still be helpful.

      We appreciate this remark and we have added a sequence alignment of the FLITS probes to supplemental Figure S4. This shows the residues with their numbers, and we have also highlighted the different domains, linkers and mutations. We think that this linear representation works better than a 3D structure (one issue is that AlphaFold fails to display the chromophore, and it usually has poor confidence for linker residues).

      The use of SNR, as defined by the authors (mean of the lifetime divided by standard deviation) appears a poorly suited parameter to compare sensors, as it depends on the total number of collected photons and on the strength of the algorithms used to retrieve the lifetime value. In an extreme example, if one would collect uniform images with millions of photons per pixel, most likely SNR would be extremely good for all sensors in all states, irrespective of the fact that some states are dimmer (within reasonable limits). On the other hand, if the same comparison would be performed at a level of thousands or hundreds of photons per pixel, the effect of different brightness on the SNR would be much more dramatic. While in general I fully agree with the core concept of the paper, i.e. that avoiding low-brightness forms leads more easily to experiments with higher SNR, I would suggest to stick to comparing the sensors in terms of brightness and refer to SNR (if needed) only when describing the consequences on measurements.

      The reviewer is right that in absolute terms the SNR is not meaningful. In addition to acquisition time, it depends on expression levels. Yet, it is possible to compare the change in SNR between the apo- and saturated states, and that is what is shown in figure 5. We have added text to better explain that the change in SNR is relevant here:

      “The absolute SNR is not relevant here, as it will depend on the expression level and acquisition time. But since we have measured the two extremes in the same cells, we can evaluate how the SNR changes between these states for each separate probe”
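
      The reviewer's photon-count argument can be illustrated with a small simulation. This is a sketch under simplifying assumptions (a mono-exponential decay, no background or instrument response, and the lifetime of each pixel estimated as the mean photon arrival time); it is not the analysis used for Figure 5, and the function name and parameters are hypothetical. Under these assumptions the SNR, defined as the mean lifetime divided by its standard deviation, grows roughly as the square root of the number of photons per pixel.

```python
import random
import statistics


def lifetime_snr(tau_ns, photons_per_pixel, n_pixels=500, seed=1):
    """Estimate SNR = mean / standard deviation of per-pixel lifetime estimates.

    Each pixel's lifetime is estimated as the mean arrival time of
    `photons_per_pixel` photons drawn from a mono-exponential decay with
    lifetime `tau_ns` (an idealized, assumed decay model).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_pixels):
        arrivals = [rng.expovariate(1.0 / tau_ns) for _ in range(photons_per_pixel)]
        estimates.append(statistics.fmean(arrivals))
    return statistics.fmean(estimates) / statistics.stdev(estimates)
```

      Calling `lifetime_snr(3.0, n)` for `n` of 100, 400 and 1600 should give values that roughly double with each fourfold increase in photons. Because a dim state yields fewer photons in the same acquisition, this is why comparing the change in SNR between the two states of one probe in the same cells is more informative than the absolute value.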

      Some statements from the authors or aspects of the paper appear problematic:

      (1) "Additionally, the fluorescence of most sensors is a non-linear function of calcium concentration, usually with Hill coefficients between 2 and 3. This is ideal when the probe is used as a binary detector for increases in Ca2+ concentrations, but it makes robust quantification of low, or even intermediate, calcium concentrations extremely challenging."

      To the best of my knowledge, for all sensors the fluorescence response is a nonlinear function of calcium concentrations. If the authors have specific examples in mind in which this is not true, they should cite them specifically. Furthermore, the Hill coefficient defines the range of concentrations in which the sensor operates, while the fact that "low concentrations" might be hard to detect depends only on the dim fluorescence of some sensors in the unbound form.

      We agree with the reviewer that this part was not clearly written and was confusing. The sentence “Additionally, the fluorescence of most sensors is a non-linear function of calcium concentration, usually with Hill coefficients between 2 and 3” was not relevant in this section, so we removed it. Now it reads:

      “Many GECIs harboring a single fluorescent protein (FP), like GCaMPs, are optimized for a large intensity change, and have a (very) dim state when calcium levels are below the KD of the probe (Akerboom et al., 2013; Dana et al., 2019; Shen et al., 2018; Zhang et al., 2023; Zhao et al., 2011). This is ideal when the probe is used as a binary detector for increases in Ca2+ concentrations, but it makes robust quantification of low, or even intermediate, calcium concentrations extremely challenging”

      (2) "The affinity of a sensor is of major importance: a low KD can underestimate high concentrations and vice versa."

      It is not clear to me why the concentrations would be underestimated, rather than just being less precise. Also, if a calibration curve is plotted in linear scale rather than logarithmic scale, it appears that the precision problem is much more severe near saturation (where low lifetime changes result in large concentration changes) than near zero (where low concentration changes produce large lifetime changes).

      We agree that this could be better explained. What we meant to say is that concentrations that are ~10x lower or higher than the KD cannot be precisely measured. See also our reply to the next comment.

      (3) "Differences can also arise due to the method of calibration, i.e. when the absolute minimum and maximum signal are not reached in the calibration procedure (Fernandez-Sanz et al., 2019)."

      Unless better explained, this appears obvious and not worth mentioning.

      What may be obvious to the reviewer (and to us) may not be obvious to the reader, and that’s why this is included. To make it clearer we rephrased this part as a list of four items:

      “Accurate determination of the affinity of a sensor is important and there are several issues that need to be considered during the calibration and the measurements: (i) the concentrations can only be measured with sufficient precision when it is in the range between 10x K<sub>D</sub> and 1/10x K<sub>D</sub>, (ii) the calibration is only valid when the two extremes are reached during the calibration procedure (Fernandez-Sanz et al., 2019), (iii) the sensor’s kinetics should be sufficiently fast enough to be able to track the calcium changes, and (iv) the biosensor should be compatible with the high mitochondrial pH of 8 (Cano Abad et al., 2004; Llopis et al., 1998).”
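
      Item (i) in the list above can be made concrete with the standard Hill binding equation. The sketch below assumes that sensor occupancy follows θ = [Ca]^n / (K_D^n + [Ca]^n); the function names are illustrative and the default Hill coefficient of 1 is an assumption, not a value reported for the probes discussed here. It also treats occupancy, not lifetime, as the measured quantity, since the mapping from lifetime to occupancy is not specified in this response.

```python
def fraction_bound(ca, kd, hill=1.0):
    """Hill equation: fraction of sensor in the calcium-bound state."""
    return ca**hill / (kd**hill + ca**hill)


def concentration_from_fraction(theta, kd, hill=1.0):
    """Invert the Hill equation; theta must lie strictly between 0 and 1."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must be strictly between 0 and 1")
    return kd * (theta / (1.0 - theta)) ** (1.0 / hill)


def relative_error_amplification(theta, hill=1.0):
    """Relative concentration error per unit absolute error in theta.

    Derived from d ln[Ca] / d theta = 1 / (hill * theta * (1 - theta)).
    """
    return 1.0 / (hill * theta * (1.0 - theta))
```

      With a Hill coefficient of 1, concentrations from 1/10x to 10x K<sub>D</sub> span occupancies from about 9% to about 91%. The error amplification is 4 at half occupancy, about 12 at either edge of this window, and about 100 at 100x K<sub>D</sub>, which is why concentrations far from the K<sub>D</sub> cannot be recovered with useful precision.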

      (4) In the experiments depicted in Figure 6C the underlying assumption is that the sensor behaves in the same way independently of the compartment to which it is targeted. This is not necessarily the case. It would be valuable to see the plots of Figure 6C and D discussed in terms of lifetime. Is the saturating lifetime value the same in all compartments?

      This is a valid point and we have now included a plot with the actual lifetime data for each of the organelles (figure S15). 

      We have also added text to discuss this point: “We note that the underlying assumption of the quantification of organellar calcium concentrations is that the lifetime contrast is the same. This is broadly true for most of the measurements (Figure S15). Yet, there are also differences. It is currently unclear whether the discrepancies are due to differences in the physicochemical properties of the compartments, or whether there is a technical reason (the efficiency of ionomycin for saturating the biosensor in the different compartments is unknown, as far as we know). This is something that is worth revisiting. A related issue that deserves attention is the level of agreement between in vitro and in vivo calibrations.”

      (5) A similar problem arises for the observation of different calcium levels in peripheral mitochondria. In figure S11b, the values of the two lifetime components of a biexponential fit are displayed. Both the long and short components seem to be different. This is an interesting observation, as in an ideal sensor (in which the "long lifetime conformation" is the same whether the sensor is bound to the analyte or not, and similarly for the short lifetime one) those values should be identical. While it is entirely possible that this is not the case for G-CaFLITS, since the authors have conducted a calibration experiment using time-domain FLIM, could they show the behavior of the lifetimes and preamplitudes? Are the trends consistent with their interpretation of a different calcium level in the two mitochondrial populations?

      We have analyzed the calibration data from TCSPC experiments done with the Leica Stellaris. From these data (acquired at high photon counts, as it is purified protein in solution), we infer that both the short and long lifetimes change as a function of calcium concentration. In particular, the long lifetime shows a substantial change, which we cannot explain at this moment. We agree that this is interesting and may give insight into the conformational changes that give rise to the lifetime change.

      The lifetime data of the mitochondria were acquired with a different FLIM setup, but the trend is consistent: both the long and short lifetimes decrease in the peripheral mitochondria, which have a higher calcium concentration.

      Author response image 1

      (6) "The lifetime response of Tq-Ca-FLITS and the ΔF/F response of jGCaMP7f resembled each other, with both signals gradually increasing over the span of 3-4 minutes after we increased external [K+]; the two signals then hit a plateau for ~1 min, followed by a return to baseline and often additional plateaus (Figure 8B-C). By comparison, G-Ca-FLITS responses were more variable, typically exhibiting a smaller ramping phase and seconds-long spikes of activity rather than minutes-long plateaus (Figure 8C)."

      This statement does not appear fully consistent with the data in Figure 8. While in figure 8B it looks like GCaMP and mTq-CaFLITS have very similar profiles, these curves come from one single experiment out of a very variable dataset (see Figure 8C). If one would for example choose the second curve of GCaMP in Figure 8C, it would look very similar to the response of G-CaFLITS in figure 8B, and the argument would be reversed. What do the averages look like?

      Indeed, the dynamics of the responses are very variable and we do not want to draw attention to these differences in the dynamics, so we have removed the comparison. Instead, the differences in intensity change and lifetime contrast are of importance here. To answer the question of the reviewer, we have added a new panel (D), which shows the average responses for each of the GECIs.

      (7) "Although the calibration is equipment independent under ideal conditions, and only needs to be performed once, we prefer to repeat the calibration for different setups to account for differences in temperature or pulse frequency."

      While I generally agree with the statement, it is imprecise. A change in temperature is generally expected to affect the Kd, so rather than "preferring to repeat", it is a requirement for accurate quantification at different concentrations. I am not sure I understand what the pulse frequency is in this context, and how it affects the Kd.

      We thank the referee for pointing out that our text is imprecise and confusing. What we meant to say is that we see differences between different set-ups and we have clarified this by changing the text. We have also added that it is “necessary” to repeat the calibration:

      “Although the calibration is equipment independent under ideal conditions, and only needs to be performed once, we do see differences between different set-ups. Therefore, it is necessary to repeat the calibration for different set-ups.”

      (8) "A recent effort to generate a green emitting lifetime biosensor used a GFP variant as a template (Koveal et al., 2022), and the resulting biosensor was pH sensitive in the physiological range. On the other hand, biosensors with a CFP-like chromophore are largely pH insensitive (van der Linden et al., 2021; Zhong et al., 2024)."

      The dismissal of the use of T-Sapphire as a pH independent template is inaccurate. The same group has previously reported other sensors (SweetieTS for glucose and Peredox for redox ratio) that are not pH sensitive. Furthermore, in Koveal et al. also many of the mTq2-based variants showed a pH response, suggesting that the pH dependence for the Lilac sensor might be more complex. Still, G-CaFLITS presents advantages in terms of the possibility to excite at longer wavelengths, which could be mentioned instead.

      We only want to make the point that adding the T203Y mutation to Turquoise-based lifetime biosensors may be a good approach for generating pH insensitive green biosensors. There is no point in dismissing other green biosensors and we have changed the text to: “Since biosensors with a CFP-like chromophore are largely pH insensitive (van der Linden et al., 2021; Zhong et al., 2024), and we show here that the pH independence is retained for the Green Ca-FLITS, we expect that adding the T203Y mutation to a cyan sensor is a good approach for generating pH-insensitive green lifetime-based sensors.”

      (9) "Usually, a higher QY results in a higher intensity; however, in G-Ca-FLITS the open state has a differential shaped excitation spectrum which leads to a decreased intensity. These effects combined have resulted in a sensor where the two different states have a similar intensity despite displaying a large QY and lifetime contrast."

      This statement does not seem to reflect the excitation spectra of Figure 1. If this explanation would be true, wouldn't there be an isoemissive point in the excitation spectrum (i.e. an excitation wavelength at which emission intensity would not change)?

      The excitation spectra in figure 1 are not ideal for the interpretation as these are not normalized. The normalized spectra are shown in figure S10, but for clarity we show the normalized spectra here below as well. For the FD-FLIM experiments we used a 446 nm LED that excites the calcium bound state more efficiently. Therefore, the lower brightness due to a lower QY of the calcium bound state is compensated by increased excitation. So the limited change in intensity is excitation wavelength dependent. We have added a sentence to the discussion to stress this:

      “The smallest intensity change is obtained when the calcium-bound state is preferably excited (i.e. near 450 nm) and the effect is less pronounced when the probe is excited near its peak at 474 nm”   

      (10) "We evaluated the use of Tq-Ca-FLITS and G-Ca-FLITS for 2P-FLIM and observed a surprisingly low brightness of the green variant in an intact fly brain. This result is consistent with a study finding that red-shifted fluorescent-protein variants that are much brighter under one-photon excitation are, surprisingly, dimmer than their blue cousins in multi-photon microscopy (Molina et al., 2017). The responses of both probes were in line with their properties in single photon FLIM, but given the low brightness of G-Ca-FLITS under 2-photon excitation, the Tq-Ca-FLITS may be a better choice for 2P-FLIM experiments."

      The differences appear strikingly high, and it seems improbable that a reduction in two-photon absorption coefficient might be the sole cause. How can the authors rule out a problem in expression (possibly organism-specific)?

      The reviewers are correct that the differences in brightness between G-Ca-FLITS and Tq-Ca-FLITS may arise from differences in expression levels. It is difficult to calibrate for these differences explicitly without a stable reference fluorophore. However, both the G-Ca-FLITS and Tq-Ca-FLITS transgenic flies were produced with the same plasmid backbone (the Janelia 20x-UAS-IVS plasmid), inserted at the same site (VK00005) of the same genetic background, and crossed to the same Janelia driver line (R60D05-Gal4). At the level of the transcriptional machinery or genetic regulatory landscape, the two lines are therefore probably identical except for the few base pair differences between the G-Ca-FLITS and Tq-Ca-FLITS sequences. However, the same level of transcription may not correspond to the same amount of stable protein in the ellipsoid body, so we cannot rule out organism-specific problems in expression. To examine the 2P excitation efficiency relative to the 1P excitation efficiency, we have measured the fluorescence intensity of purified G-Ca-FLITS and Tq-Ca-FLITS on beads. See also the response to reviewer 3 and supplemental figure S14.

      Suggestions

      (1) The underlying assumption of any experiment using a biosensor is that the concentration of the biosensor should be roughly 2 orders of magnitude lower than the concentration of the analyte, otherwise the calibration equations do not hold. When measuring nM concentrations of calcium, this problem can be in principle very significant, as the concentration of the sensor in cells is likely in the low micromolar range. Calcium regulation by the cell should compensate for the problem, and the equations should hold. However, this might not hold true during experimental conditions that would disrupt this tight regulation. It might be a good thing to add a sentence to inform users about the limitations in interpreting calcium concentration data under such conditions.

      Good point. We have added this to the discussion: “All calcium indicators also act as buffers, and this limits the accuracy of the absolute measurements, especially for the lower calcium concentrations (Rose et al., 2014), as the expression of the biosensor is usually in the low micromolar range.”
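
      The buffering effect described above can be illustrated with a minimal sketch that solves the standard 1:1 binding equilibrium including ligand depletion. The total calcium and sensor concentrations are hypothetical, the Kd of 209 nM is simply the 37 °C value quoted elsewhere in this response, and the cell's own buffers and calcium regulation are ignored; it is not the authors' calibration procedure.

```python
import math

def bound_complex(total_ca, total_sensor, kd):
    """Concentration of the calcium-sensor complex for 1:1 binding.

    All arguments share one unit (nM here). Solves
    kd = (total_ca - c) * (total_sensor - c) / c for the physical root c.
    """
    b = total_ca + total_sensor + kd
    return (b - math.sqrt(b * b - 4.0 * total_ca * total_sensor)) / 2.0

def free_calcium(total_ca, total_sensor, kd):
    """Free calcium remaining after the sensor has taken up its share."""
    return total_ca - bound_complex(total_ca, total_sensor, kd)

if __name__ == "__main__":
    KD_NM = 209.0        # illustrative; value quoted for 37 °C in this response
    TOTAL_CA_NM = 200.0  # hypothetical total calcium available to the sensor
    for sensor_nm in (1.0, 1000.0):  # hypothetical sensor concentrations
        free = free_calcium(TOTAL_CA_NM, sensor_nm, KD_NM)
        print(f"sensor {sensor_nm:7.1f} nM -> free Ca {free:6.1f} nM")
```

      In this toy setting, a sensor at trace concentration leaves free calcium close to the total, whereas a micromolar sensor binds most of it, which is why absolute readings at low calcium should be interpreted with care.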

      (2) Different methods of lifetime "averaging", such as intensity or amplitude-weighted lifetime in time domain FLIM or phase and modulation in frequency domain might lead to different Kd in the same calibration experiment. This is an underappreciated factor that might lead to errors by users. Since the authors conducted calibrations using both frequency and time-domain, it would be useful to mention this fact and maybe add a table in the Supporting Information with the minima, maxima and Kds calculated using different lifetime averaging methods.

      To avoid biases due to fitting we prefer to use the phasor plot, this can be used for both frequency and time-domain methods and we added a sentence to the discussion to highlight this: “We prefer to use the phasor analysis (which can be used for both frequency- and time-domain FLIM), as it makes no assumptions about the underlying decay kinetics.”
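
      As a rough illustration of the phasor representation mentioned above, the sketch below computes phasor coordinates for single-exponential decays and for an intensity-weighted mixture of two species. The lifetimes and the 40 MHz modulation frequency are hypothetical values chosen for the example; single-exponential components are used here only to generate reference points, whereas in practice the phasor of a measured pixel is obtained directly from the data without fitting.

```python
import math

def phasor(tau_ns, freq_mhz):
    """Phasor (g, s) of a single-exponential decay with lifetime tau_ns."""
    omega_tau = 2.0 * math.pi * freq_mhz * 1e6 * tau_ns * 1e-9
    denom = 1.0 + omega_tau ** 2
    return 1.0 / denom, omega_tau / denom

def mixture_phasor(weight_a, phasor_a, phasor_b):
    """Phasor of a two-species mixture.

    weight_a is the fraction of the detected intensity contributed by
    species a; the mixture lies on the straight line between both phasors.
    """
    g = weight_a * phasor_a[0] + (1.0 - weight_a) * phasor_b[0]
    s = weight_a * phasor_a[1] + (1.0 - weight_a) * phasor_b[1]
    return g, s

if __name__ == "__main__":
    FREQ_MHZ = 40.0                  # hypothetical modulation frequency
    short = phasor(1.5, FREQ_MHZ)    # hypothetical short-lifetime state
    long_ = phasor(4.0, FREQ_MHZ)    # hypothetical long-lifetime state
    print("short:", short)
    print("long: ", long_)
    print("50/50 intensity mix:", mixture_phasor(0.5, short, long_))
```

      Single-exponential species fall on the universal semicircle, while any mixture falls on the chord between its components, so the position along that chord can be read off without choosing a decay model.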

      (3) The origin of the redshift observed in G-Ca-FLITS is likely pi-stacking, similar to the EGFP-to-EYFP case. Previous studies suggest that for mTq2-based sensors a change in rigidity would lead to a change in the non-radiative rate, which would result in similar changes in quantum yield and (amplitude-weighted average) lifetime. If pi-stacking plays a role, there could be an additional change in the radiative rate (as suggested also by the change in absorption spectra). Could this play a role in the relation between brightness and lifetime in G-Ca-FLITS? Given the extensive data collected by the authors, it should be possible to comment on these mechanistic aspects, which would be useful to guide future design.

      We do appreciate this suggestion, but we currently do not have the data to answer this question. The inverted response that we observe, solely due to the introduction of the tyrosine is puzzling. Perhaps introduction of the mutation that causes the redshift in other cyan probes will provide more insight.

      Reviewer #2 (Recommendations for the authors):

      Specific points:

      The first section of Results is basically a description of how they chose the lysis conditions for screening in bacteria. I didn't see anything particularly novel or interesting about this, anyone working with protein expression in bacteria likely needs to optimize growth, lysis, purification, etc. This section should be moved to the Methods.

      As reviewer 1 lists the thorough documentation of this approach as one of the strengths, we prefer to keep it like this. We see this section as method development, rather than purely a method. If this section were moved to the Methods, it would remain largely invisible, and we think that would be a shame. Readers who are not interested can easily skip this section.

      In the Results section Characterization of G-Ca-FLITS, the authors state "Here, the calcium affinity was KD = 339 nM, higher compared to the calibration at 37 °C. This is in line with the notion that binding strength generally increases with decreasing temperature." However, the opposite appears to be true - at 37 °C they measured a KD of 209 nM which would represent higher binding strength at higher temperature.

      Thanks for catching this; we made a mistake. We have rephrased this to “higher compared to the calibration at 37 °C. This is unexpected as it is not in line with the notion that binding strength generally increases with decreasing temperature.”

      In Figure 8c, there should be a visual indicator showing the onset of application of high potassium, as there is in 8b.

      This is a good suggestion; a grey box has been added to indicate the time when high K+ saline was perfused.

      Reviewer #3 (Recommendations for the authors):

      I think the science of the manuscript is sound and the presentation is logical and clear. I have some stylistic recommendations.

      Supp Fig 1: The figure requires a bit of "eyeballing" to decide which conditions are best, and figuring out which spectra matched the final conditions took a little effort. Is there a way to quantify the fluorescence yield to better show why the one set of conditions was chosen? If it was subjective, then at least highlight the final conditions with a box around the spectra, making it a different colour, or something to make it stand out.

      Thanks for the comment; we added a green box.

      Supp Fig 3: Similar suggestion. Highlight the final variant that was carried forward (T203Y). The subtle differences in spectra are hard to discern when they are presented separately. How would it look if they were plotted all on one graph? Or if each mutant were presented as a point on a graph of Peak Em vs Peak Ex? Would T203Y be in the top right?

      We have added a light blue box for reference to make the differences clearer.

      Supp Fig 4 & Fig 1: Too much of the graph shows the uninteresting tails of the spectra and condenses the interesting part. Plotting from 400 nm to 600 nm would be more informative.

      We appreciate the suggestion but disagree. We prefer to show the spectra in their entirety, including the tails. The data will be available so other plots can be made by anyone.

      Fig 3a: People who are not experts in lifetime analysis are probably not very familiar with the phase/modulation polar plot. There should be an additional sentence or two in the main text that _briefly_ describes the basis for making the polar plot and the transformation to the fractional saturation plot in 3B. I can't think of a good way to transform Eq 3 from Supp Info into a sentence, but that's what I think is needed to make this transformation clearer.

      We appreciate the suggestion and feel that it is well explained here:

      "The two extreme values (zero calcium and 39 μM free calcium) are located on different coordinates in the polar plot and all intermediate concentrations are located on a straight line between these two extremes. Based on the position in the polar plot, we determined the fraction of sensor in the calcium-bound state, while considering the intensity contribution of both states"  
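
      The transformation from a position in the polar plot to a bound fraction can be sketched as follows. This is a generic two-state calculation under stated assumptions, not a reproduction of the authors' Eq 3: the measured point is projected onto the line between the two reference positions to obtain the intensity fraction of the calcium-bound state, which is then corrected with an assumed brightness ratio of the bound over the free state. All coordinates and the brightness ratio are hypothetical.

```python
def intensity_fraction(point, free_ref, bound_ref):
    """Project a polar-plot point onto the line free_ref -> bound_ref.

    Returns the fraction of detected intensity assigned to the bound
    state, clipped to the range 0 (free reference) to 1 (bound reference).
    """
    dx = bound_ref[0] - free_ref[0]
    dy = bound_ref[1] - free_ref[1]
    t = ((point[0] - free_ref[0]) * dx + (point[1] - free_ref[1]) * dy) / (
        dx * dx + dy * dy
    )
    return min(1.0, max(0.0, t))

def bound_fraction(point, free_ref, bound_ref, brightness_bound_over_free):
    """Fraction of sensor molecules in the bound state.

    A brighter bound state is over-represented in the intensity fraction,
    so the intensity fraction is re-weighted by the brightness ratio.
    """
    a = intensity_fraction(point, free_ref, bound_ref)
    return a / (a + (1.0 - a) * brightness_bound_over_free)

if __name__ == "__main__":
    free_ref = (0.80, 0.30)    # hypothetical zero-calcium position
    bound_ref = (0.40, 0.45)   # hypothetical saturating-calcium position
    measured = (0.60, 0.375)   # hypothetical measurement, midway on the line
    print(bound_fraction(measured, free_ref, bound_ref, 2.0))
```

      When both states are equally bright, the molecular fraction equals the intensity fraction; otherwise the two diverge, which is why the intensity contribution of both states has to be considered.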

      Fig 4: The figure is great, and I love the comparison of different calcium sensors. But where is Tq-Ca-FLITS? I get that this is a figure of green calcium sensors, but it would be nice to see Tq-Ca-FLITS in there as well. The G-Ca-FLITS is compared to Tq-Ca-FLITS in Fig 5. Maybe I'm just missing why the bottom panel of Fig 5 cannot be replotted and included in Fig 4.

      The point is that we compare all the data with identical filter sets, i.e. for green FPs. Using these ex/em settings, the Tq probe would seriously underperform. Note that the data in Fig. 5 is not normalized to a reference RFP and can therefore not be compared with the data presented in Fig. 4.

      Fig 6: The BOEC data could easily be moved to Supp Figs. It doesn't contribute much relevant info.

      We are not keen on moving data to the supplement, as supplemental data is too often ignored. Moreover, we think that the BOEC data is valuable (as BOECs are primary cells and therefore a good model of a healthy human cell) and deserves a place in the main manuscript.

      2P FLIM / Fig 8 / Fig S4: The lack of brightness of G-Ca-FLITS in the 2P FLIM of fruit fly brain could have been predicted with a 2P cross section of the purified protein. If the equipment to perform such measurements is available, it could be incorporated into Fig S4.

      Unfortunately, we do not have access to equipment that measures the 2P cross section. As an alternative, we compared the 2P excitation efficiency with 1P excitation efficiency. To this end, we have used beads that were loaded with purified G-Ca-FLITS or Tq-Ca-FLITS. We have evaluated the fluorescence intensity of the beads using 1P (460 nm) and 2P (920 nm) excitation. Although the absolute intensity cannot be compared (the G-Ca-FLITS beads have a lower protein concentration), we can compare the relative intensities when changing from 1P to 2P. The 2P excitation efficiency of G-Ca-FLITS is comparable (if not better) to that of Tq-Ca-FLITS. This excludes the option that the G-Ca-FLITS has poor 2P excitability. We will include this data as figure S12.

      We have also added text to the results: “We evaluated the relative brightness of purified Tq-Ca-FLITS and G-Ca-FLITS on beads by either 1-Photon Excitation (1PE) (at 460 nm) or 2-Photon Excitation (2PE) (at 920 nm) and observed a similar brightness between the two modes of excitation (figure S14). This shows that the two probes have similar efficiencies in 2PE and suggests that the low brightness of G-Ca-FLITS in Drosophila is due to lower expression or poor folding.” and discussion: “The responses of both probes were in line with their properties in single photon FLIM, but given the low brightness of G-Ca-FLITS under 2-photon excitation in Drosophila, the Tq-Ca-FLITS is a better choice in this system. Yet, the brightness of G-Ca-FLITS with 2PE at 920 nm is comparable to Tq-Ca-FLITS, so we expect that 2P-FLIM with G-Ca-FLITS is possible in tissues that express it well.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      Mancl et al. present cryo-EM structures of the Insulin Degrading Enzyme (IDE) dimer and characterize its conformational dynamics by integrating structures with SEC-SAXS, enzymatic activity assays, and all-atom molecular dynamics (MD) simulations. They present five cryo-EM structures of the IDE dimer at 3.0-4.1 Å resolution, obtained with one of its substrates, insulin, added to IDE in a 1:2 ratio. The study identified R668 as a key residue mediating the open-close transition of IDE, a finding supported by simulations and experimental data. The work offers a refined model for how IDE recognizes and degrades amyloid peptides, incorporating the roles of IDE-N rotation and charge-swapping events at the IDE-N/C interface. 

      Strengths: 

      The study by Mancl et al. uses a combination of experimental (cryoEM, SEC-SAXS, enzymatic assays) and computational (MD simulations, multibody analysis, 3DVA) techniques to provide a comprehensive characterization of IDE dynamics. The identification of R668 as a key residue mediating the open-to-close transition of IDE is a novel finding, supported by both simulations and experimental data presented in the manuscript. The work offers a refined model for how IDE recognizes and degrades amyloid peptides, incorporating the roles of IDE-N rotation and charge-swapping events at the IDE-N/C interface. The study identifies the structural basis and key residues for IDE dynamics that were not revealed by static structures. 

      Weaknesses: 

      Based on MD simulations and enzymatic assays of IDE, the authors claim that the R668A mutation in IDE affects the conformational dynamics governing the open-closed transition, which leads to altered substrate binding and catalysis. The functional importance of R668 would be substantiated by enzymatic assays that included some of the other known substrates of IDE than insulin such as amylin and glucagon. 

      We have included amyloid beta in our enzymatic assays, as shown in Figure 5D, and have updated the manuscript text accordingly. The R668A mutation results in a loss of dose-dependent competition with amyloid beta, but not with insulin. To further substantiate this unexpected finding, we plan to undertake a comprehensive biochemical characterization of the R668A mutation across a variety of substrates, followed by structural analysis of this mutant. However, these investigations are beyond the scope of the current study and, if successful, warrant a separate publication.

      It is unclear to what extent the force field (FF) employed in the MD simulations favors secondary structures and if the lack of any observed structural changes within the IDE domains in the simulations - which is taken to suggest that the domains behave as rigid bodies - stems from bias by the FF. 

      We utilized the widely adopted CHARMM36 force field, whose parameters have been validated by thousands of previous studies. As shown in Figure 2A, our simulations reveal small but noticeable fluctuations in intradomain RMSD values. However, after careful examination, we found that these changes do not correspond to any biologically meaningful motions based on previously reported structural and biophysical characterizations of IDE (e.g., Shen et al., Nature 2006; Noinaj et al., PLOS One 2011; McCord et al., PNAS 2013; Zhang et al., eLife 2018, and references therein).

      Reviewer #2 (Public review): 

      Summary: 

      The manuscript describes various conformational states and structural dynamics of the Insulin degrading enzyme (IDE), a zinc metalloprotease by nature. Both open and closed-state structures of IDE have been previously solved using crystallography and cryo-EM which reveal a dimeric organization of IDE where each monomer is organized into N and C domains. C-domains form the interacting interface in the dimeric protein while the two N-domains are positioned on the outer sides of the core formed by C-domains. It remains elusive how the open state is converted into the closed state but it is generally accepted that it involves large-scale movement of N-domains relative to the C-domains. The authors here have used various complementary experimental techniques such as cryo-EM, SAXS, size-exclusion chromatography, and enzymatic assays to characterize the structure and dynamics of IDE protein in the presence of substrate protein insulin whose density is captured in all the structures solved. The experimental structural data from cryo-EM suffered from a high degree of intrinsic motion among the different domains and consequently, the resultant structures were moderately resolved at 3-4.1 Å resolution. A total of five structures were generated by cryo-EM. The authors have extensively used Molecular dynamics simulation to fish out important inter-subunit contacts which involve R668, E381, D309, etc residues. In summary, authors have explored the conformational dynamics of IDE protein using experimental approaches which are complemented and analyzed in atomic details by using MD simulation studies. The studies are meticulously conducted and lay the ground for future exploration of the protease structure-function relationship. 

      Reviewer #1 (Recommendations for the authors): 

      The manuscript reads well, however, there are minor details throughout that would tighten it up and, in some cases, make it easier to approach for a broader readership: 

      Abstract 

      (1) R668 is referred to by its one-letter code throughout the main text but referred to as arginine-668 in the abstract. The abstract should be corrected to R668. 

      This has been corrected.

      (2) The authors should consider reordering the significance of their work as it is listed at the end of the abstract. As the work first and foremost "offers the molecular basis of unfoldase activity of IDE and provides a new path forward towards the development of substrate-specific modulators of IDE activity" these should come before "the power of integrating experimental and computational methodologies to understand protein dynamics". 

      We have revised the abstract substantially to incorporate the new findings. Consequently, the sentence on "the power of integrating experimental and computational methodologies to understand protein dynamics" has been removed.

      Main text 

      (1) Cryo-EM is consistently referred to as cryoEM throughout the text. The commonly accepted format for referring to cryogenic electron microscopy is cryo-EM. The authors are asked to consider revising the text accordingly. 

      The text has been revised.

      (2) Introduction: The authors are asked to consider including a figure (panel) that provides the general reader with an overview of IDE architecture and topology as a point of reference in the introduction to understanding the pseudo symmetry in IDE, domains, and IDE-C relative to IDE-N, etc. This is relevant for reading most of the figures. 

      We have added a new figure 1 to provide the background and questions to be answered.

      (3) The authors should consider renaming some of the headers in the results section to include the main conclusion. For instance, "CryoEM structures of IDE in the presence of a sub-saturating concentration of insulin" is not really helpful for the reader to understand the work, while "R668A mediates IDE conformational dynamics in vitro" is. 

      The headings have been altered in an effort to be more informative.

      (4) It is unclear what the timescale for insulin cleavage is for IDE. Clearly, it is possible for the authors to capture an insulin-bound IDE from within the 7 million particles, but what is the chance of this? The authors emphasize the IDE:insulin ratio relative to previous experiments, but surely the kinetics would be the same in the two experiments that were presumably set up exactly the same way. In the context of this, the authors should disclose how concentrations were estimated experimentally. The authors are encouraged to touch upon the subject of time scales to tie up cryo-EM and enzyme experiments with MD simulations. 

      Both reviewers raised the question of the timescales relevant to IDE catalysis. In response, we have revised the manuscript to address the relevance of key kinetic timescales. Specifically, we now discuss the open/closed transition (~0.1 second) and insulin cleavage (~2/sec), both established experimentally in prior studies (McCord et al., PNAS 2013).

      IDE concentrations were determined by spectrometry (Nanodrop and/or Bradford assay), and its purity was confirmed to be greater than 90% by SDS-PAGE. Insulin was purchased commercially, weighed, and dissolved in buffer, with its concentration subsequently verified using Nanodrop. Catalytically inactive IDE and insulin were mixed and incubated for at least 30 minutes. Given IDE’s low nanomolar affinity for insulin, and the sub-stoichiometric insulin concentrations used, sufficient time was allowed for insulin to bind IDE and remain bound.

      To distinguish between IDE’s unfoldase and protease activities, all structural analyses were performed in the presence of EDTA, which chelates catalytic zinc, thereby inactivating IDE. This approach inhibits the enzyme’s catalytic cycle and allows us to capture the fully unfolded state of insulin bound to IDE in its closed conformation, representing the endpoint of the reaction. Under these conditions, the only meaningful kinetic parameter available for investigation was the unfolding of insulin by IDE.

      To elaborate the interaction between IDE and insulin in the catalytically relevant time regime, we investigated IDE–insulin interactions within the millisecond time regime by rapidly mixing IDE with a large molar excess of insulin for approximately 120 milliseconds for the cryo-EM single particle analysis. Under these conditions, we observed that both IDE subunits in the dimer predominantly adopt open states, which are distinct from those previously reported. This observation suggests a potential mechanism of allostery in IDE function. 
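
      To put the roughly 120 millisecond mixing time in context, the following back-of-the-envelope sketch compares it with the two timescales quoted above. It assumes, purely for illustration, that each process behaves as a single first-order step, with a rate of about 10 per second for the open/closed transition (taken as the inverse of ~0.1 second) and about 2 per second for insulin cleavage; the actual kinetics are likely more complex.

```python
import math

def probability_by(time_s, rate_per_s):
    """Probability that a single first-order step has occurred by time_s."""
    return 1.0 - math.exp(-rate_per_s * time_s)

if __name__ == "__main__":
    MIXING_TIME_S = 0.12        # approximate mixing time quoted in the response
    OPEN_CLOSE_RATE = 10.0      # assumed: inverse of the ~0.1 s transition time
    CLEAVAGE_RATE = 2.0         # ~2 per second, as quoted in the response
    print("open/closed transition:", probability_by(MIXING_TIME_S, OPEN_CLOSE_RATE))
    print("insulin cleavage:      ", probability_by(MIXING_TIME_S, CLEAVAGE_RATE))
```

      Under these simplifying assumptions, the mixing time is comparable to the open/closed transition time but short relative to a full catalytic turnover.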

      (5) It should be included in the main text that the data was processed with C1 symmetry and not just in Table 1. This is more useful information for understanding the study than the number of micrographs.  

      We have stated that the data was processed with C1 symmetry at the start of the results section.

      (6) The authors should consider adding speculation on what the approximately 6 million particles that did not yield a high-resolution structure represent. 

      In cryo-EM single particle analysis, particle selection is typically performed automatically using software such as Relion. Due to the low signal-to-noise ratio, many “junk particles”—originating from contaminants such as ice, impurities, aggregates, or incomplete particles—are inevitably included along with the particles of interest. It is standard practice to filter out these junk particles during data processing. In our case, we estimate that the majority of the 6 million particles are likely junk. However, we cannot fully exclude the possibility that some of these particles may originate from IDE and carry potentially useful information about its conformational heterogeneity. Nonetheless, current cryo-EM single particle analysis methods face significant challenges in objectively recovering and interpreting such particles.

      Reviewer #2 (Recommendations for the authors): 

      I have some minor comments regarding the manuscript which are given below. 

      (1) For O/O state, it will be great to see an explanation regarding why the values are dissimilar for 0.5 and 0.143 FSC. 

      All of our IDE structures (including previously published data) demonstrate a dip/plateau at moderate resolution in their FSCs. We interpret this as an indicator of structural heterogeneity, as the dip/plateau is smallest in the pC/pC state, becomes larger when one of the subunits is open, and is largest when both subunits are open. Because both subunits within the O/O state are highly heterogeneous, the FSC dipped below the 0.5 threshold. Other states, such as the O/pO, display the same FSC trend, but the dip remains slightly above the 0.5 threshold.
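
      The effect of such a dip on the two reported resolution values can be illustrated with a small sketch that reads the resolution off an FSC curve at a given threshold by linear interpolation. The curve below is invented for illustration and is not taken from the authors' data; it merely mimics a dip/plateau at moderate resolution.

```python
def resolution_at(fsc_curve, threshold):
    """Resolution (in Å) at which the FSC first drops below the threshold.

    fsc_curve is a list of (spatial frequency in 1/Å, FSC value) pairs
    sorted by increasing frequency. Returns None if no crossing is found.
    """
    for (f0, c0), (f1, c1) in zip(fsc_curve, fsc_curve[1:]):
        if c0 >= threshold > c1:
            # Linear interpolation between the two bracketing shells.
            f_cross = f0 + (c0 - threshold) * (f1 - f0) / (c0 - c1)
            return 1.0 / f_cross
    return None

if __name__ == "__main__":
    # Hypothetical FSC curve with a dip/plateau around 0.20-0.25 1/Å.
    curve = [
        (0.05, 1.00), (0.10, 0.95), (0.15, 0.60), (0.20, 0.45),
        (0.25, 0.42), (0.30, 0.20), (0.35, 0.05),
    ]
    print("FSC = 0.5:  ", resolution_at(curve, 0.5))
    print("FSC = 0.143:", resolution_at(curve, 0.143))
```

      In a curve of this shape, the 0.5 threshold is crossed where the dip begins, while the 0.143 threshold is only crossed after the plateau, so the two criteria report markedly different resolutions.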

      (2) O/pO state is moderately resolved at 4.1 Å, but this state is populated with many particles (328,870). Can the resolution be improved by more extensive sorting of heterogenous particles which intrinsically causes misalignment amongst particles? 

      Unfortunately, no. As shown by the local resolution maps in Figure 1-figure supplement 1, the primary source of misalignment is the IDE-N region in the open subunit. We have found that IDE-N is nearly unconstrained in its conformational flexibility in the open state and does not appear to adopt discrete states, so our attempts to better classify particles have failed. We speculate that this may be a limitation of k-means clustering-based classification, and this is part of the driving force behind our exploration of advanced methods of heterogeneity analysis.

      (3) Given the observation that capturing a substrate-bound open state is difficult, it can be assumed that the substrate capture in the catalytic cleft is a fast event. Please comment on the possible time frame of unfolding of substrate and catalysis. Can authors comment on any cryo-EM experiments that can deal with such a short time frame? If there is a possibility to include data from such experiments, then it may be considered.

      This has been addressed in conjunction with the previous reviewer’s comment (see above). Specifically, we now discuss the open/closed transition (~0.1 second) and insulin cleavage (~2/sec), both established experimentally in prior studies. Additionally, we investigated IDE–insulin interactions by rapidly mixing IDE with a large molar excess of insulin for approximately 120 milliseconds for the cryo-EM single particle analysis. Under these conditions, we observed that both IDE subunits in the dimer predominantly adopt open states, which are distinct from those previously reported. This observation suggests a potential mechanism of allostery in IDE function. 

      (4) How long was incubation time after adding any substrates, such as insulin? Can different incubation times be tested to generate additional information regarding other conformational states that lie in between open and closed states?  

      The incubation time for IDE with insulin prior to cryo-EM grid freezing was approximately 30 minutes. We agree that it would be exciting to explore shorter time frames to identify new conformational states. As discussed above, we have rapidly mixed IDE with a large molar excess of insulin for approximately 120 milliseconds for the cryo-EM single particle analysis. Under these conditions, we observed that both IDE subunits in the dimer predominantly adopt open states, which are distinct from those previously reported. This observation suggests a potential mechanism of allostery in IDE function.

      (5) A complex network of hydrogen bonding interaction initiated by R668 latching onto N-domain is mentioned in MD simulation studies but it is not clear why cryo-EM experiments did not capture such stabilized structures. 

      We believe that two main factors have prevented us from observing the hydrogen bonding network in our cryo-EM structures. The first factor is the requirement to freeze the sample in liquid ethane. According to the second law of thermodynamics, lowering the temperature reduces the effect of entropy. Our findings suggest that residue R668 interacts with several neighboring residues through a network of polar and electrostatic interactions, rather than being limited to a single partner. These interactions facilitate both the open-closed transitions and rotational movements between IDE-N and IDE-C. From a thermodynamic perspective, these interactions have both enthalpic and entropic components, and cooling the sample diminishes the entropic contribution. In line with this, we observe that the closed-state domains in our cryo-EM studies are positioned closer together than in our MD simulations, though not as tightly as in crystal structures of IDE. This implies that cryogenic data collection may constrain the interface between IDE-N and IDE-C, which can further alter the equilibrium for the network of R668 mediated interactions.

      Secondly, our cryo-EM structures represent ensemble averages of tens to hundreds of thousands of particles. MD simulations indicate that IDE-N and IDE-C can rotate relative to one another, resulting in considerable variability in residue interactions. However, the level of particle density in our cryo-EM data does not permit sufficiently fine classification to resolve these differences. As a result, distinct hydrogen bonding networks are likely averaged out in the ensemble structure, particularly in the case of R668, which is indicated to interact with multiple neighboring residues in a conformation-dependent manner. This averaging effect may also contribute to our inability to achieve resolutions below 3 Å.

      (6) Despite the observation that IDE is an intrinsically flexible protein, it seems probable that differently-sized substrates might reveal additional interaction networks formed by other novel key players apart from just R668. Will it be helpful to first try this computationally using MD simulations and then try to replicate this in cryo-EM experiments? If needed, additional simulation time may be added to the MD analysis. Please comment!  

      We agree that this is an exciting avenue to explore. Doubly so when considered in light of our R668A enzymatic results with amyloid beta. However, several challenges must be overcome before we can explore this direction effectively:

      (1) We lack experimental knowledge of the initial interaction event between IDE and substrate. All substrate-bound IDE structures have been obtained after unfolding and positioning for cleavage has occurred. Without a solid foundational model for the initial interaction event between IDE and substrate, the interpretation of subsequent MD simulations is open to question.

      (2) We have previously observed minimal effect of substrate on IDE in all-atom MD simulations. We believe that observable effects would require a much longer time scale than is currently achievable with all-atom MD, so we have turned to Upside, a coarse-grained method, to overcome these limitations. However, Upside handles side chains with presumptive modeling, which prevents the identification of potential novel residue interactions.

      (3) Due to the conformational heterogeneity present within IDE cryo-EM datasets, we struggle to obtain sufficient resolution to clearly identify side chain interactions at the domain interface (see response to 5).

      Given these challenges, we plan to explore these directions in future manuscripts.

      (7) What is the possibility of water interaction networks and dynamism in this network to contribute to the overall dynamics of the protein in the presence and absence of substrates? How symmetric would these networks be in the four domains of dimeric IDE?

      This is an interesting idea that we have begun to explore, but consider to be outside the scope of this work. Currently, we do not have any MD simulations containing substrate with explicit solvent (Upside uses implicit solvent), and solvent atoms were removed from our all-atom simulations prior to analysis to speed up processing. That being said, preliminary WAXS data suggests that there may be a difference in water interaction interfaces between WT and R668A IDE, and this is a lead we plan to pursue in future work.

      (8) Line 214: Please fix the typo which wrongly describes closed = pO. 

      This is not a typo, but it is confusing. The pO state has previously been defined as the closed state of IDE lacking bound substrate as determined by cryo-EM. This differentiates the pO state from the pC state, where the pC state contains density indicative of bound substrate. As the MD simulations were conducted with the apo-state, the closed state the simulations were initialized from was the pO state structure, which represents the substrate-free closed state as determined by cryo-EM. We realize that this difference is probably unnecessary to the majority of readers, and have removed the (pO) specificity to avoid confusion.

      (9) It is not clear why a cryo-EM structure was not attempted for the R668A mutant. If the authors have tried to generate such a structure, it should be mentioned in the manuscript. Such a structure should yield more information when compared to SAXS experiments.

      We have not attempted to obtain a cryo-EM structure for the R668A mutant. Our SAXS analysis suggests a transition from a dominant O/pO state to a dominant O/O state. The O/O state is known to exhibit the highest degree of conformational heterogeneity, which severely limits structural insights. We are working to better handle the sample preparation of IDE and perform such analysis without the need to use Fab. We plan to further characterize IDE R668A biochemically and potentially explore other mutations that would provide insights into how IDE works. Armed with that, we will perform the structural analysis of such IDE mutant(s).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The study conducted by the Schuldiner's group advances the understanding of mitochondrial biology through the utilization of their bi-genomic (BiG) split-GFP assay, which they had previously developed and reported. This research endeavors to consolidate the catalog of matrix and inner membrane mitochondrial proteins. In their approach, a genetic framework was employed wherein a GFP fragment (GFP1-10) is encoded within the mitochondrial genome. Subsequently, a collection of strains was created, with each strain expressing a distinct protein tagged with the GFP11 fragment. The reconstitution of GFP fluorescence occurs upon the import of the protein under examination into the mitochondria.

      We are grateful for the positive evaluation. We would like to clarify that the bi-genomic (BiG) split-GFP assay was developed by the labs of H. Becker and Roza Kucharzyk by highly laborious construction of the strain with mtDNA-encoded GFP<sub>1-10</sub> (Bader et al, 2020). 

      Strengths:

      Notably, this assay was executed under six distinct conditions, facilitating the visualization of approximately 400 mitochondrial proteins. Remarkably, 50 proteins were conclusively assigned to mitochondria for the first time through this methodology. The strains developed and the extensive dataset generated in this study serve as a valuable resource for the comprehensive study of mitochondrial biology. Specifically, it provides a list of 50 "eclipsed" proteins whose role in mitochondria remains to be characterized.

      Weaknesses:

      The work could include some functional studies of at least one of the newly identified 50 proteins.

      In response to this we have expanded the characterization of phenotypic effects resulting from changing the targeting signal and expression levels of the dually localized Gpp1 protein and expanded the data in Fig. 3, panels H and I.

      Reviewer #2 (Public Review):

      The authors addressed the question of how mitochondrial proteins that are dually localized or only to a minor fraction localized to mitochondria can be visualized on the whole genome scale. For this, they used an established and previously published method called BiG split-GFP, in which GFP strands 1-10 are encoded in the mitochondrial DNA and fused the GFP11 strand C-terminally to the yeast ORFs using the C-SWAT library. The generated library was imaged under different growth and stress conditions and yielded positive mitochondrial localization for approximately 400 proteins. The strength of this method is the detection of proteins that are dually localized with only a minor fraction within mitochondria, which so far has hampered their visualization due to strong fluorescent signals from other cellular localizations. The weakness of this method is that due to the localization of the GFP1-10 in the mitochondrial matrix, only matrix proteins and IM proteins with their C-termini facing the matrix can be detected. Also, proteins that are assembled into multimeric complexes (which will be the case for probably a high number of matrix and inner membrane-localized proteins) resulting in the C-terminal GFP11 being buried are likely not detected as positive hits in this approach. Taking these limitations into consideration, the authors provide a new library that can help in the identification of eclipsed protein distribution within mitochondria, thus further increasing our knowledge of the complete mitochondrial proteome. The approach of global tagging of the yeast genome is the logical consequence after the successful establishment of the BiG split-GFP for mitochondria. The authors also propose that their approach can be applied to investigate the topology of inner membrane proteins, however, for this, the inherent issue remains that it cannot be excluded that even the small GFP11 tag can impact on protein biogenesis and topology. Thus, the approach will not overcome the need to assess protein topology analysis via biochemical approaches on endogenous untagged proteins.

      Reviewer #3 (Public Review):

      Summary:

      Here, Bykov et al move the bi-genomic split-GFP system they previously established to the genomewide level in order to obtain a more comprehensive list of mitochondrial matrix and inner membrane proteins. In this very elegant split-GFP system, the longer GFP fragment, GFP1-10, is encoded in the mitochondrial genome and the shorter one, GFP11, is C-terminally attached to every protein encoded in the genome of yeast Saccharomyces cerevisiae. GFP fluorescence can therefore only be reconstituted if the C-terminus of the protein is present in the mitochondrial matrix, either as part of a soluble protein, a peripheral membrane protein, or an integral inner membrane protein. The system, combined with high-throughput fluorescence microscopy of yeast cells grown under six different conditions, enabled the authors to visualize ca. 400 mitochondrial proteins, 50 of which were not visualised before and 8 of which were not shown to be mitochondrial before. The system appears to be particularly well suited for analysis of dually localized proteins and could potentially be used to study sorting pathways of mitochondrial inner membrane proteins.

      Strengths:

      Many fluorescence-based genome-wide screens were previously performed in yeast and were central to revealing the subcellular location of a large fraction of yeast proteome. Nonetheless, these screens also showed that tagging with full-length fluorescent proteins (FP) can affect both the function and targeting of proteins. The strength of the system used in the current manuscript is that the shorter tag is beneficial for the detection of a number of proteins whose targeting and/or function is affected by tagging with full-length FPs.

      Furthermore, the system used here can nicely detect mitochondrial pools of dually localized proteins. It is especially useful when these pools are minor and their signals are therefore easily masked by the strong signals coming from the major, nonmitochondrial pools of the proteins.

      Weaknesses:

      My only concern is that the biological significance of the screen performed appears limited. The dataset obtained is largely in agreement with several previous proteomic screens but it is, unfortunately, not more comprehensive than them, rather the opposite. For proteins that were identified inside mitochondria for the first time here or were identified in an unexpected location within the organelle, it remains unclear whether these localizations represent some minor, missorted pools of proteins or are indeed functionally important fractions and/or productive translocation intermediates. The authors also allude to several potential applications of the system but do little to explore any of these directions.

      We agree with the reviewer that a single method may not be used for the construction of the complete protein inventory of an organelle or its sub-compartment. We suggest that the value of our assay is in providing a complementary view to the existing data and approaches. For example, we confirm the matrix localization of several proteins that were only found in the two proteomic data and never verified before (Vögtle et al, 2017; Morgenstern et al, 2017). Given that proteomics is a very sensitive technique and false positives are hard to completely exclude, our complementary verification is valuable.

      Reviewer #1 (Recommendations for the authors):

      In my opinion, the manuscript can be published as it is, and I would expect that future work will advance the functional properties of the newly found mitochondrial proteins.

      We thank the reviewer for their positive evaluation.

      Reviewer #2 (Recommendations for the authors):

      (1) Due to the localization of the GFP1-10 in the matrix, only matrix and IM proteins with C-termini facing the matrix can be detected, this should be added e.g. in the heading of the first results part and discussed earlier in the manuscript. In addition, the limitation that assembly into protein complexes will likely preclude detection of matrix and IM proteins needs to be discussed.

      To address the first point, we edited the title of the first section to only mention the visualization of the matrix-facing proteome and remove the words “inner membrane”. We also clarified early in the Results section that we only consider the matrix-facing C-termini by extending the sentence early in the results section “To compare our findings with published data, we created a unified list of 395 proteins that are observed with high confidence using our assay indicating that their C-terminus is positioned in the matrix (Fig. 2 – figure supplement 1B-D, Table S1).” (P. 6 Lines 1-3). Concluding the comparison with the earlier proteomic studies we also added the sentence “Many proteins are missing because their C-termini are facing the IMS” (P.8 Line 2). 

      To address the second point concerning the possible interference of the complex assembly and protein detection by our assay, we conducted an additional analysis. The analysis takes advantage of the protein complexes with known structures where we could estimate if the C-terminus with the GFP<sub>11</sub> tag would be available for GFP1-10 binding. We added the additional figure (Figure 3 – figure supplement 2) and following text in the Results section (P.7 Lines 22-34): 

      “To examine the influence of protein complex assembly on the performance of the BiG Mito-Split assay we analyzed the published structures of the mitoribosome and ATP synthase (Desai et al, 2017; Srivastava et al, 2018; Guo et al, 2017) and classified all proteins as either having C-termini in, or out of, the complex. There was no difference between the “in” and “out” groups in the percentage observed in the BiG Mito-Split collection (Fig. 3 – figure supplement 2A) suggesting that the majority of the GFP11-tagged proteins have a chance to interact with GFP1-10 before (or instead of) assembling into the complex. PCR and western blot verification of eight strains with the tagged complex subunits for which we observed no signal showed that mitoribosomal proteins were incorrectly tagged or not expressed, and the ATP synthase subunits Atp7, Atp19, and Atp20 were expressed (Fig. 3 – Supplement 2B). Atp19 and Atp20 have their C-termini most likely oriented towards the IMS (Guo et al, 2017) while Atp7 is completely in the matrix and may be the one example of a subunit whose assembly into a complex prevents its detection by the BiG Mito-Split assay.”

      We also consider related points on the interference of the tag and the influence of protein essentiality in the replies to points 3) and 12) of these reviews.

      (2) The imaging data is of high quality, but the manuscript would greatly benefit from additional analysis to support the claims or hypothesis brought forward by the authors. The idea that the nonmitochondrial proteins are imported due to their high sequence similarity to MTS could be easily addressed at least for some of these proteins via import studies, as also suggested by the authors.

      The idea that non-mitochondrial proteins may be imported into mitochondria due to occasional sequence similarity was recently demonstrated experimentally by (Oborská-Oplová et al, 2025). We incorporate this information in the Discussion section as follows (P. 14 Lines 10-16):

      “It was also recently shown that the r-protein uS5 (encoded by RPS2 in yeast) has a latent MTS that is masked by a special mitochondrial avoidance segment (MAS) preceding it (Oborská-Oplová et al, 2025). The removal of the MAS leads to import of uS5 into mitochondria killing the cells. The case of uS5 is an example of occasional similarity between an r-protein and an MTS caused by similar requirements of positive charges for rRNA binding and mitochondrial import. It remains unclear if other r-proteins have a MAS and if there are other mechanisms that protect mitochondria from translocation of cytosolic proteins.”

      We also conducted additional analysis to substantiate the claim that ribosomal (r)-proteins are similar in their physico-chemical properties to MTS-containing mitochondrial proteins. For this we chose not to use prediction algorithms like TargetP and MitoFates that were already trained on the same dataset of yeast proteins to discriminate cytosolic and mitochondrial localization. Instead, we extended the analysis earlier made by (Woellhaf et al, 2014) and calculated several different properties such as charge, hydrophobicity, hydrophobic moment and amino acid content for mitochondrial MTS-containing proteins, cytosolic non-ribosomal proteins, and r-proteins. The analysis showed striking similarity of r-proteins and mitochondrial proteins. We incorporate a new Figure 3 – figure supplement 3 and the following text in the Results section (P. 8 Lines 14-22): 

      “Five out of eight proteins are components of the cytosolic ribosome (r-proteins). In agreement with previous reports (Woellhaf et al, 2014) we find that their unique properties, such as charge, hydrophobicity and amino acid content, are indeed more similar to mitochondrial proteins than to cytosolic ones (Fig. 3 – figure supplement 3). Additional experiments with heterologous protein expression and in vitro import will be required to confirm the mitochondrial import and targeting mechanisms of these eight non-mitochondrial proteins. The data highlights that out of hundreds of very abundant proteins with high prediction scores only few are actually imported and highlights the importance of the mechanisms that help to avoid translocation of wrong proteins (Oborská-Oplová et al, 2025).”
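      The kind of sequence-property comparison described above can be sketched in a few lines. This is a minimal illustration and not the authors' pipeline: the use of the Kyte-Doolittle hydropathy scale, the 100° helical angle, the simplified charge count (K/R as +1, D/E as -1, ignoring histidine and termini), and the example peptide are all assumptions made here for illustration only.

```python
import math

# Kyte-Doolittle hydropathy values (assumed scale; the response does not
# state which hydrophobicity scale was used in the actual analysis).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def net_charge(seq: str) -> int:
    """Crude net charge: +1 for each K/R, -1 for each D/E."""
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


def mean_hydropathy(seq: str) -> float:
    """Average hydropathy over the whole sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def hydrophobic_moment(seq: str, angle_deg: float = 100.0) -> float:
    """Per-residue hydrophobic moment, assuming an ideal alpha-helix
    in which successive residues are rotated by angle_deg."""
    delta = math.radians(angle_deg)
    sin_sum = sum(KYTE_DOOLITTLE[aa] * math.sin(i * delta)
                  for i, aa in enumerate(seq))
    cos_sum = sum(KYTE_DOOLITTLE[aa] * math.cos(i * delta)
                  for i, aa in enumerate(seq))
    return math.hypot(sin_sum, cos_sum) / len(seq)


def composition(seq: str) -> dict:
    """Fraction of each amino acid present in the sequence."""
    return {aa: seq.count(aa) / len(seq) for aa in sorted(set(seq))}


if __name__ == "__main__":
    # Hypothetical N-terminal peptide, made up purely for this example.
    peptide = "MLSRAVKRLLSTARRSFA"
    print("net charge:", net_charge(peptide))
    print("mean hydropathy:", round(mean_hydropathy(peptide), 3))
    print("hydrophobic moment:", round(hydrophobic_moment(peptide), 3))
    print("composition:", composition(peptide))
```

      Applying such functions to the N-terminal segments of each protein group (MTS-containing, cytosolic non-ribosomal, and r-proteins) would give per-group distributions that can then be compared; the choice of window length is left open here because the response does not specify it.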

      To further prove the possibility of r-protein import into mitochondria we aimed to clone the r-proteins identified in this work for cell-free expression and import into purified mitochondria. Despite the large effort, we have succeeded in cloning and efficiently expressing only Rpl23a (Author response image 1A). Rpl23a indeed forms proteinase-protected fractions in a membrane potential-dependent manner when incubated with mitochondria. The inverse import dynamics of Rpl23a could be indicative either of quick degradation inside mitochondria or of background signal during the import experiments (Author response image 1A). To address the r-protein degradation possibility, we measured how the GFP signal changes in the BiG Mito-Split diploid collection strains after blocking cytosolic translation with cycloheximide (CHX). For this we selected Rpl12a, which had one of the highest signals. We did not detect any drop in fluorescence signal for Rpl12a and the control protein Mrpl6 (Author response image 1B). This might indicate the lack of degradation, or the degradation of the whole protein except GFP<sub>11</sub> that remains connected to GFP<sub>1-10</sub>. Due to time constraints we could not perform all experiments for the whole set of potentially imported r-proteins. Since more experiments are required to clearly show the mechanisms of mitochondrial r-protein import, degradation, and toxicity, or possible moonlighting functions (such as import into mitochondria derived from pim1∆ strain, degradation assays, fractionations, and analyses with antibodies for native proteins), we decided not to include this new data in the manuscript itself.

      Author response image 1.

      The import of r-proteins into mitochondria and their stability. (A) Rpl23a was synthesized in vitro (Input), radiolabeled, and imported into mitochondria isolated from BY4741 strain as described before (Peleh et al, 2015); the import was performed for 5, 10, or 15 minutes and mitochondria were treated with proteinase K (PK) to degrade non-imported proteins; some reactions were treated with the mix of valinomycin, antimycin, and oligomycin (VAO) to dissipate mitochondrial membrane potential; the proteins were visualized by SDS-PAGE and autoradiography. (B) The strains from the diploid BiG Mito-Split collection were grown in YPD to mid-logarithmic growth phase, then CHX was added to block translation and cell aliquots were taken from the culture and analyzed by fluorescence microscopy at the indicated time points. Scale bar is 5 µm.

      (3) The claim that the approach can be used to assess the topology of inner membrane proteins is problematic as the C-terminal tag can alter the biogenesis pathway of the protein or impact on the translocation dynamics (in particular as the imaging method applied here does not allow for analysis of dynamics). The hypothesis that the biogenesis route can be monitored is therefore far-reaching. To strengthen the hypothesis the authors should assess if the C-terminal GFP11 influences protein solubility by assessing protein aggregation of e.g. Rip1.

      We agree with the reviewer that the tag and assembly of GFP<sub>1-10/11</sub> can further complicate the assessment of topology of the IM proteins that already have complex biogenesis routes (lateral transfer, conservative, and a Rip1-specific Bcs1 pathway). To emphasize that the assessment of the steady state topology needs to be backed up by additional biochemical approaches, we edited the beginning of the corresponding Results sections as follows (P. 11 Lines 2-6): 

      “Studying membrane protein biogenesis requires an accurate way to determine topology in vivo. The mitochondrial IM is one of the most protein-rich membranes in the cell supporting a wide variety of TMD topologies with complex biogenesis pathways. We aimed to find out if our BiG Mito-Split collection can accurately visualize the steady-state localization of membrane protein C-termini protruding into the matrix or trap protein transport intermediates” (inserted text is underlined).

      The collection that we studied by microscopy is diploid and contains one WT copy of each 3xGFP<sub>11</sub>-tagged gene. To assess the influence of the tag on the protein function we performed growth assays with haploid strains which have one 3xGFP<sub>11</sub>-tagged gene copy and no GFP<sub>1-10</sub>. We find that Rip1-3xGFP<sub>11</sub> displays slower growth on glycerol at 30˚C and even slower at 37˚C while tagged Qcr8, Qcr9, and Qcr10 grow normally (Author response image 2A). Based on the growth assays and microscopy it is not possible to conclude whether the “Qcr” proteins’ biogenesis is affected by the tag. It may be that laterally sorted proteins are functional with the tag and constitute the majority while only a small portion is translocated into the matrix, trapped and visualized with GFP<sub>1-10</sub>. In the case of Rip1 it was shown that a C-terminal tag can affect its interaction with the chaperone Mzm1 and promote Rip1 aggregation (Cui et al, 2012). The extent of Rip1 function disruption can vary and depends on the tag. We hypothesize that our split-assay may trap the pre-translocation intermediate of Rip1 and can be helpful to study its interactors. To test this, we performed anti-GFP immunoprecipitation (IP) using GFP-Trap beads (Author response image 2B).

      Author response image 2.

      The influence of 3xGFP<sub>11</sub> on the function and processing of the inner membrane proteins. (A) Drop dilution assays with haploid strains from C-SWAT 3xGFP<sub>11</sub> library on fermentative (YPD) and respiratory (YPGlycerol) media at different temperatures. (B) Immunoprecipitation with GFP-Trap agarose was performed on haploid strain that has only Rip1-3xGFP<sub>11</sub> and on the diploid strain derived from this haploid mated with BiG Mito-Split strain containing mtGFP<sub>1-10</sub> and WT untagged Rip1 using the lysis (1% TX-100) and washing protocols provided by the manufacturer; the total (T) and eluted with the Laemmli buffer (IP) samples were analyzed by immunoblotting with polyclonal rabbit antibodies against GFP (only visualizes GFP<sub>11</sub> in these samples) and Rip1 (visualizes both tagged and WT Rip1). Polyclonal home-made rabbit antisera for GFP and Rip1 were kindly provided by Johannes Herrmann (Kaiserslautern) and Thomas Becker (Bonn); the antisera were diluted 1:500 for decorating the membranes.

      We find that the haploid strain with Rip1-3xGFP<sub>11</sub> contains not only mature (m) and intermediate (i) forms but also an additional higher Mw band that we interpreted as precursor that was not cleaved by MPP. WT Rip1 in the diploid added two more lower Mw bands: (m) and (i) forms of the untagged Rip1. IP successfully enriched the GFP<sub>1-10</sub> fragment as visualized by anti-GFP staining. Interestingly, only the highest Mw Rip1-3xGFP<sub>11</sub> band was also enriched when anti-Rip1 antibodies were used to analyze the samples. This suggests that the Rip1 precursor gets completely imported, interacts with GFP<sub>1-10</sub>, and can be pulled down. It is, however, not processed. Processed Rip1 does not interact with GFP<sub>1-10</sub>. Based on the literature we expect all Rip1 in the matrix to be cleaved by MPP, including the one interacting with GFP. Due to this discrepancy, we did not include this data in the manuscript. It is, however, clear that the assay may be useful to analyze biogenesis intermediates of the IM and matrix proteins. To emphasize this, we added information on the C-terminal tagging of Rip1 in the Results section (P. 11 Lines 18-20):

      “It was shown that a C-terminal tag on Rip1 can prevent its interaction with the chaperone Mzm1 and promote aggregation in the matrix (Cui et al, 2012). It is also possible that our assay visualizes this trapped biogenesis intermediate.”

      We also added a note on biogenesis intermediates in the Discussion (P. 14 Line 36 onwards): 

      “It is possible that the proteins with C-termini that are translocated into the IMS from the matrix side can be trapped by the interaction with GFP<sub>1-10</sub>. In that case, our assay can be a useful tool to study these pre-translocation intermediates.”

      (4) The hypothesis that the method can reveal new substrates for Bcs1 is interesting, and it would strongly increase the relevance for the scientific community if this would be directly tested, e.g. by deleting BCS1 and testing if more IM proteins are then detected by interaction with the matrix GFP1-10.

      We attempted to move the BiG Mito-Split assay into haploid strains where BCS1 and other factors can be deleted; however, this was not successful. Since this was a big effort (we cloned 10 potential substrate proteins but none of them were expressed) we decided not to pursue this further.

      (5) The screening of six different growth conditions reflects the strength of the high-throughput imaging readout. However, the interpretation of the data and additional follow-up on this is rather short and would be a nice addition to the present manuscript. In addition, one wonders, what was the rationale behind these six conditions (e.g. DTT treatment)? The direct metabolic shift from fermentation to respiration to boost mitochondrial biogenesis would be a highly interesting condition and the authors should consider adding this in the present manuscript.

      We agree with the reviewer that the analysis of different conditions is a strength of this work. However, we did not reveal any clear protein groups with strong conditional import and thus it was hard to select a follow-up candidate. The selection of conditions was partially driven by the technical possibilities: the media change is challenging on the robotic system; heat shock conditions make microscope autofocus unstable; library strain growth on synthetic respiratory media is very slow and the media cannot be substituted with rich media due to its autofluorescence. However, the usage of the spinning disc confocal microscope allowed us to screen directly in synthetic oleate media which has a lot of background on widefield systems due to oil micelles. We extended the explanation of condition choice as follows (P. 4 Line 34 onwards): 

      “The diploid BiG Mito-Split collection was imaged in six conditions representing various carbon sources and a diversity of stressors the cells can adapt to: logarithmic growth on glucose as a control carbon source and oleic acid as a poorly studied carbon source; post-diauxic (stationary) phase after growth on glucose where mitochondria are more active, and inorganic phosphate (Pi) depletion that was recently described to enhance mitochondrial membrane potential (Ouyang et al, 2024); as stress conditions we chose growth on glucose in the presence of 1 mM dithiothreitol (DTT) that might interfere with the disulfide relay system in the IMS, and nitrogen starvation as a condition that may boost biosynthetic functions of mitochondria. DTT and nitrogen starvation were earlier used for a screen with the regular C’-GFP collection (Breker et al, 2013). Another important consideration for selecting the conditions was the technical feasibility to implement them on automated screening setups.”

      Reviewer #3 (Recommendations for the authors):

      (6) This is a very elegant and clearly written study. As mentioned above, my only concern is that the biological significance of the obtained data, at this stage, is rather limited. It would have been nice if the authors explored one of the potential applications of the system they propose. For example, it should be relatively easy to analyze whether Cox26, Qcr8, Qcr9, or Qcr10 are new substrates of Bcs1, as the authors speculate.

      We thank the reviewer for their positive feedback. We addressed the biological application of the screen by including new data on metabolite concentrations in the strains where the Gpp1 N-terminus was mutated, leading to loss of the mitochondrial form. We added panels H and I to Figure 4, the new Supplementary Table S2 and appended the description of these results at the end of the third Results subsection (P. 10 Lines 19-35). Our data now show a role for the mitochondrial fraction of Gpp1 which adds mechanistic insight into this dually localized protein.

      We also were interested in the applications of our system to the study of mitochondrial import. However, the study of Cox26, Qcr8, Qcr9, and Qcr10 was not successful (also related to point 4, Reviewer #2). We thus decided to investigate the import mechanisms of the poorly studied dually localized proteins Arc1, Fol3, and Hom6 (related to Figure 4 of the original manuscript). To this end, we expressed these proteins in vitro, radiolabeled, and performed import assays with purified mitochondria. Arc1 was not imported, Fol3 and Hom6 gave inconclusive results (Author response image 3). Since it is known that even some genuine fully or dually localized mitochondrial proteins such as Fum1 cannot be imported in vitro post-translationally (Knox et al, 1998), we cannot draw conclusions from these experiments and left them out of the revised manuscript. Additional investigation is required to clarify if there exist special cytosolic mechanisms for the import of these proteins that were not reconstituted in vitro such as co-translational import.

      Author response image 3.

      In vitro import of poorly studied dually localized proteins. Arc1, Fol3, and Hom6 were cloned into pGEM4 plasmid, synthesized in vitro (Input), radiolabeled, and imported into mitochondria isolated from BY4741 strain as described before (Peleh et al, 2015); the import was performed for 5, 10, or 15 minutes and mitochondria were treated with proteinase K (PK) to degrade non-imported proteins; some reactions were treated with the mix of valinomycin, antimycin, and oligomycin (VAO) to dissipate mitochondrial membrane potential. The proteins were separated by SDS-PAGE and visualized by autoradiography.

      Minor comments:

      (7) It is unclear why the authors used the six growth conditions they used, and why for example a nonfermentable medium was not included at all.

      We address this shortcoming in the reply to the previous point 5 (Reviewer #2).

      (8) Page 2, line 17 - "Its" should be corrected to "its".

      This was corrected.

      (9) Page 2, line 25 to the end of the paragraph - the authors refer to the TIM complex when actually the TIM23 complex is probably meant. Also, it would be clearer if the TIM22 complex was introduced as well, especially in the context of the sentence stating that "the IM is a major protein delivery destination in mitochondria".

      This was corrected.

      (10) Page 5, line 35 - "who´s" should be corrected to "whose".

      This was corrected.

      (11) Page 9, line 5 - "," after Gpp1 should probably be "and".

      This was corrected.

      (12) Page 11 - the authors discuss in several places the possible effects of tags and how they may interfere with "expression, stability and targeting of proteins". Protein function may also be dramatically affected by tags - a quick look into the dataset shows that several mitochondrial matrix and inner membrane proteins that are essential for cell viability were not identified in the screen, likely because their function is impaired.

      We agree with the reviewer that the influence of tags needs to be carefully evaluated. This is not always possible in the context of whole-genome screens. Sometimes, yeast collections (and proteomic datasets) can miss well-known mitochondrial residents without a clear reason. To address this important point, we conducted an additional analysis to look specifically at the essential proteins. We indeed found that several of the mitochondrial proteins that are essential for viability were absent from the collection at the start, but for those present, their essentiality did not affect the likelihood of detection in our assay. To describe the analysis, we added the following text and Fig. 3 – figure supplement 2. Results now read (P.7 Lines 8-21):

      “Next, we checked the two categories of proteins likely to give biased results in high-throughput screens of tagged collections: proteins essential for viability, and molecular complex subunits. To look at the first category we split the proteomic dataset of soluble matrix proteins (Vögtle et al. 2017) into essential and non-essential ones according to the annotations in the Saccharomyces Genome Database (SGD) (Wong et al, 2023). We found that there was no significant difference in the proportion of detected proteins between the two groups (17 and 20 %, respectively), despite essential proteins being less represented in the initial library (Fig. 3 – figure supplement 2A). Of the three essential proteins of the Vögtle et al. (2017) dataset for which the strains were present in our library but showed no signal, two were nucleoporins Nup57 and Nup116, and one was a genuine mitochondrial protein Ssc1. Polymerase chain reaction (PCR) and western blot verification showed that the Ssc1 strain was incorrect (Fig. 3 – figure supplement 2B). We conclude that essential proteins are more likely to be absent or improperly tagged in the original C’-SWAT collection, but the essentiality does not affect the results of the BiG Mito-Split assay.”

      Discussion (P. 13 Lines 23-26): 

      “We did not find that protein complex components or essential proteins are more likely to be false negatives. However, some essential proteins were absent from the collection to start with (Fig. 3 – figure supplement 2A). Thus, a small tag allows visualization of even complex proteins.”

      From our data it is difficult to estimate the effect of tagging on protein function. We also addressed the effect of tagging Rip1 and performed growth assays on the tagged small “Qcr proteins” in the reply to point 3 (Reviewer #2). It is also difficult to estimate the effect of GFP<sub>1-10</sub> and GFP<sub>11</sub> complex assembly on protein function, since the presence of a functional, unassembled GFP<sub>11</sub>-tagged pool cannot be ruled out in our assay.

      Other changes

      Figure and table numbers changed after new data additions.

      A sentence added in the abstract to highlight the additional experiments on Gpp1 function: “We use structure-function analysis to characterize the dually localized protein Gpp1, revealing an upstream start codon that generates a mitochondrial targeting signal and explore its unique function.”

      The reference to the PCR verification (Fig. 3 – Supplement 2B) of correct tagging of Ycr102c was added to the Results section (P.8 Line 6), western blot verification added on.

      Added the Key Resources Table at the beginning of the Methods section.

      Small grammar edits, see tracked changes.

      References:

      Bader G, Enkler L, Araiso Y, Hemmerle M, Binko K, Baranowska E, De Craene J-O, Ruer-Laventie J, Pieters J, Tribouillard-Tanvier D, et al (2020) Assigning mitochondrial localization of dual localized proteins using a yeast Bi-Genomic Mitochondrial-Split-GFP. eLife 9: e56649

      Cui T-Z, Smith PM, Fox JL, Khalimonchuk O & Winge DR (2012) Late-Stage Maturation of the Rieske Fe/S Protein: Mzm1 Stabilizes Rip1 but Does Not Facilitate Its Translocation by the AAA ATPase Bcs1. Mol Cell Biol 32: 4400–4409

      Desai N, Brown A, Amunts A & Ramakrishnan V (2017) The structure of the yeast mitochondrial ribosome. Science 355: 528–531

      Guo H, Bueler SA & Rubinstein JL (2017) Atomic model for the dimeric FO region of mitochondrial ATP synthase. Science 358: 936–940

      Knox C, Sass E, Neupert W & Pines O (1998) Import into Mitochondria, Folding and Retrograde Movement of Fumarase in Yeast. J Biol Chem 273: 25587–25593

      Morgenstern M, Stiller SB, Lübbert P, Peikert CD, Dannenmaier S, Drepper F, Weill U, Höß P, Feuerstein R, Gebert M, et al (2017) Definition of a High-Confidence Mitochondrial Proteome at Quantitative Scale. Cell Rep 19: 2836–2852

      Oborská-Oplová M, Geiger AG, Michel E, Klingauf-Nerurkar P, Dennerlein S, Bykov YS, Amodeo S, Schneider A, Schuldiner M, Rehling P, et al (2025) An avoidance segment resolves a lethal nuclear–mitochondrial targeting conflict during ribosome assembly. Nat Cell Biol 27: 336–346

      Peleh V, Ramesh A & Herrmann JM (2015) Import of Proteins into Isolated Yeast Mitochondria. In Membrane Trafficking: Second Edition, Tang BL (ed) pp 37–50. New York, NY: Springer

      Srivastava AP, Luo M, Zhou W, Symersky J, Bai D, Chambers MG, Faraldo-Gómez JD, Liao M & Mueller DM (2018) High-resolution cryo-EM analysis of the yeast ATP synthase in a lipid membrane. Science 360: eaas9699

      Vögtle F-N, Burkhart JM, Gonczarowska-Jorge H, Kücükköse C, Taskin AA, Kopczynski D, Ahrends R, Mossmann D, Sickmann A, Zahedi RP, et al (2017) Landscape of submitochondrial protein distribution. Nat Commun 8: 290

      Woellhaf MW, Hansen KG, Garth C & Herrmann JM (2014) Import of ribosomal proteins into yeast mitochondria. Biochem Cell Biol 92: 489–498

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review):

      Summary:

      This work shows that a specific adenosine deaminase protein in Dictyostelium generates the ammonia that is required for tip formation during Dictyostelium development. Cells with an insertion in the ADGF gene aggregate but do not form tips. A remarkable result, shown in several different ways, is that the ADGF mutant can be rescued by exposing the mutant to ammonia gas. The authors also describe other phenotypes of the ADGF mutant such as increased mound size, altered cAMP signalling, and abnormal cell type differentiation. It appears that the ADGF mutant has defects in the expression of a large number of genes, resulting in not only the tip defect but also the mound size, cAMP signalling, and differentiation phenotypes.

      Strengths:

      The data and statistics are excellent.

      (1) Weaknesses: The key weakness is understanding why the cells bother to use a diffusible gas like ammonia as a signal to form a tip and continue development.

      Ammonia can come from a variety of sources both within and outside the cells, including dead cells. Ammonia, by increasing cAMP levels, triggers collective cell movement, thereby establishing a tip in Dictyostelium. A gaseous signal can act over long distances in a short time; for instance, ammonia promotes synchronous development in a colony of yeast cells (Palkova et al., 1997; Palkova and Forstova, 2000). The slug tip is known to release ammonia, probably favouring synchronized development of the entire colony of Dictyostelium. However, after the tips are established, ammonia exerts negative chemotaxis, probably helping the slugs to move away from each other and ensuring equal spacing of the fruiting bodies (Feit and Sollitto, 1987).

      It is well known that ammonia serves as a signalling molecule influencing both multicellular organization and differentiation in Dictyostelium (Francis, 1964; Bonner et al., 1989; Bradbury and Gross, 1989). Ammonia, by raising the pH of the intracellular acidic vesicles of prestalk cells (Poole and Ohkuma, 1981; Gross et al, 1983) and of the cytoplasm, is known to increase the speed of chemotaxing amoebae (Siegert and Weijer, 1989; Van Duijn and Inouye, 1991), inducing collective cell movement (Bonner et al., 1988, 1989) and favoring tipped mound development.

      Ammonia produced in millimolar concentrations during tip formation (Schindler and Sussman, 1977) could ward off other predators in soil. For instance, ammonia released by Streptomyces symbionts of leaf-cutting ants is known to inhibit fungal pathogens (Dhodary and Spiteller, 2021). Additionally, ammonia may be recycled back into amino acids, as observed during breast cancer proliferation (Spinelli et al., 2017). Such a process may also occur in starving Dictyostelium cells, supporting survival and differentiation. These findings suggest that ammonia acts as both a local and long-range regulatory signal, integrating environmental and cellular cues to coordinate multicellular development.

      (2) The rescue of the mutant by adding ammonia gas to the entire culture indicates that ammonia conveys no positional information within the mound.

      Ammonia reinforces or maintains the positional information by elevating cAMP levels, favoring prespore differentiation (Bradbury and Gross, 1989; Riley and Barclay, 1990; Hopper et al., 1993). Ammonia is known to influence rapid patterning of Dictyostelium cells confined in a restricted environment (Sawai et al., 2002). In adgf mutants, which have low ammonia levels, both neutral red staining (a marker for prestalk and ALCs) (Figure. S3) and the prestalk marker ecmA/ecmB expression (Figure. 7D) are higher than in the WT, and the mound arrest phenotype can be reversed by exposing the adgf mutant mounds to ammonia.

      Prestalk cells are enriched in acidic vesicles, and ammonia, by raising the pH of these vesicles and the cytoplasm (Davies et al 1993; Van Duijn and Inouye 1991), plays an active role in collective cell movement during tip formation (Bonner et al., 1989).

      (3) By the time the cells have formed a mound, the cells have been starving for several hours, and desperately need to form a fruiting body to disperse some of themselves as spores, and thus need to form a tip no matter what.

      Exposure of adgf mounds to ammonia led to tip development within 4 h (Figure. 5). In contrast, adgf controls remained at the mound stage for at least 30 h. This demonstrates that starvation alone is not the trigger for tip development and that ammonia promotes the transition from mound to tipped mound.

      Many mound arrest mutants are blocked in development and do not proceed to form fruiting bodies (Carrin et al., 1994). Further, not all the mound arrest mutants tested in this study were rescued by ADA enzyme (Figure. S4A), and they continue to stay as mounds.

      (4) One can envision that the local ammonia concentration is possibly informing the mound that some minimal number of cells are present (assuming that the ammonia concentration is proportional to the number of cells), but probably even a minuscule fruiting body would be preferable to the cells compared to a mound. This latter idea could be easily explored by examining the fate of the ADGF cells in the mound - do they all form spores? Do some form spores?

      Or perhaps the ADGF is secreted by only one cell type, and the resulting ammonia tells the mound that for some reason that cell type is not present in the mound, allowing some of the cells to transdifferentiate into the needed cell type. Thus, elucidating if all or some cells produce ADGF would greatly strengthen this puzzling story.

      A fraction of adgf mounds form bulkier spore heads by the end of 36 h as shown in Figure. 2H. This late recovery may be due to the expression of other ADA isoforms. Mixing WT and adgf mutant cell lines results in a chimeric slug with mutants occupying the prestalk region (Figure. 8) and suggests that WT ADGF favours prespore differentiation. However, it is not clear if ADGF is secreted by a particular cell type, as adenosine can be produced by both cell types, and the activity of three other intracellular ADAs may vary between the cell types. To address whether adgf expression is cell type-specific, prestalk and prespore cells will be separated by fluorescence activated cell sorter (FACS), and thereafter, adgf expression will be examined in each population.

      Reviewer #2 (Public review):

      Summary:

      The paper describes new insights into the role of adenosine deaminase-related growth factor (ADGF), an enzyme that catalyses the breakdown of adenosine into ammonia and inosine, in tip formation during Dictyostelium development. The ADGF null mutant has a pre-tip mound arrest phenotype, which can be rescued by the external addition of ammonia. Analysis suggests that the phenotype involves changes in cAMP signalling possibly involving a histidine kinase dhkD, but details remain to be resolved.

      Strengths:

      The generation of an ADGF mutant showed a strong mound arrest phenotype and successful rescue by external ammonia. Characterization of significant changes in cAMP signalling components, suggesting low cAMP signalling in the mutant and identification of the histidine kinase dhkD as a possible component of the transduction pathway. Identification of a change in cell type differentiation towards prestalk fate.

      (1) Weaknesses: Lack of details on the developmental time course of ADGF activity and cell type-specific differences in ADGF expression.

      adgf expression was examined at 0, 8, 12, and 16 h (Figure. 1), and the total ADA activity was assayed at 12 and 16 h (Figure. 3). Previously, the 12 h data were not included; they have now been added (Figure. 3A). The adgf expression was found to be highest at 16 h and hence, the ADA assay was carried out at that time point. Since the ADA assay will also report the activity of the other three isoforms, it will not exclusively reflect ADGF activity.

      Mixing WT and adgf mutant cell lines results in a chimeric slug with mutants occupying the prestalk region (Figure. 8) suggesting that WT adgf favours prespore differentiation. To address whether adgf expression is cell type-specific, prestalk and prespore cells will be separated by fluorescence activated cell sorter (FACS), and thereafter, adgf expression will be examined in each population.

      (2) The absence of measurements to show that ammonia addition to the null mutant can rescue the proposed defects in cAMP signalling.

      The adgf mutant in comparison to WT has diminished acaA expression (Fig. 6B) and reduced cAMP levels (Fig. 6A) both at 12 and 16 h of development. The cAMP levels were measured at 8 h and 12 h in the mutant.

      We would like to add that ammonia is known to increase cAMP levels (Riley and Barclay, 1990; Feit et al., 2001) in Dictyostelium. Exposure to ammonia increases acaA expression in WT (Figure. 7B) and is likely to increase acaA expression/ cAMP levels in the mutant also (Riley and Barclay, 1990; Feit et al., 2001) thereby rescuing the defects in cAMP signalling. Based on the comments, cAMP levels will also be measured in the mutant after the rescue with ammonia.

      (3) No direct measurements in the dhkD mutant to show that it acts upstream of adgf in the control of changes in cAMP signalling and tip formation.

      cAMP levels will be quantified in the dhkD mutant after treatment with ammonia. The histidine kinases dhkD and dhkC are reported to modulate phosphodiesterase RegA activity, thereby maintaining cAMP levels (Singleton et al., 1998; Singleton and Xiong, 2013). By activating RegA, dhkD ensures proper cAMP distribution within the mound, which is essential for the patterning of prestalk and prespore cells, as well as for tip formation (Singleton and Xiong, 2013). Therefore, ammonia exposure to dhkD mutants is likely to regulate cAMP signalling and thereby tip formation.

      Reviewer #1 (Recommendations for the authors):

      (1) Lines: 47,48 - "The gradient of these morphogens along the slug axis determines the cell fate, either as prestalk (pst) or as prespore (psp) cells." - many workers have shown that this is not true - intrinsic factors such as cell cycle phase drive cell fate.

      Thank you for pointing this out. We have removed the line and rephrased it as “Based on cell cycle phases, there exists a dichotomy of cell types that biases cell fate as prestalk or prespore (Weeks and Weijer, 1994; Jang and Gomer, 2011).”

      (2) Line 48 - PKA - please explain acronyms at first use.

      Corrected

      (3) Line 56 - The relationship between adenosine deaminase and ADGF is a bit unclear, please clarify this more.

      Adenosine deaminase (ADA) is intracellular, whereas adenosine deaminase related growth factor (ADGF) is an extracellular ADA and has a growth factor activity (Li and Aksoy, 2000; Iijima et al., 2008).

      (4) Figure 1 - where are these primers, and the bsr cassette, located with respect to the coding region start and stop sites?

      The primer sequences are mentioned in the supplementary table S2. The figure legend is updated to provide a detailed description.

      (5) Line 104 - 37.47% may be too many significant figures.

      Corrected

      (6) Line 123 - 1.003 Å may be too many significant figures.

      Corrected

      (7) Line 128 - Since the data are in the figure, you don't need to give the numbers, also too many significant figures.

      Corrected

      (8) Figure 3G - did the DCF also increase mound size? It sort of looks like it did.

      Yes, the addition of DCF increases the mound size (now Figure. 2G).

      (9) Figure 3I - the spore mass shown here for ADGF - looks like there are 3 stalks protruding from it; this can happen if a plate is handled roughly and the spore masses bang into each other and then merge

      Thank you for pointing this out. The figure 3I (now Figure. 2I) is replaced.

      (10) Lines 160-162 - since the data are in the figure, you don't need to give the numbers, also too many significant figures.

      Corrected.

      (11) Line 165 - ' ... that are involved in adenosine formation' needs a reference.

      Reference is included.

      (12) Line 205 - 'Addition of ADA to the CM of the mutant in one compartment.' - might clarify that the mutant is the ADGF mutant

      Yes, revised to 'Addition of ADA to the CM of the adgf mutant in one compartment.'

      (13) Lines 222-223 need a reference for caffeine acting as an adenosine antagonist.

      Reference is included.

      (14) Figure 8B - left - use a 0-4 or so scale so the bars are more visible.

      Thank you for the suggestion. The scale of the y-axis is adjusted to 0-4 in Figure. 7B to enhance the visibility of the bars.

      Reviewer #2 (Recommendations for the authors):

      The paper describes new insights into the role of ADGF, an enzyme that catalyses the breakdown of adenosine in ammonia and inosine, in tip formation in Dictyostelium development.

      A knockout of the gene results in a tipless mound stage arrest and the mounds formed are somewhat larger in size. Synergy experiments show that the effect of the mutation is non-cell autonomous and further experiments show that the mound arrest phenotype can be rescued by the provision of ammonia vapour. These observations are well documented. Furthermore, the paper contains a wide variety of experiments attempting to place the observed effects in known signalling pathways. It is suggested that ADGF may function downstream of DhkD, a histidine kinase previously implicated in ammonia signalling. Ammonia has long been described to affect different aspects, including differentiation of slug and culmination stages of Dictyostelium development, possibly through modulating cAMP signalling, but the exact mechanisms of action have not yet been resolved. The experiments reported here to resolve the mechanistic basis of the mutant phenotype need focusing and further work.

      (1) The paper needs streamlining and editing to concentrate on the main findings and implications.

      The manuscript will be revised extensively.

      Below is a list of some more specific comments and suggestions.

      (2) Introduction: Focus on what is relevant to understanding tip formation and the role of nucleotide metabolism and ammonia (see https://doi.org/10.1016/j.gde.2016.05.014). This could lead to the rationale for investigating ADGF.

      The manuscript will be revised extensively.

      (3) Lines 36-38 are not relevant. Lines 55-63 need shortening and to focus on ADGF, cellular localization, and substrate specificity.

      The manuscript will be revised accordingly. Lines 36-38 will be removed, and the lines 55-63 will be shortened.

      In humans, two isoforms of ADA are known including ADA1 and ADA2, and the Dictyostelium homolog of ADA2 is adenosine deaminase-related growth factor (ADGF). Unlike ADA that is intracellular, ADGF is extracellular and also has a growth factor activity (Li and Aksoy, 2000; Iijima et al., 2008). Loss-of-function mutations in ada2 are linked to lymphopenia, severe combined immunodeficiency (SCID) (Gaspar, 2010), and vascular inflammation due to accumulation of toxic metabolites like dATP (Notarangelo, 2016; Zhou et al., 2014).

      (4) Results: This section would benefit from better streamlining by a separation of results that provide more mechanistic insight from more peripheral observations.

      The manuscript will be revised and the peripheral observations (Figure. 2) will be shifted to the supplementary information.

      (5) Line 84 needs to start with a description of the goal, to produce a knockout.

      Details on the knockout will be elaborated in the revised manuscript. Line number 84 (now 75). Dictyostelium cell lines carrying mutations in the gene adgf were obtained from the genome wide Dictyostelium insertion (GWDI) bank and were subjected to further analysis to determine the role of adgf during Dictyostelium development.

      (6) Knockout data (Figure 1) can be simplified and combined with a description of the expression profile and phenotype Figure 3 F, G, and Figure 5. Higher magnification and better resolution photographs of the mutants would be desirable.

      Thank you. As suggested, the data will be simplified (section E will be removed) and combined with a description of the expression profile, and the phenotype images of Figure 3 F, G, and Figure 5 (now Figure. 2 F, G, and Figure. 4) will be replaced with images of better resolution.

      (7) It would also be relevant to know which cells actually express ADGF during development, using in-situ hybridisation or promoter-reporter constructs.

      To address whether adgf expression is cell type-specific, prestalk and prespore cells will be separated by fluorescence activated cell sorter (FACS), and thereafter, adgf expression will be examined in each population.

      (8) Figure 2 - Information is less directly relevant to the topic of the paper and can be omitted (or possibly in Supplementary Materials).

      Figure. 2 will be moved to supplementary materials.

      (9) Figures 4A, B - It is shown that as could be expected ada activity is somewhat reduced and adenosine levels are slightly elevated. However, the fact that ada levels are low at 16hrs could just imply that differentiation of the ADGF- cells is blocked/delayed at an earlier time point. To interpret these data, it would be necessary to see an ada activity and adenosine time course comparison of wt and mutant, or to see that expression is regulated in a celltype specific manner that could explain this (see above). It would be good to combine this with the observation that ammonia levels are lower in the ADGF- mutant than wildtype and that the mutant phenotype, mound arrest can be rescued by an external supply of ammonia (Figure 6).

      In Dictyostelium, four isoforms of ADA including ADGF are present, and thus the time course of total ADA activity will also report the function of other isoforms. Further, a number of pathways generate adenosine (Dunwiddie et al., 1997; Boison and Yegutkin, 2019). ADGF expression was examined at 0, 8, 12 and 16 h (Fig 1) and the ADA activity was assayed at 12 h, the time point where the expression gradually increases and reaches a peak at 16 h. Earlier, we had not shown the 12 h activity data, which will be included in the revised version. ADGF expression was found to be highly elevated at 16 h and adenosine/ammonia levels were measured at the two points indicated in the mutant.

      (10) Panel 4C could be combined with other measurements trying to arrive at more insight in the mechanisms by which ammonia controls tip formation.

      Panel 4C (now 3C) illustrates the genes involved in the conversion of cAMP to adenosine. Since Figure. 3 focuses on adenosine levels and ADA activity in both WT and adgf mutants, we have retained Panel 3C in Figure. 3, for its relevance to the experiment.

      (11) There is a large variety of experiments attempting to link the mutant phenotype and its rescue by ammonia to cAMP signalling, however, the data do not yet provide a clear answer.

      It is well known that ammonia increases cAMP levels (Riley and Barclay, 1990; Feit et al., 2001) and adenylate cyclase activity (Cotter et al., 1999) in D. discoideum, and exposure to ammonia increases acaA expression (Fig 7B) suggesting that ammonia regulates cAMP signaling. To address the concerns, cAMP levels will be quantified in the mutant after ammonia treatment.

      (12) The mutant is shown to have lower cAMP levels at the mound stage which ties in with low levels of acaA expression (Figures 7A and B), also various phosphodiesterases, the extracellular phosphodiesterase pdsa and the intracellular phosphodiesterase regA show increased expression. Suggesting a functional role for cAMP signalling is that the addition of di cGMP, a known activator of acaA, can also rescue the mound phenotype (Figure 7E). There appears to be a partial rescue of the mound arrest phenotype level by the addition of 8Br-cAMP (fig 7C), suggesting that intracellular cAMP levels rather than extracellular cAMP signalling can rescue some of the defects in the ADGF- mutant. Better images and a time course would be helpful.

      The relevant images will be replaced and a developmental time course after 8-Br-cAMP treatment will be included in the revised manuscript (Figure. 6D).

      (13) There is also the somewhat surprising observation that low levels of caffeine, an inhibitor of acaA activation also rescues the phenotype (Figure 7F).

      With respect to caffeine action on cAMP levels, the reports are contradictory. Caffeine has been reported to increase adenylate cyclase expression, thereby increasing cAMP levels (Hagmann, 1986), whereas Alvarez-Curto et al. (2007) found that caffeine reduced intracellular cAMP levels in Dictyostelium. Caffeine, although a known inhibitor of ACA, is also known to inhibit PDEs (Nehlig et al., 1992; Rosenfeld et al., 2014). Therefore, if caffeine differentially affects ADA and PDE activity, it may potentially counterbalance the effects and rescue the phenotype.

      (14) The data attempting to assess cAMP wave propagation in mounds (Fig 7H) are of low quality and inconclusive in the absence of further analysis. It remains unresolved how this links to the rescue of the ADGF- phenotype by ammonia. There are no experiments that measure any of the effects in the mutant stimulated with ammonia or di-cGMP.

      The relevant images will be replaced (now Figure. 6H). Ammonia, by increasing acaA expression (Figure. 7B) and cAMP levels (Figure. 7C), may restore spiral wave propagation, thereby rescuing the mutant.

      (15) A possible way forward could also come from the observation that ammonia can rescue the wobbling mound arrest phenotype from the histidine kinase mutant dhkD null mutant, which has regA as its direct target, linking ammonia and cAMP signalling. This is in line with other work that had suggested that another histidine kinase, dhkC transduces an ammonia signal sensor to regA activation. A dhkC null mutant was reported to have a rapid development phenotype and skip slug migration (Dev. Biol. (1998) 203, 345). There is no direct evidence to show that dhkD acts upstream of ADGF and changes in cAMP signalling, for instance, measurements of changes in ADA activity in the mutant.

      cAMP levels will be quantified in the dhkD mutant after ammonia treatment and accordingly, the results will be revised.

      (16) The paper makes several further observations on the mutant. After 16 hrs of development the adgf- mutant shows increased expression of the prestalk cell markers ecmA and ecmB and reduced expression of the prespore marker pspA. In synergy experiments with a majority of wildtype, these cells will sort to the tip of the forming slug, showing that the differentiation defect is cell autonomous (Fig 9). This is interesting but needs further work to obtain more mechanistic insight into why a mutant with a strong tip/stalk differentiation tendency fails to make a tip. Here again, knowing which cells express ADGF would be helpful.

      The adgf mutant shows increased prestalk marker expression in the mound but does not form a tip. It is well known that several mound arrest mutants form differentiated cells but are blocked in development with no tips (Carrin et al., 1994). This is addressed in the discussions (539). To address whether adgf expression is cell type-specific, prestalk and prespore cells will be separated by fluorescence activated cell sorter (FACS), and thereafter, adgf expression will be examined in each population.

      (17) The observed large mound phenotype could as suggested possibly be explained by the low ctn, smlA, and high cadA and csA expression observed in the mutant (Figure 3). The expression of some of these genes (csA) is known to require extracellular cAMP signalling. The reported low level of acaA expression and high level of pdsA expression could suggest low levels of cAMP signalling, but there are no actual measurements of the dynamics of cAMP signalling in this mutant to confirm this.

      The acaA expression was examined at 8 and 12 h (Figure. 6B) and cAMP levels were measured at 12 and 16 h in the adgf mutants (Figure. 6A). Both acaA expression and cAMP levels were reduced, suggesting that cells expressing adgf regulate acaA expression and cAMP levels. This regulation, in turn, is likely to influence cAMP signaling and collective cell movement within mounds, ultimately driving tip development. Exposure to ammonia led to increased acaA expression (Figure. 7B) in WT. Based on the comments above, cAMP levels will be measured in the mutant before and after rescue with ammonia.

      (18) Furthermore, it would be useful to quantify whether ammonia addition to the mutant reverses mound size and restores any of the gene expression defects observed.

      Ammonia treatment, either soon after plating or six hours after plating, had no effect on the mound size (Figure. 5G).

      (19) There are many experimental data in the supplementary data that appear less relevant and could be omitted Figure S1, S3, S4, S7, S8, S9, S10.

      Figures S8, S9, and S10 are omitted. We would like to retain the other figures.

      Figure S1 (now Figure. S2): It is widely believed that ammonia comes from protein (White and Sussman, 1961; Hames and Ashworth, 1974; Schindler and Sussman, 1977) and RNA (Walsh and Wright, 1978) catabolism. Figure. S2 shows no significant difference in protein and RNA levels between WT and adgf mutant strains, suggesting that adenosine deaminase-related growth factor (ADGF) activity serves as a major source of ammonia and plays a crucial role in tip organizer development in Dictyostelium. Thus, it is important to retain this figure.

      Figure S3 (now Figure. S4): The figure shows the treatment of various mound arrest mutants and multiple tip mutants with ADA enzyme and DCF, respectively, to investigate the pathway through which adgf functions. Additionally, it includes the rescue of the histidine kinase mutant dhkD with ammonia, indicating that dhkD acts upstream of adgf via ammonia signalling. Therefore, it is important to retain this figure.

      Figure S4 (now Figure. S5): This figure represents the developmental phenotype of other deaminase mutants. Unlike adgf mutants, mutations in other deaminases do not result in complete mound arrest, despite some of these genes exhibiting strong expression during development. This underscores the critical role of adenosine deamination in tip formation. Therefore, let this figure be retained.

      Figure S7 (now Figure. S8): Figure S8 presents the transcriptomic profile of ADGF during gastrulation and pre-gastrulation stages across different organisms, indicating that ADA/ADGF is consistently expressed during gastrulation in several vertebrates (Pijuan-Sala et al., 2019; Tyser et al., 2021). Notably, the process of gastrulation in higher organisms shares remarkable similarities with collective cell movement within the Dictyostelium mound (Weijer, 2009), suggesting a previously overlooked role of ammonia in organizer development. This implies that ADA may play a fundamental role in regulating morphogenesis across species, including Dictyostelium and vertebrates. Therefore, we would like to retain this figure.

      (20) Given the current state of knowledge, speculation about the possible role of ADGF in organiser function in amniotes seems far-fetched. It is worth noting that the streak is not equivalent to the organiser. The discussion would benefit from limiting itself to the key results and implications.

      The discussion is revised accordingly by removing the speculative role of ADGF in organizer function in amniotes. The lines “It is likely that ADA plays a conserved, fundamental role in regulating morphogenesis in Dictyostelium and other organisms including vertebrates” have been removed.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Weaknesses:

      (1) Figure 10 outlines a mechanistic link between cyp17a2 and the sexual dimorphism the authors report for SVCV infection outcomes. The data presented on increased susceptibility of cyp17a2-/- mutant male zebrafish support this diagram, but this conclusion is fairly weak without additional experimentation in both males and females. The authors justify their decision to focus on males by stating that they wanted to avoid potential androgen-mediated phenotypes in the cyp17a2 mutant background (lines 152-156), but this appears to be speculation. It also doesn't preclude the possibility of testing the effects of increased cyp17a2 expression on viral infection in both males and females. This is of critical importance if the authors intend to focus the study on sexual dimorphism, which is how the introduction and discussion are currently structured.

      Thank you for your suggestion. We have revised the relevant statements in the introduction and discussion sections accordingly. The decision not to conduct cyp17a2 overexpression experiments in both male and female individuals was based primarily on two reasons. First, our laboratory currently lacks the technical capability to achieve cyp17a2 overexpression at the organismal level; existing methodologies are limited to gene knockout via CRISPR-Cas9. Second, even if overexpression were feasible, subsequent comparisons would need to be restricted within sexes (i.e., female vs. female controls or male vs. male controls) to eliminate potential confounding effects of sex hormones. Such experimental outcomes would only demonstrate the antiviral function of Cyp17a2 itself rather than directly elucidate mechanisms underlying sexual dimorphism, which diverges from the central objective of this study.

      We fully agree with your perspective and have accordingly refined relevant discussions in the revised manuscript. Our conclusions now emphasize that "cyp17a2 is one of the factors contributing to sex-based differences in antiviral immunity" rather than implying that it "solely mediates the entire phenotypic divergence." These modifications have been incorporated into the resubmitted version (Lines 112-115).    

      (2) The authors present data indicating an unexpected link between cyp17a2 and ubiquitination pathways. It is unclear how a CYP450 family member would carry out such activities, and this warrants much more attention. One brief paragraph in the discussion (starting at line 448) mentions previous implications of CYP450 proteins in antiviral immunity, but given that most of the data presented in the paper attempt to characterize cyp17a2 as a direct interactor of ubiquitination factors, more discussion in the text should be devoted to this topic. For example, are there any known domains in this protein that make sense in this context? Discussion of this interface is more relevant to the study than the general overview of sexual dimorphism that is currently highlighted in the discussion and throughout the text.

      We are grateful to the reviewer for their suggestion to elaborate on this novel finding. The discussion on this point has been expanded significantly (Lines 448-460). It is acknowledged that Cyp17a2 is devoid of the canonical domains that are typically associated with the ubiquitination machinery (e.g., RING, U-box). The present study proposes that the endoplasmic reticulum (ER) localization of Cyp17a2, in conjunction with its capacity to function as a scaffold protein, is of paramount significance. By residing in the ER, Cyp17a2 is strategically positioned to interact with key immune regulators such as STING, which also localizes to the ER. It is hypothesized that Cyp17a2 facilitates the recruitment of E3 ligases (btr32) and deubiquitinases (USP8) to their substrates (STING and SVCV P protein, respectively) by providing a platform for protein-protein interactions, rather than directly catalyzing ubiquitination. This noncanonical, scaffolding role for a cytochrome P450 (CYP450) enzyme represents an exciting evolutionary adaptation in teleost immunity.

      (3) Figures 2-9 contain information that could be streamlined to highlight the main points the authors hope to make through a combination of editing, removal, and movement to supplemental materials. There is a consistent lack of clarity in these figures that could be improved by supplementing them with more text to accompany the supplemental figures. Using Figure 2 as an example, panel (A) could be removed as unnecessary, panel (B) could be exchanged for a volcano plot with examples highlighting why cyp17a2 was selected for further study and also the full dataset could be shared in a supplemental table, panel (C) could be modified to indicate why that particular subset was chosen for plotting along with an explanation of the scaling, panel (D) could be moved to supplemental because the point is redundant with panels (A) and (C), panel (E) could be presented as a heatmap, in panels (G) and (H) data from EPC cells could be moved to supplemental because it is not central to the phenotype under investigation, panels (J) to (L) and (N) to (P) could be moved to supplemental because they are redundant with the main points made in panels (M) and (Q). Similar considerations could be made with Figures 3-9.

      We thank the reviewer for these excellent suggestions to improve the clarity and focus of our figures. A comprehensive review of all figures has been conducted in accordance with the recommendations made. Figure 2A has been removed. Figure 2B (revised Figure 2A) has been replaced with a volcano plot highlighting cyp17a2 and the full dataset has been provided as supplementary Table S2. Figure 2C (revised Figure 2B) is now a heatmap with eight sex-related genes and an explanation of the scaling has been added to the revised figure legends. Several panels (D, G, H, J-L, N-P) have been moved to the supplementary information (now Figure S1). Figure 2E has been presented as a heatmap. The same approach to streamlining has been applied to Figures 3-9, with confirmatory or secondary data being moved to supplements in order to better emphasize the main conclusions. The figure legends and main text have been updated accordingly.

      (4) The data in Figure 3 (A)-(C) do not seem to match the description in the text. That is, the authors state that cyp17a2 overexpression increases interferon signaling activity in cells, but the figure shows higher increases in vector controls. Additionally, the data in panel (H) are not described. What genes were selected and why, and where are the data on the rest of the genes from this analysis? This should be shared in a supplemental table.

      We apologize for the lack of clarity. In Figures 3A-C, the vector control shows baseline activation due to the stimulants (poly I:C/SVCV), but the fold-increase is significantly greater in the Cyp17a2-overexpressing groups. We have re-plotted the data to more clearly represent the stimulant-induced activation over baseline and added statistical comparisons between the Vector and Cyp17a2 groups under each condition to highlight the enhancing effect of Cyp17a2. For Figure 3H (revised Figure 3F), the heatmap shows a curated set of IFN-stimulated genes (ISGs) most significantly regulated by Cyp17a2 based on our RNA-seq analysis. We have added a description in the revised figure legend and in the results section (Lines 837-840). The full list of differentially expressed genes from this analysis is now provided in Supplementary Table S3.

      (5) Some of the reagents described in the methods do not have cited support for the applications used in the study. For example, the antibody for TRIM11 (line 624, data in Figures 6 & 7) was generated for targeting the human protein. Validation for use of this reagent in zebrafish should be presented or cited. Furthermore, the accepted zebrafish nomenclature for this gene would be preferred throughout the text, which is bloodthirsty-related gene family, member 32.

      We thank the reviewer for raising this important point regarding reagent specificity. To address the concern about antibody validation in zebrafish, we performed the following verification steps. First, we aligned the antigenic sequence targeted by the ABclonal btr32 antibody (ABclonal, A13887) with orthologous sequences from zebrafish, which showed 45% protein sequence similarity (Author response image 1). More importantly, we conducted experimental validation by expressing Myc-tagged btr32 in EPC cells. Both the anti-Myc and the anti-btr32 antibodies detected a protein band at the same molecular weight. Furthermore, when a btr32-specific knockdown plasmid was introduced, the band recognized by the anti-btr32 antibody was significantly reduced (Author response image 2). These results support the specificity of the antibody in recognizing fish btr32. In accordance with the reviewer’s suggestion, we have also updated the gene nomenclature to “bloodthirsty-related gene family, member 32 (btr32)” throughout the manuscript.

      Author response image 1.

      Author response image 2.

      Reviewer #2 (Public review):

      Weaknesses:

      (1) Colocalization analyses (Figures 4G, 6I, 9D) require quantitative metrics (e.g., Pearson's coefficients) rather than representative images alone.

      We concur with the reviewer's assessment. We have now performed quantitative colocalization analysis (Pearson's coefficients) for all indicated figures (4G, 6I, 9D). The quantitative results are now presented within the figures themselves and described in the revised figure legends.
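
      As an illustration only (this is not the authors' image-analysis pipeline), Pearson's coefficient for a pair of fluorescence channels could be computed along the following lines. The pixel intensities and the function name `pearson_coefficient` are hypothetical; a real analysis would typically restrict the calculation to a segmented region of interest and subtract background first.

```python
import math

def pearson_coefficient(channel_a, channel_b):
    """Pearson correlation between two equally sized lists of pixel intensities."""
    if len(channel_a) != len(channel_b) or len(channel_a) < 2:
        raise ValueError("channels must have the same length (at least 2 pixels)")
    n = len(channel_a)
    mean_a = sum(channel_a) / n
    mean_b = sum(channel_b) / n
    # Sum of products of deviations (covariance numerator).
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(channel_a, channel_b))
    var_a = sum((a - mean_a) ** 2 for a in channel_a)
    var_b = sum((b - mean_b) ** 2 for b in channel_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a channel with constant intensity has no defined correlation")
    return cov / math.sqrt(var_a * var_b)

# Hypothetical flattened pixel intensities from one region of interest.
green = [10, 50, 200, 180, 20, 90]
red_overlapping = [12, 55, 190, 170, 25, 95]   # signal follows the green channel
red_excluded = [200, 150, 10, 20, 190, 100]    # signal avoids the green channel

r_overlap = pearson_coefficient(green, red_overlapping)
r_excluded = pearson_coefficient(green, red_excluded)
```

      Values near +1 indicate that the two signals rise and fall together, values near -1 indicate mutual exclusion, and values near 0 indicate no linear relationship.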

      (2) Figure 1 survival curves need annotated statistical tests (e.g., "Log-rank test, p=X.XX")

      The survival curves have now been annotated with the specific p-values from the Log-rank (Mantel-Cox) test (see revised Figures 1A, 2E).
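
      For readers unfamiliar with the test named above, a minimal two-group log-rank (Mantel-Cox) calculation is sketched below. The survival times are invented for illustration and the helper name `logrank_test` is hypothetical; the p-values in the revised figures would come from the authors' own statistics software, not from this sketch.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank (Mantel-Cox) test; returns (chi_square, p_value).

    times_*  : follow-up time for each animal
    events_* : 1 if the animal died at that time, 0 if it was censored
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        at_risk_a = sum(1 for u, _, g in data if u >= t and g == 0)
        at_risk_b = sum(1 for u, _, g in data if u >= t and g == 1)
        n = at_risk_a + at_risk_b
        deaths_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        deaths = sum(1 for u, e, _ in data if u == t and e)
        observed_a += deaths_a
        # Expected deaths in group A if both groups shared one hazard.
        expected_a += deaths * at_risk_a / n
        if n > 1:  # hypergeometric variance is undefined for a single subject
            variance += (at_risk_a * at_risk_b * deaths * (n - deaths)) / (n * n * (n - 1))

    if variance == 0:
        return 0.0, 1.0
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value

# Hypothetical days-to-death for two groups of five fish, no censoring.
chi_square, p_value = logrank_test(
    [1, 2, 3, 4, 5], [1, 1, 1, 1, 1],
    [6, 7, 8, 9, 10], [1, 1, 1, 1, 1],
)
```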

      (3) Figure 2P GSEA should report exact FDR-adjusted *p*-values (not just "*p*<0.05").

      Figure 2P (revised Figure S1J) has been updated to include the exact FDR p-values for the presented GSEA plots.

      (4) Section 2 overextends on teleost sex-determination diversity; condensing it to emphasize relevance to immune dimorphism would strengthen narrative cohesion.

      The section on teleost sex-determination diversity in the Discussion (lines 357-365) has been condensed, with a more direct focus on how this diversity provides a unique context for studying immune dimorphism independent of canonical sex chromosomes, as exemplified by the zebrafish model.

      (5) Limited discussion on whether this mechanism extends beyond Cyprinidae and its implications for teleost adaptation.

      The discussion has been expanded (lines 375-386) to address the potential conservation of this mechanism. It is acknowledged that cyp17a2 is a teleost-specific gene, and it is hypothesized that its function in antiviral immunity may signify an adaptive innovation within this extensively diverse vertebrate group. It is suggested that further research in other teleost families will be essential to ascertain the broader evolutionary significance of the present findings.

      Reviewer #2 (Recommendations for the authors):

      (1) Expand the Discussion to address why teleosts may have evolved male-biased immunity. Consider: pathogen pressure differentials in aquatic vs. terrestrial environments; trade-offs between immune investment and reproductive strategies (e.g., male-male competition); comparative advantages in external fertilization systems.

      We have expanded the discussion (lines 412-430) to address the potential conservation of this mechanism. We note that Cyp17a2 is a teleost-specific gene and speculate that its role in antiviral immunity represents an adaptive innovation within this highly diverse group of vertebrates. We propose that future studies of other teleost families are crucial for determining the broader evolutionary significance of our findings.

    1. Author response:

      Reviewer #1 (Public Review):

      We thank the Reviewer for the favorable feedback. The major concern is the collateral degradation of GSPT1. As the Reviewer noted, IWR1-POMA was able to suppress colony formation in DLD-1 cells resistant to GSPT1/2 degrader, suggesting that TNKS but not GSPT degradation is responsible for growth inhibition.

      We also appreciate that the Reviewer brought to our attention an important early observation of the TNKS scaffolding effects. Cong reported in 2009 that overexpression of TNKS induced AXIN puncta formation in a SAM but not PARP domain-dependent manner (PMID 19759537). We will include this information in the revised manuscript.

      Reviewer #2 (Public Review):

      We thank the Reviewer for the encouraging and insightful comments. The major critique concerns whether TNKS degraders can suppress WNT/β-catenin signaling more effectively than TNKS inhibitors at endogenous TNKS levels. Fig. 1D shows that IWR1-POMA reduced the level of cytosolic β-catenin more effectively than IWR1 in Wnt3A-stimulated HEK293 cells without protein overexpression, and Fig. S7B shows that IWR1-POMA reduced STF signals more effectively than IWR1 in DLD-1 and SW480 cells with endogenous TNKS expression. We will corroborate these findings with additional cell lines during the revision.

      (1) We agree with the Reviewer that on-target toxicities pose challenges to the development of WNT inhibitors. For example, LGK974, which inhibits PORCN to prevent the secretion of all WNT proteins, showed significant on-target toxicity in humans (PMC10020809), and G007-LK, which inhibits TNKS to selectively block canonical WNT signaling, exhibited weak efficacy and dose-limiting toxicity at 5‒30 mg/kg BID or 10‒60 mg/kg QD in various mouse xenograft models (PMID: 23539443). Similarly, G-631, another TNKS inhibitor, also showed dose-limiting toxicity without significant efficacy at 25‒100 mg/kg QD in mice (PMID: 26692561). However, G007-LK was well-tolerated at 200 mg/kg QD over 3 weeks in mice in another study (PMC5759193). Treating mice with G007-LK at 10 mg/kg QD over 6 months also improved glucose tolerance without notable toxicity (PMID 26631215). Importantly, constitutive silencing of both TNKS for 150 days in APC-null mice prevented tumorigenesis without damaging the intestines (PMC6774804). Furthermore, basroparib, a selective TNKS inhibitor, was well tolerated in a recent clinical trial (PMC12498271). We are therefore cautiously optimistic that TNKS degraders will have an improved therapeutic index compared with TNKS inhibitors.

      (2) We agree with the Reviewer that Henderson's 2016 paper (PMC4773256) shed important light on the role of TNKS scaffolding in the DC. However, whereas this study demonstrated that knocking down both TNKS by siRNA prevented G007-LK from inducing AXIN puncta, the functional role of TNKS scaffolding in the DC remained unaddressed. We will include a more detailed description of Henderson's discovery.

      (3) Indeed, Guettler demonstrated that TNKS scaffolding could promote WNT/β-catenin signaling in 2016, which forms the basis of the current work. Meanwhile, whereas there have been efforts to target the SAM or ARC domain to address TNKS scaffolding, our approach of targeting TNKS for degradation is complementary. We will provide a more detailed discussion of these studies.

      (4) Biomolecular condensates are membraneless cellular compartments formed by phase separation of biomolecules, regardless of the physical/material properties (PMID: 28935776 and PMC7434221). Super-resolution microscopy studies by Peifer and Stenmark (PMC4568445 and PMID 26124443) showed that AXIN, APC, TNKS, and β-catenin interacted with each other to assemble into membraneless complexes, wherein AXIN and APC formed filaments throughout the DC. Peifer has also summarized evidence that supports the condensate nature of the DC (PMC6386181). However, we acknowledge that testing the physical properties of reconstituted DC (PMC8403986) will provide a better understanding of the nature, for example liquid vs. gel, of these condensates.

      (5) We will evaluate the ability of IWR1 and IWR1-POMA to engage TNKS.

      (6) We will modify Fig. 1F to improve clarity and readability.

      (7) Fig. S7B shows that IWR1-POMA suppressed WNT/β-catenin signaling more effectively than IWR1 in APC-mut DLD-1 and SW480 CRC cells without TNKS overexpression. Similarly, Fig. S6B shows that IWR1-POMA provided a deeper suppression of STF signals in HeLa cells transfected with AXIN1 and β-catenin while expressing endogenous TNKS. These results provide evidence that inhibitor-induced TNKS scaffolding plays a significant role at endogenous TNKS expression levels. Separately, we will reorganize the figures to better present Fig. 7C and D as suggested by the Reviewer.

      (8) We will rephrase "TNKS accumulation negatively impacts the catalytic activity of the DC".

      (9) We apologize for confusing β-catenin phosphorylation with β-catenin abundance. Here, we refer to the catalytic activity of the DC as the ability of the DC to promote β-catenin degradation rather than the kinetics of β-catenin phosphorylation and ubiquitination. It is commonly observed that AXIN stabilization by TNKS inhibitors increases the DC size and reduces the β-catenin levels. Peifer has also noted that APC can increase the size and the "effective activity" of the DC (PMC5912785 and PMC4568445). As such, the induction of AXIN puncta by TNKS inhibitors is frequently used as an indicator of WNT/β-catenin pathway inhibition. However, because the DC only primes β-catenin but does not catalyze its degradation, we will revise our manuscript to improve accuracy and clarity.

      (10) We will examine the effects of IWR1 and IWR1-POMA in additional cell lines, quantify the colony formation data, and reorganize the figures.

      (11) As discussed above, evidence for on-target toxicity of WNT/β-catenin inhibition is mixed. Yet, the observation of no dose-limiting toxicity for basroparib at doses up to 360 mg QD in humans (PMC12498271) is encouraging. A PROTAC works by catalyzing target degradation, which is different from traditional catalytic inhibitors that require continuous target occupancy at a high level. Because IWR1-POMA has a durable effect on TNKS, we expect that a fully optimized TNKS degrader may allow less frequent dosing than basroparib and consequently an even more favorable therapeutic window.

      (12/13) We will include quantification data, replicate information, and nuclei staining or cell outlines for the fluorescence microscopy experiments.

      (14) Cytosolic fractions of cells were prepared using a commercial cytoplasmic extraction kit following manufacturer's instructions. We will include detailed information in the revised manuscript.

      Reviewer #3 (Public Review):

      We thank the Reviewer for the helpful suggestions.

      (1) We will modify the title to include the PROTAC aspect.

      (2) As the Reviewer suggested, the bell-shaped dose response of the PROTAC originated from the formation of saturated binary complexes. At high PROTAC concentrations, binding of TNKS and CRBN/VHL by separate PROTAC molecules impedes the formation of productive ternary complexes, which results in reduced degradation efficacy and consequently the hook effect.
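
      The bell-shaped response described above can be reproduced with a deliberately simplified equilibrium sketch. It assumes no cooperativity, PROTAC in large excess over both proteins, and protein concentrations well below both dissociation constants, so that the ternary complex level is proportional to L / ((1 + L/Kd_target)(1 + L/Kd_ligase)). The dissociation constants below are hypothetical placeholders, not measured values for IWR1-POMA, TNKS, or CRBN/VHL.

```python
import math

def relative_ternary_complex(protac, kd_target, kd_ligase):
    """Relative target-PROTAC-ligase level in a non-cooperative model.

    At low PROTAC concentration the complex rises with dose; at high
    concentration both proteins are saturated by separate PROTAC molecules
    (binary complexes) and the ternary complex falls again.
    """
    return protac / ((1.0 + protac / kd_target) * (1.0 + protac / kd_ligase))

# Hypothetical dissociation constants in nM (illustrative only).
KD_TARGET = 50.0
KD_LIGASE = 2000.0

# Log-spaced PROTAC concentrations from 0.1 nM to 1 mM (ten points per decade).
concentrations = [10 ** (exponent / 10.0) for exponent in range(-10, 61)]
levels = [relative_ternary_complex(c, KD_TARGET, KD_LIGASE) for c in concentrations]

# In this model the optimum sits at the geometric mean of the two Kd values.
peak_concentration = concentrations[levels.index(max(levels))]
```

      In this toy model, pushing the dose beyond the geometric mean of the two dissociation constants lowers ternary complex formation, which is the hook effect in its simplest form. Cooperativity or limiting protein concentrations would shift and reshape the curve.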

      (3) The structure-activity relationship of PROTACs is often unpredictable, as both the kinetics and thermodynamics of the target and E3 ligase binding play crucial roles. The lack of translation in degradation efficacy from IWR1 to G007-LK derived PROTACs may originate from differences in the binding kinetics or subtle changes in the orientation of the linker exit vector. We will include data on G007-LK in the revised manuscript.

      (4) We will quantify the Western blots, immunofluorescence images, and colony formation data, and report the replicate information.

    1. Author response:

      Reviewer #1 (Public Review):

      Summary:

      The authors aim to demonstrate that GWAS summary statistics, previously considered safe for open sharing, can, under certain conditions, be used to recover individual-level genotypes when combined with large numbers of high-dimensional phenotypes. By reformulating the GWAS linear model as a system of linear programming constraints, they identify a critical phenotype-to-sample size ratio (R/N) above which genotype reconstruction becomes theoretically feasible.
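
      The intuition behind this result can be shown with a toy example. The sketch below is not the authors' linear programming formulation: it swaps in a plain linear solve and assumes that each released statistic is the unnormalized cross-product between a genotype vector and one phenotype, that phenotypes are independent and noiseless, and that the attacker holds every individual's phenotype values. All names (`solve_linear_system`, `summary_stats`, and so on) are hypothetical.

```python
import random

def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

rng = random.Random(0)
N = 8  # individuals
R = 8  # phenotypes; R == N makes this toy system square

# Hidden genotype dosages (0/1/2) for one variant that the attacker wants to recover.
true_genotypes = [rng.choice([0, 1, 2]) for _ in range(N)]

# phenotypes[r][i]: value of phenotype r for individual i (known to the attacker).
phenotypes = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(R)]

# One released statistic per phenotype: sum over i of genotype_i * phenotype_{r,i}.
summary_stats = [sum(g * y for g, y in zip(true_genotypes, row)) for row in phenotypes]

# Each statistic is one linear equation in the N unknown dosages.
recovered = [round(v) for v in solve_linear_system(phenotypes, summary_stats)]
```

      Every additional phenotype contributes one more linear equation in the same N unknown dosages, which is why the ratio R/N governs whether the genotypes are pinned down. Bounding each dosage to the range 0 to 2, as a linear programming formulation can, adds further constraints that this sketch does not exploit.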

      Strengths:

      There is conceptual originality and mathematical clarity. The authors establish a fundamental quantitative relationship between data dimensionality and privacy leakage and validate their theory through well-designed simulations and application to the GTEx dataset. The derivation is rigorous, the implementation reproducible, and the work provides a formal framework for assessing privacy risks in genomic research.

      We thank the reviewer for the positive assessment of our work’s conceptual originality, mathematical rigor, and reproducible implementation.

      Weaknesses:

      The study makes the simplifying assumptions that phenotypes are independent, which does not hold in practice, and that they are measured without noise. Real-world data are highly correlated across different levels, not only genotype but also multi-omics, so these assumptions may overstate recovery potential. The empirical evidence, while illustrative, is limited to small-scale data and idealized conditions; thus, the full practical impact remains to be demonstrated. The GTEx analysis used only whole blood eQTL data from 369 individuals, which cannot capture the complexity, sample heterogeneity, or cross-tissue dependencies typical of biobank-scale studies.

      We recognize the concern regarding the independence and noiselessness assumptions in our framework. While assuming independent, noiseless phenotypes represents an idealized scenario, it allows us to clearly demonstrate the conceptual potential of our framework. The GTEx whole blood analysis is intended as a proof-of-concept, illustrating feasibility rather than capturing full biological complexity. In the revised manuscript, we will clarify these assumptions, emphasize that practical reconstruction accuracy may be lower in correlated and noisy real-world data, and expand empirical validation to multiple GTEx tissues and independent cohorts to demonstrate robustness under more realistic conditions.

      Reviewer #2 (Public Review):

      Summary:

      This study focuses on the genomic privacy risks associated with Genome-Wide Association Study (GWAS) summary statistics, employing a three-tiered demonstration framework of “theoretical derivation, simulation experiments, real-data validation”. The research finds that when GWAS summary statistics are combined with high-dimensional phenotypic data, genotype recovery and individual re-identification can be achieved using linear programming methods. It further identifies key influencing factors such as the effective phenotype-to-sample size ratio (R/N) and minor allele frequency (MAF). These findings provide practical reference for improving data governance policies in genomic research, holding certain real-world significance.

      Strengths:

      This study integrates theoretical analysis, simulation validation, and the application of real-world datasets to construct a comprehensive research framework, which is conducive to understanding and mitigating the risk of private information leakage in genomic research.

      We are glad the reviewer values our integration of theory, simulation, and real data.

      Weaknesses:

      (1) Limited scope of variant types covered:

      The analysis is conducted solely on Single Nucleotide Polymorphisms (SNPs), omitting other crucial genomic variant types such as Copy Number Variations (CNVs), Insertions/Deletions (InDels), and chromosomal translocations/inversions. From a genomic structure perspective, variants like CNVs and InDels are also core components of individual genetic characteristics, and in some disease-related studies, association signals for these variants can be even more significant than those for SNPs. From the perspective of privacy risk logic, the genotypes of these variants (e.g., copy number for CNVs, base insertion/deletion status for InDels) can also be quantified and could theoretically be inferred backwards using the combination of “summary statistics + high-dimensional phenotypes”. Their privacy leakage risks might differ from those of SNPs (for instance, rare CNVs might be more easily re-identified due to higher genetic specificity).

      This point raises an important clarification regarding variant types beyond SNPs. We would like to clarify that our mathematical framework is not inherently restricted to SNPs. In fact, it is broadly applicable to any genetic variant that can be represented numerically, e.g., allelic dosage (0/1/2), copy number counts for CNVs, or presence/absence indicators for InDels. Conceptually, CNVs, InDels, and other structural variants can be incorporated in the same way as SNPs.

      The main limitation arises from the current availability of GWAS summary statistics for these non-SNP variant types (e.g., CNV dosages ≥ 3), which are still relatively scarce. As a result, empirically evaluating our framework on these variant classes would be challenging. In the revision, we will explicitly emphasize the general applicability of our framework to diverse genetic variants while clearly noting this practical limitation. We also plan to include simulations to investigate the recovery accuracy associated with CNVs and InDels, which will further demonstrate the extensibility of our approach. It should be noted, however, that leaking genotypic data of ordinary SNPs already raises concerns, regardless of other types of genetic variants.

      (2) Bias in data applicability scope:

      Both the simulation experiments and real-data validation in the study primarily rely on European population samples (e.g., 489 European samples from the 1000 Genomes Project; the genetic background of whole blood tissue samples from the GTEx project is not explicitly mentioned regarding non-European proportions). It only briefly notes a higher risk for African populations in the individual re-identification risk assessment, without conducting systematic analyses for other populations, such as East Asian, South Asian, or admixed American populations. Significant differences in genetic structure (e.g., MAF distribution, linkage disequilibrium patterns) exist across different populations. This may result in the R/N threshold and the relationship between MAF and recovery accuracy identified in the study not being fully applicable to other populations.

      Hence, addressing the aforementioned issues through supplementary work would enhance the study’s scientific rigor and application value, potentially providing more comprehensive theoretical and technical support for “privacy protection” in genomic data sharing.

      We acknowledge this valid concern regarding the generalizability of our findings. Our analysis already identifies MAF as a key factor influencing recovery accuracy, which begins to address population-specific genetic differences. Importantly, because our reconstruction method treats each variant independently, its success does not rely on population-specific LD patterns. The core determinant of feasibility is the ratio of phenotypic dimensions to sample size (R/N), a relationship we expect to hold across populations.

      Nevertheless, we agree that further validation across diverse ancestries would be helpful. In the revised manuscript, we will try to include additional cohorts as extended validation analyses.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      The researchers sought to determine whether Ptbp1, an RNA-binding protein formerly thought to be a master regulator of neuronal differentiation, is required for retinal neurogenesis and cell fate specification. They used a conditional knockout mouse line to remove Ptbp1 in retinal progenitors and analyzed the results using bulk RNA-seq, single-cell RNA-seq, immunohistochemistry, and EdU labeling. Their findings show that Ptbp1 deletion has no effect on retinal development, since no defects were found in retinal lamination, progenitor proliferation, or cell type composition. Although bulk RNA-seq indicated changes in RNA splicing and increased expression of late-stage progenitor and photoreceptor genes in the mutants, and single-cell RNA-seq detected relatively minor transcriptional shifts in Müller glia, the overall phenotypic impact was low. As a result, the authors conclude that Ptbp1 is not required for retinal neurogenesis and development, thus contradicting prior statements about its important role as a master regulator of neurogenesis. They argue for a reassessment of this stated role. While the findings are strong in the setting of the retina, the larger implications for other areas of the CNS require more investigation. Furthermore, questions about potential compensation by Ptbp2 warrant further research.

      Strengths: 

      This study calls into doubt the commonly held belief that Ptbp1 is a critical regulator of neurogenesis in the CNS, particularly in retinal development. The adoption of a conditional knockout mouse model provides a reliable way of eliminating Ptbp1 in retinal progenitors while avoiding the off-target effects often reported in RNAi experiments. The combination of bulk RNA-seq, scRNA-seq, and immunohistochemistry enables a thorough examination of molecular and cellular alterations at both embryonic and postnatal stages, which strengthens the study's findings. Furthermore, using publicly available RNA-seq datasets for comparison improves the investigation of splicing and expression across tissues and cell types. The work is well-organized, with informative figure legends and supplemental data that clearly show no substantial phenotypic changes in retinal lamination, proliferation, or cell fate, despite identified transcriptional and splicing modifications.

      We thank the Reviewer for their evaluation of the strengths of the study.

      Weaknesses: 

      The retina-specific method raises questions regarding whether Ptbp1 is required in other CNS locations where its neurogenic roles were first proposed. The claim that Ptbp1 is "fully dispensable" for retinal development may be toned down, given the transcriptional and splicing modifications identified. The possibility of subtle or transitory impacts, such as ectopic neuron development followed by cell death, is postulated, but not completely investigated. Furthermore, as the authors point out, the compensating potential of increased Ptbp2 warrants additional exploration. Although the study performs well in transcriptome and histological analyses, it lacks functional assessments (such as electrophysiological or behavioral testing) to determine if small changes in splicing or gene expression affect retinal function. While 864 splicing events have been found, the functional significance of these alterations, notably the 7% that are neuronal-enriched and the 35% that are rod-specific, has not been thoroughly investigated. The manuscript might be improved by describing how these splicing changes affect retinal development or function.

      We have revised the text to address these points as requested.

      Reviewer #2 (Public review): 

      Summary: 

      Ptbp1 has been proposed as a key regulator of neuronal fate through its role in repressing neurogenesis. In this study, the authors conditionally inactivated Ptbp1 in mouse retinal progenitor cells using the Chx10-Cre line. While RNA-seq analysis at E16 revealed some changes in gene expression, there were no significant alterations in retinal cell type composition, and only modest transcriptional changes in the mature retina, as assessed by immunofluorescence and scRNA-seq. Based on these findings, the authors conclude that Ptbp1 is not essential for cell fate determination during retinal development.

      Strengths: 

      Despite some effects of Ptbp1 inactivation (initiated around E11.5 with the onset of Chx10-Cre activity) on gene expression and splicing, the data convincingly demonstrate that retinal cell type composition remains largely unaffected. This study is highly significant since it challenges the prevailing view of Ptbp1 as a central repressor of neurogenesis and highlights the need to further investigate, or re-evaluate, its role in other model systems and regions of the CNS. 

      We thank the Reviewer for their evaluation of the strengths of the study.

      Weaknesses: 

      A limitation of the study is the use of the Chx10-Cre driver, which initiates recombination around E11. This timing does not permit assessment of Ptbp1 function during the earliest phases of retinal development, if expressed at that time.  

      We have revised the text to address the potential limitations of the use of the Chx10-Cre driver in this study.

      Reviewer #1 (Recommendations for the authors):

      (1) The author only selected scRNA-Seq datasets to examine the expression patterns of Ptbp1 in the retina; incorporating immunostaining analysis in the mouse retina is necessary.

      Ptbp1 expression patterns in the mouse retina are shown in Fig. 1b-1d, where Ptbp1 protein was detected by immunostaining in Chx10-Cre control and Ptbp1-KO retinas at E14, P1, and P30; these data are quantified in Fig. 1e.

      (2) In Figure 1, Ptbp1 signals were still detected in the KO mice, with the author suggesting that this may indicate cross-reactivity with an unknown epitope. Why is this unknown epitope only detected in the ganglion cell layer? Additional antibodies are needed to confirm the staining results. Furthermore, it is essential to verify the KO at the mRNA level using PCR. 

      We are unsure of the identity of this cross-reacting epitope, although it might be Ptbp2, which is enriched in immature retinal ganglion cells (Fig. S1). In any case, we do not believe that the identity of this epitope is relevant to assessing the efficiency of Ptbp1 deletion, as Ptbp1 is not detectably expressed in retinal ganglion cells (Fig. S1).

      Although the heatmap in Figure 2B indicates a decrease in Ptbp1 levels in the KO mice, the absence of statistical data makes it difficult to evaluate the KO efficiency. 

      Respectfully, we believe that Ptbp1 knockout efficiency is adequately addressed using immunohistochemistry, and that further statistical analysis is not essential here. 

      Cre staining of the Chx10-Cre;Ptbp1lox/lox mice or using reporter lines is also suggested to indicate the theoretically knockout cells. Providing high-power images of the Ptbp1 staining would help readers clearly recognize the staining signals.

      To clarify the identity of the knockout cells, we have updated Figure 1 to include the Chx10-CreEGFP staining which more clearly delineates the cells in which Ptbp1 is deleted. Regarding verification of the knockout, we believe additional PCR assays are not necessary, as we have already demonstrated efficient loss of Ptbp1 in Chx10-Cre-expressing cells at the RNA level by both single-cell RNA-sequencing and bulk RNA-sequencing, and also at the protein level by immunohistochemistry. Sun1-GFP Cre reporter lines are also used in Figures 1 and S2 to visualize patterns of Cre activity, a point which is now highlighted in the text. Together, these approaches provide sufficient evidence for effective Ptbp1 knockout. 

      (3) The possibility of ectopic neuron formation followed by cell death is intriguing but underexplored. Consider adding apoptosis assays (e.g., TUNEL staining) at early developmental stages to test this hypothesis.

      While apoptosis assays such as TUNEL staining would be helpful to address this hypothesis, we feel incorporating these additional experiments is currently beyond the scope of this study. We agree the possibility of cell death is intriguing and plan to explore this in future work.

      (4) On page 4, the statement "We did not observe any significant differences ... Chx10Cre;Ptbp1lox/lox mice (Fig. 2b,c)" should refer to Fig. 3b,c instead.

      We have changed the text to refer to Fig. 3b,c.

      (5) The labeling in Figure 3 as "Cre-Ptbp1" is inconsistent with the figure legend "Ptbp1-Ctrl.".

      This language was used because the samples for EdU staining in Figure 3 were Chx10-Cre negative Ptbp1<sup>lox/lox</sup> mice. We have updated the language in the manuscript and figure to reflect the genotypes more clearly. 

      (6) P30 mice are still sexually immature; the term "adolescent" or "juvenile" should be used instead of "adult."

      We have updated the language in the text from “adult” to “adolescent” to describe P30 mice, although the retina itself is mature by this age.

      Reviewer #2 (Recommendations for the authors):

      (1) As mentioned in the public review, a limitation of the study is that Ptbp1 KO is not induced prior to E11. The authors should acknowledge this limitation and include in the Discussion that the use of the Chx10-Cre line does not permit evaluation of a potential role for Ptbp1 during very early stages of retinal development, should it be expressed at that time (an aspect that would be important to determine).

      We have added this limitation to the Discussion in the sentence highlighted below.

      Furthermore, the use of the Chx10-Cre transgene in this study does not exclude a potential role for Ptbp1 during very early stages of retinal development prior to E11 (pg. 6).

      (2) While the data convincingly show no significant changes in retinal cell type distribution in Ptbp1 mutants, the claims in the abstract and introduction that Ptbp1 is "dispensable for retinal development" or "dispensable for the process of neurogenesis" may be overstated. Indeed, the results indicate that loss of Ptbp1 function influences retinal development by promoting neurogenesis through induction of a neuronal-like splicing program in neural progenitors. Concluding solely that Ptbp1 is dispensable for retinal cell fate specification, rather than for retinal development as a whole, would thus seem more accurate.

      We have updated the language in the text to reflect Ptbp1’s role in regulating retinal cell fate specification more clearly.

      (3) The authors conclude from Figure 5 that "No changes in the identity or composition of any retinal cell type were observed." Which statistical test was applied to support this conclusion? The figure indicates that Müller cells comprise 10.5% of the total cell population in controls versus 8.2% in Ptbp1-KO retinas. It may be important to consider the overall distribution of glia versus all neurons (rather than each neuron subtype individually). While the observed difference (~2% more glia at the expense of neurons) appears modest, it would be important to determine whether this trend is consistent and statistically significant.

      To evaluate cell type composition, we performed differential expression analysis across all major retinal cell types and compared proportional cell type representation between control and Ptbp1 KO retinas. While these analyses did not reveal marked differences in any specific cell type, we acknowledge that the scRNA-Seq dataset includes a single experimental replicate, containing two retinas in each replicate. Therefore, we cannot draw firm statistical conclusions regarding the relative distribution of glia versus neurons, and the modest difference observed in glia cell proportion should be interpreted with caution. We agree that assessing glia-to-neuron ratios across additional replicates will be important in future studies.

      (4) Referring to Figure S1 (scRNA-seq data), the authors state that Ptbp1 mRNA is robustly expressed in retinal progenitors and Müller glia in both mouse and human retina. While the immunostaining in Figure 4 indeed clearly shows strong expression in Müller cells, the scRNA-seq data presented in Figure S1 do not support the claim of "robust" expression in Müller glia in the mouse retina. This is even more striking in the human data, where panels F and H show that Ptbp1 is expressed at extremely low, certainly not "robust", levels in Müller cells. The corresponding sentence in the Results section should therefore be revised to more accurately reflect the data presented in Figure S1, or be supported by complementary immunofluorescence evidence.

      We thank the reviewer for this comment. We have revised this section of the Results to better reflect Fig S1, as follows:

      We observe high expression levels of Ptbp1 mRNA in primary retinal progenitors in both species and Müller glia in mouse retina, with weaker expression in neurogenic progenitors, and little expression detectable in neurons at any developmental age.

      (5) When mentioning potential compensation by Ptbp2, the authors may also consider discussing the possibility that compensatory mechanisms can differ between knockdown and knockout approaches. In this context, it is noteworthy that a recent study by Konar et al., Exp Eye Res, 2025 (published after the submission of the present manuscript) reports that Ptbp1 knockdown promotes Müller glia proliferation in zebrafish.

      We thank the reviewer for this suggestion. To address this, we have included a section considering this possibility in the discussion section highlighted below.

      It is also possible that compensatory mechanisms differ between knockdown and knockout approaches. Notably, a recent study (Konar et al. 2025) reported that Ptbp1 knockdown promotes Müller glia proliferation in zebrafish, suggesting that effects of acute reduction of Ptbp1 may not fully mirror those of complete loss-of-function. 

      (6) The statistical analyses were performed using a t-test. However, this parametric test is not appropriate for experiments with low sample sizes. A non-parametric test, such as the Mann-Whitney test, would be more suitable in this context. Furthermore, performing statistical analysis on n = 2 (Figure 3C) is not statistically valid.

      We thank the reviewer for this comment. We agree that non-parametric tests are more appropriate when n is small. We have added retinas for the Ptbp1-KO condition in Figure 3C (now n = 5) and reanalyzed the data with the non-parametric Mann-Whitney test. For all other datasets with sufficient replicates (n ≥ 4/genotype), parametric tests such as unpaired t-tests remain valid, and the results are consistent with non-parametric testing.
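
      The reviewer's point about n = 2 can be illustrated with a minimal sketch of an exact two-sided Mann-Whitney test computed by enumeration. The function names and all data values below are hypothetical and are not taken from the manuscript; the sketch assumes small samples, for which enumerating every group assignment is feasible. With no ties, the smallest attainable two-sided p-value is 2 / C(n1 + n2, n1), so two samples per group can never reach p < 0.05.

```python
from itertools import combinations
from math import comb


def u_statistic(x, y):
    """Mann-Whitney U for x versus y: pairs with xi > yj count 1, ties count 0.5."""
    return sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)


def exact_mann_whitney_p(x, y):
    """Exact two-sided p-value by enumerating every split of the pooled data."""
    pooled = list(x) + list(y)
    n1, n2 = len(x), len(y)
    mean_u = n1 * n2 / 2  # expected U under the null hypothesis
    observed = abs(u_statistic(x, y) - mean_u)
    indices = range(len(pooled))
    extreme = 0
    for chosen in combinations(indices, n1):
        chosen_set = set(chosen)
        gx = [pooled[i] for i in chosen]
        gy = [pooled[i] for i in indices if i not in chosen_set]
        # Count assignments at least as far from the null mean as the observed one.
        if abs(u_statistic(gx, gy) - mean_u) >= observed - 1e-12:
            extreme += 1
    return extreme / comb(n1 + n2, n1)


def min_two_sided_p(n1, n2):
    """Smallest two-sided p-value attainable without ties."""
    return min(1.0, 2 / comb(n1 + n2, n1))


if __name__ == "__main__":
    # Hypothetical, perfectly separated measurements (illustration only).
    print(exact_mann_whitney_p([1.0, 2.0], [3.0, 4.0]))  # 2/6: cannot be below 0.05
    print(min_two_sided_p(2, 2))
    print(min_two_sided_p(5, 4))  # hypothetical group sizes of 5 and 4
```

      Under these assumptions, even complete separation of two samples per group gives p = 2/6, whereas hypothetical groups of five and four samples can reach 2/126.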

      (7) Figure S3 is accompanied by only a brief explanation in the Results section (a single sentence despite the figure containing six panels), which makes it difficult for readers unfamiliar with this type of data to interpret.

      We thank the reviewer for the suggestion. To address this, we have included a more detailed explanation of Supplementary Figure S3 to better clarify our analysis of mature neuronal and glial cell types in both Ptbp1-deficient and wild-type animals. The relevant text now reads:

      Notably, splicing patterns in Ptbp1-deficient retinas showed stronger correlation with Thy1-positive neurons, which exhibit low Ptbp1 expression, and minimal overlap with microglia and auditory hair cells, the adult cell types with the highest Ptbp1 levels (Fig. S3).

      Gene expression and splicing changes were compared across several reference tissues: heart tissue and Thy1-positive neurons, mature hair cells, microglia, and astrocytes (Fig. S3a,b). A heatmap of differentially expressed genes showed that while Ptbp1-deficient retinas diverged from WT retinas, their expression profiles did not resemble those of fully differentiated cell types like rods, astrocytes, or adult WT retina (Fig. S3c). Consistently, Pearson correlation analysis revealed that Ptbp1-deficient and WT retinas were more similar to each other than to fully differentiated neuronal or glial populations (Fig. S3d). Splicing profile analysis further revealed that while there was high correlation of PSI between Ptbp1-deficient and WT retinas, Ptbp1-deficient retinas more closely resembled Thy1-positive neurons, whereas WT retinas aligned more strongly with mature cells such as astrocytes, microglia, and auditory hair cells (Fig. S3e,f). Together, these results suggest that although Ptbp1 loss induces hundreds of alternative splicing events, the magnitude of PSI changes in the KO retinas remains considerably lower than that seen in fully differentiated cell types (Extended Data 3). Thus, while a subset of splicing events overlaps with those characteristic of mature neurons or rods, the overall splicing and expression profiles of KO retinas are more similar to those of developing retinal tissue than to those of terminally differentiated neuronal or glial populations.

      (8) To assess progenitor proliferation, the authors performed EdU labeling experiments in P0 retinas. Is there a rationale for not examining earlier developmental time points to evaluate potential effects on early RPCs?

      We thank the reviewer for this comment. We chose to perform EdU labeling experiments at P0 for several reasons. At P0, RPCs are actively proliferating and represent ~35% of all retinal cells, and the retina is transitioning to intermediate-to-late-stage development, which provides sufficient time to ensure efficient and widespread disruption of Ptbp1. Earlier embryonic timepoints were not examined here, as addressing all stages of development was beyond the scope of the current study. However, we agree that investigating whether Ptbp1 plays stage-specific roles in early RPCs is an important question and a potential future direction.

      (9) In Figure S2, panel D shows staining in GCL under the Ptbp1 condition that does not make sense and is inconsistent with panel C. If possible, the authors should provide an alternative image to prevent any confusion.

      Thank you for bringing this to our attention. The image shown for Ptbp1-KO in Figure S2d shows Sun1-eGFP labeling, which labels every cell affected by the Cre condition. The genotype for this mouse was Chx10-Cre;Ptbp1lox/lox;Sun1-GFP. We apologize for any confusion and have updated the genotype in the figure legend.

      (10) The authors should revise the following sentence at the end of the Discussion section, as its meaning is unclear: "...and conditions for in vitro analysis may have accurately replicated conditions in the native CNS."

      We thank the reviewer for this comment and have revised this sentence in the Discussion to read as follows.

      Previous studies using knockdown may have been complicated by off-target effects (Jackson et al. 2003), and conditions for in vitro analysis may not have accurately replicated conditions in the native CNS.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Reviews):

      Weaknesses:

      A limitation of the study is the reliance on standard techniques; however, this is a minor concern that does not diminish the overall impact or significance of the work.

      We agree that standard techniques were utilized. We believe this approach enhances the reliability and reproducibility of our findings. These methods are well-validated in the field and allow for robust interpretation of the results presented.

      Reviewer #2 (Public Reviews):

      Weaknesses: 

      (1) Clarify the strain background of the DBA/2J GPNMB+ mice: While DBA/2J GPNMB+ is described as a control, it would help to explicitly state whether these are transgenically rescued mice or another background strain. Are they littermates, congenic, or a separate colony?

      The following language was added to the manuscript: “The DBA/2J GPNMB+ mice are a coisogenic strain purchased from Jackson Laboratories. Jackson Laboratories generated these mice by knocking in the wild-type allele of Gpnmb into the DBA/2J background. By doing so, they rescued the phenotype of the DBA/2J mice. This description has been highlighted in our previous publications (Abdelmagid et al., 2014; Abdelmagid et al., 2015).”

      (2) Provide exact sample sizes and variance in all figure legends: Some figures (e.g., Figure 2 panels) do not consistently mention how many replicates were used (biological vs. technical) for each experimental group. Standardizing this across all panels would improve reproducibility.

      The manuscript has been updated to include replicates in each figure legend.

      (3) Expand on potential sex differences: The DMM model is applied only in male mice, which is noted in the methods. It would be helpful if the authors added 1-2 lines in the discussion acknowledging potential sex-based differences in OA progression and GPNMB function. 

      To our knowledge, no sex-based differences in OA progression and GPNMB function have been reported in the literature. It was initially reported that only male C57BL/6J mice (Jackson Laboratories) develop OA following DMM; however, recent literature has shown that both male and female mice develop the disease (Hwang et al., 2021; Ma et al., 2007). For the purpose of this manuscript, only male mice were used to provide preliminary results; however, we plan to repeat the included studies in female mice in the near future.

      (4) Visual clarity in schematic (Figure 7): The proposed mechanism is helpful, but the text within the schematic is somewhat dense and could be made more readable with spacing or enlarged font. Also, label the MAPK/ERK pathway explicitly in panel B.

      We updated the schematic diagram in figure 7 and the figure legend.

      Reviewer #1 (Recommendations for the Authors):

      Several concerns must be addressed to improve the clarity and scientific rigor of the manuscript: 

      (1) Abstract: Specify which MMPs and MAPKs are modulated by osteoactivin.

      We specified the MMPs and clarified that GPNMB plays a role in pERK inhibition following inflammation induced by IL-1β stimulation. 

      (2) Human explant validation: The regulation of MMP-9, MMP-13, and IL-6 should be validated in the human cartilage explant model to support the claim that "GPNMB has an anti-inflammatory role in human primary chondrocytes" (line 123). Additionally, the anatomical origin of the explants must be stated.

      Thank you very much for the recommendation. We agree that validating the explant culture for MMP-9, MMP-13, and IL-6 would strengthen our data. Unfortunately, this experiment has been terminated and we no longer have access to the tissue. Human explants were obtained from discarded knee articular cartilage following arthroplasty. The manuscript has been updated to include this information.

      (3) DBA/2J GPNMB expression: GPNMB is known to be produced as a truncated protein in DBA/2J cells. The manuscript should address why its expression is reduced. Does this involve mRNA instability? Also, the nomenclature "DBA/2J GPNMB+" versus "DBA/2J" is confusing, especially since both mRNA and protein are still detectable, albeit at reduced levels. Figure 2C is not convincing; therefore, Figures 2C and 2D can be omitted.

      The following language was added to the manuscript: “Our results are consistent with the literature, which shows that the GPNMB gene in DBA/2J mice carries a nonsense mutation that leads to reduced RNA stability (Anderson et al., 2008).” We appreciate that the nomenclature "DBA/2J GPNMB+" versus "DBA/2J" could be confusing. However, this is the standard language used in multiple publications, and we want to remain consistent with the literature. Based on your recommendation, we have removed Figure 2C and 2D and updated the Methods and Results sections accordingly.

      (4) Figures 2J-L: The claim that gene expression changes are "significantly higher in DBA/2J animals compared to fold changes seen in chondrocytes from DBA/2J GPNMB+ controls" is not supported by the current presentation. The data should be plotted on the same graphs, and appropriate statistical analysis (e.g., two-way ANOVA) must be performed.

      Graphs for figure 2 have been updated and the appropriate analyses have been performed. 

      (5) Figure 6: The GPNMB expression data in the presence and absence of IL-1β at 0 and 10 minutes are missing.

      We apologize for the confusion. We corrected the mistake and removed the mention of the timepoints 0 and 10 minutes.  

      Reviewer #2 (Recommendations for the Authors):

      Consider unifying terminology around "GPNMB" and "osteoactivin": The term "osteoactivin" is used in some contexts and "GPNMB" in others. Since the focus is GPNMB's role in cartilage, suggest using a single term throughout to prevent confusion.

      Thank you for your comment. We include osteoactivin for clarification purposes once in the abstract, introduction and discussion. 

      In summary, we believe we have addressed all comments/concerns raised by the reviewers. We appreciate the opportunity to improve the quality of our manuscript.

      References

      Abdelmagid, S. M., Belcher, J. Y., Moussa, F. M., Lababidi, S. L., Sondag, G. R., Novak, K. M., Sanyurah, A. S., Frara, N. A., Razmpour, R., & Del Carpio-Cano, F. E. (2014). Mutation in osteoactivin decreases bone formation in vivo and osteoblast differentiation in vitro. The American journal of pathology, 184(3), 697-713. 

      Abdelmagid, S. M., Sondag, G. R., Moussa, F. M., Belcher, J. Y., Yu, B., Stinnett, H., Novak, K., Mbimba, T., Khol, M., Hankenson, K. D., Malcuit, C., & Safadi, F. F. (2015). Mutation in Osteoactivin Promotes Receptor Activator of NFκB Ligand (RANKL)-mediated Osteoclast Differentiation and Survival but Inhibits Osteoclast Function. J Biol Chem, 290(33), 20128-20146. https://doi.org/10.1074/jbc.M114.624270

      Anderson, M. G., Nair, K. S., Amonoo, L. A., Mehalow, A., Trantow, C. M., Masli, S., & John, S. W. (2008). GpnmbR150X allele must be present in bone marrow derived cells to mediate DBA/2J glaucoma. BMC genetics, 9(1), 1-14.

      Hwang, H., Park, I., Hong, J., Kim, J., & Kim, H. (2021). Comparison of joint degeneration and pain in male and female mice in DMM model of osteoarthritis. Osteoarthritis and Cartilage, 29(5), 728-738.

      Ma, H.-L., Blanchet, T., Peluso, D., Hopkins, B., Morris, E., & Glasson, S. (2007). Osteoarthritis severity is sex dependent in a surgical mouse model. Osteoarthritis and Cartilage, 15(6), 695-700.

    1. Author Response

      Reviewer #3 (Public Review):

      Myelodysplastic syndrome (MDS) is a heterogenous, clonal hematopoietic stem cell disorder characterized by morphological dysplasia in one or more hematopoietic lineages, cytopenias (most frequently anemia), and ineffective hematopoiesis. In patients with MDS, transfusion therapy treatment causes clinical iron overload; however it has been unclear if treatment with iron chelation yields clinical benefits. In the present study, the authors use a transgenic mouse model of MDS, NUP98-HOXD13 (referred to here as "MDS mice") to investigate this area. Starting at 5 months of age (before MDS mice progress to acute leukemia), the authors administered DFP in the drinking water for 4 weeks, and compared parameters to untreated MDS mice and WT controls.

      The authors first show that MDS mice exhibit systemic iron overload and macrocytic anemia that is improved by treatment with the iron chelator deferiprone (DFP). They then perform a detailed characterization of the effects of DFP treatment on erythroid differentiation and various parameters related to iron transport and trafficking in MDS erythroblasts. Strengths of the work are the use of a well-characterized mouse model of MDS with appropriate animal group sizes and detailed analyses of systemic iron parameters and erythroid subpopulations. A remediable weakness is that in certain areas of the Results and Discussion, the authors overinterpret their findings by inferring causation when they have only shown a correlation. Additionally, when drawing conclusions based on changes in erythroblast mRNA expression levels between groups, the authors should consider that translation efficiency may be altered in MDS and that the NUP98 fusion protein itself, by acting as a chimeric transcription factor, may also impact gene expression profiles. Given that the application of chelators for treatment of MDS remains controversial, this work will be of interest to scientists focused on erythroid maturation and iron dysregulation in MDS, as well as clinicians caring for patients with this disorder.

      Major Comments

      1) The authors define the stages of erythroblast differentiation using the CD44-FSC method, which assumes that CD44 expression levels during the stages of erythroid differentiation are not altered by MDS itself. Are morphologically abnormal erythroblasts, such as bi-nucleate forms, captured in this analysis, and if so, are they classified in the appropriate subset? The percentage of erythroblasts in the bone marrow of MDS mice in this current study is lower than that reported by Suragani et al (Nat Med 2014), who employed a different strategy to define erythroid precursors. While representative erythroblast gating is presented as Supplemental Figure 17, it would be important to present representative gating from all 3 animal groups: WT, MDS, and MDS+DFP mice.

      We appreciate this comment and have added representative gating for all 3 groups to Supplemental Figure 17 (new Figure 3 – figure supplement 6 in the revised manuscript).

      2) Methods, "Statistical analysis." The authors state that all comparisons were done with a 2-tailed paired Student's t-test, which would not be appropriate for comparisons made between independent animal groups (i.e. when groups are not "paired").

      We appreciate this comment and have reanalyzed all revised mouse data using one-way ANOVA with multiple comparisons and Tukey post-test analyses when more than 2 groups were compared. This has been edited in the Methods section in the revised manuscript.
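
      For readers less familiar with the change of test, the following is a minimal sketch of the F statistic behind a one-way ANOVA on independent groups. The function name and all values are hypothetical and are not data from the study. The sketch computes only F and its degrees of freedom; the p-value and the Tukey post-test would normally come from a statistics package.

```python
from statistics import fmean


def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for independent groups of measurements."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean(v for g in groups for v in g)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Hypothetical values for three independent groups (illustration only).
    wt = [1.0, 2.0, 3.0]
    mds = [2.0, 3.0, 4.0]
    mds_dfp = [3.0, 4.0, 5.0]
    print(one_way_anova_f(wt, mds, mds_dfp))
```

      Unlike a paired t-test, this calculation never matches individual animals across groups, which is why it suits comparisons between independent cohorts.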

      3) The Results (p.7) indicates that both sexes showed similar responses to DFP; however, the figure legends do not indicate sex. Given that systemic iron metabolism in mice shows sex-related differences, sex should be specified.

      We appreciate this comment and present here the sex-specific data for the reviewers’ evaluation (Author response image 1). Similarly elevated transferrin saturation (a) (n = 3-4 male mice/group and n = 4-6 female mice/group) and hemoglobin (b) (n = 4-6 male mice/group and n = 4-9 female mice/group) are observed in male and female DFP-treated MDS mice. (c) Bone marrow erythroblasts are decreased to a greater degree in male relative to female DFP-treated MDS mice (n = 4-7 male mice/group and n = 8-9 female mice/group). We have added the data on sex-specific measures to new Figure 1 - figure supplement 3, Figure 2 – figure supplement 1, and Figure 3 – figure supplement 1 in the revised manuscript.

      Author response image 1.

    1. Author Response

      Reviewer #1 (Public Review):

      The manuscript by Xu et al. provides a very thorough characterization and molecular dissection of the role of SSH2 in spermatogenesis. Loss of SSH2 in germ cells results in germ cell arrest in step 2-3 spermatids and eventually leads to germ cell loss by apoptosis. Molecular characterization of the mutant mice shows that the loss of SSH2 prevents the fusion of proacrosomal vesicles, leading to the formation of a fragmented acrosome. The fragmentation of the acrosome is due to impaired actin bundling and dephosphorylation of COFILIN. In short, this is a comprehensive body of work.

      We thank the referee for these insightful comments.

      Reviewer #2 (Public Review):

      The acrosome is a unique sperm-specific subcellular organelle required for the fertilization process, and it is also an organelle undergoing extensive morphological and structural transformation during sperm development. The mechanism underlying the extensive acrosome morphogenesis and biogenesis remains incompletely understood. Xu et al., in their manuscript entitled "The Slingshot phosphatase 2 is required for acrosome biogenesis during spermatogenesis in mice", reported that the Slingshot Phosphatase 2 is essential for acrosome biogenesis and male fertility through their characterization of spermatogenic and acrosomal defects in the Ssh2 knockout mice they generated. Specifically, the authors provided molecular, genetic, and subcellular evidence supporting that Ssh2 mutation impaired the phosphorylation of an actin-binding protein, COFILIN, during spermiogenesis and accordingly actin cytoskeleton remodeling, crucial for proacrosomal vesicle trafficking and acrosome biogenesis. The manuscript by Xu et al. provides a very thorough characterization and molecular dissection of the role of SSH2 in spermatogenesis. Loss of SSH2 in germ cells results in germ cell arrest in step 2-3 spermatids and eventually leads to germ cell loss by apoptosis. Molecular characterization of the mutant mice shows that the loss of SSH2 prevents the fusion of proacrosomal vesicles, leading to the formation of a fragmented acrosome. The fragmentation of the acrosome is due to impaired actin bundling and dephosphorylation of COFILIN. In short, this is a comprehensive body of work.

      We appreciate and thank Referee #2 for the positive feedback and insightful comments.

      Strengths:

      This nicely written manuscript addresses an important mechanistic question about the roles of cytoskeleton remodeling in acrosome biogenesis and provides genetic, subcellular, and molecular evidence to support the hypothesis that Ssh2 regulates actin cytoskeleton remodeling, a process essential for proacrosomal vesicle trafficking and acrosome biogenesis, through dephosphorylation of an actin-binding protein during spermiogenesis.

      We again thank to the Referee #2 for appreciating and encouraging us regarding our current research work.

      Weaknesses:

      For the body weight and testis weight of the mutants, the authors concluded that there is no significant difference between mutant and wild type (Fig 1E-1G), but they appear to have used mice 6-8 weeks old. Both testis and body weight are still increasing in males at 6-8 weeks, and with only six mice analyzed, a significant difference in testis size and/or body weight could easily be missed given such a varied age and small sample size.

      We thank the referee for prompting this important discussion point, which we now cover in our revised manuscript. In our originally submitted manuscript, we presented body weight, testis weight, and T/B ratio data only for mice aged 6–8 weeks. In the revised manuscript, we have added data from mice older than 8 weeks in a new Figure 1E-1G, with a sample size of 12 for each genotype. We have also updated the relevant content in the figure caption. The revised figure caption for Figure 1 panels E–G reads as follows: “(E-G) Body weights (26.3609 ± 0.4914 for WT; 25.1741 ± 0.5189 for Ssh2 KO), weights of the testes (0.0862 ± 0.0036 for WT; 0.0788 ± 0.0023 for Ssh2 KO), and the testis-to-body weight ratio (0.3281 ± 0.0153 for WT; 0.3154 ± 0.0135 for Ssh2 KO) of adult WT and Ssh2 KO males (n = 12). Data are presented as the mean ± SEM; p > 0.05 calculated by Student’s t-test. Bars indicate the range of the data.”
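
      As a rough consistency check on the caption values quoted above, the t statistic can be recomputed from the means and SEMs alone. The sketch below assumes two independent groups of equal size (n = 12 each), in which case the pooled-variance Student's t statistic reduces to the difference in means divided by the root sum of the squared SEMs. The function name and the hard-coded two-sided 5% critical value for 22 degrees of freedom (about 2.074) are illustrative choices and are not taken from the manuscript.

```python
import math

def t_from_sem(mean_a, sem_a, mean_b, sem_b):
    """Two-sample t statistic from group means and SEMs (equal n assumed)."""
    return (mean_a - mean_b) / math.sqrt(sem_a**2 + sem_b**2)

# Values copied from the revised Figure 1E-G caption (WT vs Ssh2 KO, n = 12 each).
measurements = {
    "body weight": (26.3609, 0.4914, 25.1741, 0.5189),
    "testis weight": (0.0862, 0.0036, 0.0788, 0.0023),
    "testis/body ratio": (0.3281, 0.0153, 0.3154, 0.0135),
}

# Assumed two-sided 5% critical value of Student's t for df = 12 + 12 - 2 = 22.
T_CRIT_DF22 = 2.074

t_values = {name: t_from_sem(*vals) for name, vals in measurements.items()}
for name, t in t_values.items():
    verdict = "p > 0.05" if abs(t) < T_CRIT_DF22 else "p < 0.05"
    print(f"{name}: t = {t:.2f} ({verdict})")
```

      Under these assumptions, all three statistics fall below the critical value, which agrees with the reported p > 0.05.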

      Other points:

      Comments: 1) Could the uniform cytoplasmic distribution of diminutive actin filaments in the wild type and disrupted actin filament remodeling be examined at the EM level on the round spermatids?

      We apologize for the confusion. We previously performed transmission electron microscopy (TEM) on testis samples to examine the distribution and ultrastructural organization of F-actin in WT and Ssh2 KO round spermatids. Unfortunately, even at high magnification (30,000x, right panel of Figure R1-Response Figure 1), TEM of testicular sections revealed no diminutive actin filaments in the cytoplasm of round spermatids. The only actin-related structures observed were the acroplaxome (an actin-rich specialized structure that anchors the acrosome) in WT spermatids and some thick bundle-like structures located at the acrosomal region of Ssh2 KO spermatids (Fig. R1). Based on their characteristic appearance, we interpreted these electron-dense bundles as aberrantly aggregated actin filaments whose lengths accord with those of COFILIN-saturated F-actin fragments (Bamburg et al., 2021), suggesting that Ssh2 KO disrupts actin filament remodeling during acrosome biogenesis. However, owing to the technological limitations of TEM and the complexity of the intracellular environment of round spermatids, we recognized only a few aggregated actin bundles that had lost their filamentous appearance in Ssh2 KO spermatids, and we detected no typical diminutive actin filaments of the kind imaged by high-resolution cryo-TEM (Haviv et al., 2008) or live-cell total internal reflection fluorescence microscopy (Johnson et al., 2015) in purified actin bundles and cultured cells. Given the lack of effective approaches for culturing murine round spermatids in vitro, confocal microscopy of fluorescence-labelled F-actin (e.g., IF staining with FITC-phalloidin) is a more accessible method than EM for visualizing the disruption of actin remodeling in murine spermatids, as demonstrated by actin-related findings in several other studies (Djuzenova et al., 2015; Meenderink et al., 2019).

      Comments: 2) Any other defects are seen besides acrosome in the mutant testis given the important roles of actin cytoskeleton network and high expression of Ssh2 in spermatocytes, were chromatoid bodies or mitochondria affected in any way? Any other defects in the mice overall including female fertility and other organs, given the previously reported roles in the nervous system. It could be helpful information for others interested in Ssh 2 protein and actin cytoskeleton's roles in general.

      The referee has raised an interesting point here. First, besides the acrosome-related defects in Ssh2 KO spermatids, we identified increased germ cell apoptosis and aberrant activation of the apoptotic Bcl-2/Caspase-3 pathway in the testes of Ssh2 KO mice. We speculate that these are triggered by disordered COFILIN-mediated F-actin remodeling, and we intend to elucidate the underlying mechanisms in the future. Second, given the high expression of SSH2 in spermatocytes demonstrated by the IF staining shown in Figures 4B and 4C, we performed surface chromosome spreading on spermatocytes to determine whether the morphology of chromatid bodies and meiotic progression were affected by Ssh2 KO; no obvious defects were observed, as shown in Supplementary Figure S3 of the originally submitted manuscript. Third, no obvious morphological abnormality in chromatin or mitochondrial structure was detected under TEM in Ssh2 KO germ cells such as spermatocytes and round spermatids, so we did not pursue this further. Fourth, we examined the potential effect(s) of Ssh2 KO on female fertility using Ssh2 KO female mice and did not find any obvious fertility defect in Ssh2 KO females compared with their WT littermates, as demonstrated by the body weight, ovary weight, ovary-to-body weight ratio, ovary size and fertility test data, as well as the images of ovarian HE staining (Fig. R1). Moreover, although roles for SSH2 in neurite extension have been reported (Endo, Ohashi, & Mizuno, 2007), Ssh2 KO males and females did not manifest any defective physical development, aberrant physiological status or mental disorder during our investigation period. We therefore did not conduct experiments on the effect(s) of SSH2 in other organs, apart from the assessment of female fertility.

      Fig. R1 No reproductive defects were found in Ssh2 KO females. (A-C) Body weights, weights of the ovaries, and the ovary-to-body weight ratio of adult WT and Ssh2 KO females aged 8-10 weeks (n = 5); p > 0.05 calculated by Student’s t-test. Bars indicate the range of the data. (D) The size of ovaries from Ssh2 KO mice was indistinguishable from that of ovaries from WT mice at 8 weeks of age (n = 4). (E) Histology of the ovaries from WT and Ssh2 KO mice. Sections were stained with hematoxylin and eosin. Scale bars: 200 μm. Images are representative of ovaries extracted from 8-week-old adult female mice per genotype. (F) Number of pups per litter from WT and Ssh2 KO female mice (8 weeks old) after crossing with WT adult male mice (n = 3); p > 0.05 calculated by Student’s t-test. Bars indicate the range of the data.

      Comments: 3) Providing detailed information on the number of animals used and cells analyzed in the legend is nice, but it might be even better for the readers to include sample size and the number of cells examined in the figure/graph if possible.

      We appreciate the reviewer's suggestions. We have added sample size information to the figures where appropriate. First, we added the sample size to Figures 1C, 1E, 1F, 1G and 1I. Second, we included the sample size and the number of seminiferous tubules/epididymal ducts evaluated for TUNEL (+) cell counting in Figures 2C and 2D. Third, we included the sample size and the number of spermatids evaluated for co-localization in Figures 6B and 6D.

      Comments: 4) Nice discussion and comparison with GOPC and GM130, how about comparison and discussion with other acrosome defective mutants like PICK1, and ATG to provide some insights into acrosome biogenesis and proacrosomal vesicle trafficking?

      We greatly appreciate the referee's positive appraisal of our work and the constructive suggestions. Unfortunately, we are unable to address these defective mutants with certainty because of limited sample availability (only three 16-month-old Ssh2 KO mice are accessible now). As described in the originally submitted manuscript, we compared the cytological staining of GM130 and GOPC in WT and Ssh2 KO spermatids using tubule squash sections prepared from fresh testes of 8-week-old mice; we now have only a few aged Ssh2 KO mice, which prevents us from performing the staining of PICK1 and ATG. PICK1 was previously reported to facilitate vesicle trafficking from the Golgi apparatus to the acrosome and to co-localize with GOPC in the proacrosomal granules (Xiao et al., 2009), and the phenotypes of Pick1 KO mice share many characteristics with those of Ssh2 KO mice, such as fragmentation of the acrosome and increased germ cell apoptosis. Both autophagy-related ATG5 (Huang et al., 2021) and ATG7 (Wang et al., 2014) were reported to participate in acrosome biogenesis, and ATG7 is required for proacrosomal vesicle transportation/fusion by conjugating LC3 to the membrane of proacrosomal vesicles. Although spermatids in these KO mouse models can still develop into spermatozoa with defective acrosomes, unlike the situation in Ssh2 KO mice, it would be meaningful to determine the effects of Ssh2 KO on the localization of these regulators of acrosome biogenesis in spermatids and their potential interactions with SSH2. Indeed, we plan to pursue these issues in future work, and content related to PICK1 has been added to the discussion in the revised manuscript as follows: “Moreover, it is intriguing to note that the phenotypes of Ssh2 KO mice share a lot of similarities with that of Pick1 KO model (Xiao et al., 2009) such as acrosome fragmentation and enhanced germ cell apoptosis, suggesting the possibility that SSH2 and PICK1 work together in a same trafficking machinery functioning in acrosome biogenesis which needs to be clarified further.”

      Comments: 5) Given the literature on Cofilin's requirement for male fertility and the increased p-Cofilin in Ssh2 mutant testis by Western and IF, the authors have a strong case for their hypothesis. But given the general role of phosphatase, it might be prudent to discuss alternative possibilities.

      We thank the reviewer for these valuable suggestions. Given that p-COFILIN is the only known substrate of SSH2 based on previous reports, we focused principally on this cascade in our investigation. As a phosphatase, SSH2 is very likely to interact with many proteins other than actin-binding proteins that function in various cellular processes, but these interactions remain elusive. As directed, we have now added content addressing this concern to the discussion section of the revised manuscript as follows: “Given the diverse physiological roles reported for Slingshot family proteins, the possibility of the alternative mechanism underlying involvement of SSH2 in cellular events beyond the COFILIN-mediated actin remodeling should be noted. According to some publicly accessible databases as the indicators of potential protein–protein interactions such as BioGRID (Oughtred et al., 2019) and IntAct (Del Toro et al., 2022), SSH2 might interact with a set of actin-based molecular motors covering MYH9, MYO19 and MYO18A, which have been implicated in the maintenance of Golgi morphology and Golgi anterograde vesicular trafficking via the PI4P/GOLPH3/MYO18A/F-actin pathway (Rahajeng et al., 2019).”

    1. Author Response

      Reviewer #2 (Public Review):

      Zylbertal and Bianco propose a new model of trial-to-trial neuronal variability that incorporates the spatial distance between neurons. The 7-parameter model is attractive because of its simplicity: A neuron's activity is a function of stimulus drive, neighboring neurons, and global inhibition. A neuroscientist studying almost any brain area in any model organism could make use of this model, provided that they have access to 1) simultaneously-recorded neurons and 2) the spatial locations of those neurons. I could foresee this model being the de-facto model to compare to all future models, as it is easy to code up and interpret. The paper explores the effectiveness of this distance model by modeling neural activity in the zebrafish optic tectum. They find that this distance-based model can capture 1) bursting found in spontaneous activity, 2) ongoing co-fluctuations during stimulus-evoked activity, and 3) adaptation effects during prey-catching behavior.

      Strengths:

      The main strength of the paper is the interpretability of the distance-based model. This model is agnostic to the brain area from which the population of neurons is recorded, making the model broadly applicable to many neuroscientists. I would certainly use this model for any baseline comparisons of trial-to-trial variability.

      The model is assessed in three different contexts, including spontaneous activity and behavior. That the model provides some prediction in all three contexts is a strong indicator that this model will be useful in other contexts, including other model organisms. The model could reasonably be extended to other cognitive states (e.g., spatial attention) or accounting for other neuron properties (such as feature tuning, as mentioned in the manuscript).

      The analyses and intuition to show how the distance-based model explains adaptation were insightful and concise.

      We thank the reviewer for these supportive comments.

      Weaknesses:

      Model evaluation and comparison: The paper does not fully evaluate the model or its assumptions; here, I note details in which evaluation is needed. A key assumption of the model, that correlations fall off in a gaussian manner (Fig. 1C-E), is not supported by Fig. 1C, which appears to have an exponential fall-off. Functions other than gaussian may provide better fits.

      A key feature of our model is that connection strengths smoothly decrease with distance. However, we did not intend to make strong claims about the exact function parametrizing this distance relationship. In light of the reviewer’s comment, we have additionally tested an exponential function and find that it too can describe activity correlations in OT with a negligible decrease in r^2 (Figure 1 – figure supplement 1A-C). The main purpose of the analysis was to show that the correlation is maximal around the seed and decays uniformly with distance from it (i.e. no sub-networks or cliques are detected). We have emphasized this in a revised conclusion paragraph and note that while multiple functions can be used to parameterize the relationship, they are nonetheless certainly simplifications. Secondly, we also ran a version of the network simulation where the connections decay in space according to an exponential rather than Gaussian function and show that, as expected, tectal bursting is robust to this change.
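
      To make the Gaussian-versus-exponential comparison concrete, the following minimal sketch fits both fall-off shapes to correlation-versus-distance data and compares goodness of fit. It is not the procedure used in the manuscript, which fits 2-d functions around each seed: here the data are synthetic and one-dimensional, both fits are done by least squares on log-transformed correlations, and the helper name, amplitude and length constant are hypothetical choices.

```python
import math

def linear_fit_r2(x, y):
    """Ordinary least squares y = a + b*x; returns (a, b, r2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return a, b, 1.0 - ss_res / ss_tot

# Synthetic, noise-free correlations that decay exponentially with distance
# from the seed (the 40 um length constant is an arbitrary illustrative value).
distances = [5.0 * k for k in range(1, 41)]  # the seed itself (d = 0) is excluded
corr = [0.6 * math.exp(-d / 40.0) for d in distances]

log_corr = [math.log(c) for c in corr]
# Exponential: log c = log A - d / lambda       -> linear in d
_, _, r2_exp = linear_fit_r2(distances, log_corr)
# Gaussian:    log c = log A - d^2 / (2 s^2)    -> linear in d^2
_, _, r2_gauss = linear_fit_r2([d * d for d in distances], log_corr)

print(f"r^2 exponential: {r2_exp:.3f}, r^2 Gaussian: {r2_gauss:.3f}")
```

      Because these r^2 values are computed in log space on idealized data, they only illustrate how the two shapes can be compared; they say nothing about which function fits the tectal data better.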

      Furthermore, it is not clear whether the r^2s in Fig. 1E are computed in a held-out manner (more details about what goes into computing r^2 are needed).

      These values are computed by fitting the 2-d Gaussian (or exponential function) to all neurons excluding the seed itself (added a short clarification in the Methods).

      Assessing the model based on peak location alone (Fig. 1E) is not sufficient, as other smooth monotonically-decreasing functions may perform similarly.

      As discussed above, an exponential function indeed performs similarly to a Gaussian. However, goodness of fit is secondary to the main aim of Fig 1E, which is to show that the correlation peak tends to fall near the seed cell.

      Simulating from the model greatly improves the reader's understanding (Fig. 2D), but no explanation is given for why the simulations (Fig. 2D) have almost no background spikes and much fewer, non-co-occurring bursts than those of real data (Fig. 2E).

      In part this is because the simulation results depicted in Fig 2D were derived from the ‘baseline model’, prior to optimizing to match biological bursting statistics. It is thus expected that activity will differ from experimental observation, and this was our main motive for tuning the model parameters (now emphasized in the text). However, the model will certainly not account for all aspects of tectal activity; rather, it was designed to reproduce bursting as a prominent feature of ongoing activity and in the second part of the paper we explore the extent to which it can account for other phenomena. As noted above, in the revised abstract, introduction and discussion we have tried to clarify the motivation for developing the model and how it was used to gain insight into activity-dependent changes in network excitability.

      A key assumption of the distance model (Fig. 2A) is that each neuron has the same gaussian fall-off (i.e., sigma_excitation and sigma_inhibition), but it is unclear if the data support this assumption.

      We intentionally opted for a simple model (i.e. described by few parameters), in part due to the lack of connectivity data and additionally to set a lower bound on the extent to which multiple features of tectal activity could be accounted for. More complex models with additional degrees of freedom (such as cell-specific connectivity) may well describe the data better, but likely at the cost of interpretability. We consider such extensions beyond the scope of the present study, but they might be fruitful avenues for future research.

      Although an excitatory and inhibitory gain is assumed (Fig. 2A), it is not clear from the data (Fig. 1C) that an inhibitory gain is needed (no negative correlations are observed in Fig. 1C-D).

      This is now explored in the revised Figure 3A which includes the condition of zero inhibition gain. See also response to reviewer 1.

      After optimization (Fig. 3), the model is evaluated on predicting burst properties but not evaluated on predicting held-out responses (R^2s or likelihoods), and no other model (e.g., fitting a GLM or a model with only an excitatory gain) is considered. In particular, one may consider a model in which "assemblies" do exist - does such an assembly model lead to better held-out prediction performance?

      The model we developed is a mechanistic, generative model. In contrast to Pillow et al 2008, we did not fit the model to data but rather we used it to simulate network activity and tuned the seven parameters (using EMOO) to best match biological observations. Thus, rather than assessing goodness-of-fit using cross-validation, our approach involved comparison of summary statistics related to the target emergent phenomenon (tectal bursting). This was necessary as bursting appears highly stochastic. Further to the comments above, we have expanded the parameter space to include instances with only an excitatory gain (where bursting failed) and no distance-dependence (again, bursting failed). Introducing assemblies into the model will inevitably support bursting (and introduce many more free parameters), but one of our key observations is that such assemblies are not required for this aspect of spontaneous activity. Again, our aim was not to produce a detailed picture of tectal connectivity, but rather to develop a minimal model and estimate the extent to which it can account for observed features of activity. Note that the second half of the paper (Figure 4 onwards) shows the model can explain phenomena that were not considered during parameter tuning.

      It is unclear why a genetic algorithm (Fig. 1A-C) is necessary versus a grid search; it appears that solutions in Generation 2 (Fig. 3C, leftmost plot, points close to the origin) are as good as solutions in Generation 30 and that the spreads of points across generations do not shrink (as one would expect from better mutations). Given the small number of parameters (7), a grid search is reasonable, computationally tractable, and easier to understand for all readers (Fig. 3A).

      Perhaps in hindsight a grid search would have worked, but at increased computational cost (each instantiation of the model is computationally expensive). At the time we chose EMOO, and since it produced satisfactory results, we kept it. As often happens with multi-objective optimization, an improvement in one objective usually happens at the expense of other objectives, so the spread of the points does not shrink much but they move closer to the axes (i.e. reduced error). The final parameter combination is closer to the origin than any point in generation 2, though admittedly not by much. Importantly, however, optimizing the model using the training features generalized to other burst-related statistics.
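
      The trade-off between objectives described above is usually expressed through Pareto dominance. The following is a minimal, generic sketch of non-dominated filtering and not the EMOO implementation used in the study, whose details are not given in this response. The candidate error pairs are hypothetical, and lower error is assumed to be better for every objective.

```python
def dominates(a, b):
    """True if error vector a is no worse than b everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Return the points that no other point dominates (errors are minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical (error_1, error_2) pairs for candidate parameter sets.
candidates = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9), (0.6, 0.6), (0.4, 0.45)]
front = pareto_front(candidates)
print(front)
```

      In this toy example the surviving points remain spread out along the front, because none of them beats the others on both objectives at once. This is the behaviour the response describes for successive generations.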

      It is unclear why the excitatory and inhibitory gains of the temporal profiles (Fig. 3I) appear to be gaussian but are formulated as exponential (formula for I_ij^X in Methods).

      The interactions indeed have exponential decay in time. These might appear Gaussian because the axis scale is logarithmic.

      Overall, comparing this model to other possible (similar) models and reporting held-out prediction performance will support the claim that the distance model is a good explanation for trial-to-trial variability.

      See comments above. A key point we want to stress is that we intentionally explored a minimal network model and found that, despite obvious simplifications of the biology, it was nonetheless able to explain multiple aspects of tectal physiology and behaviour. We hope that it inspires future studies and can be extended, in parallel to experimental findings, to more accurately represent the cell-type diversity and cell-specific connectivity of the tectal network.

      Data results: Data results were clear and straightforward. However, the explanation was not given for certain results. For example, the relationship between pre-stimulus linear drive and delta R was weak; the examples in Fig. 4C do not appear to be representative of the other sessions. The example sessions in Fig. 4C have R^2=0.17 and 0.19, the two outliers in the R^2 histogram (Fig. 4D).

      The revised figure 4 is based on new data and new analysis (see below), and the presented examples no longer represent the extreme tail of the distribution (they still, however, represent strong examples, as is now explicitly indicated in the figure legend).

      The black trace in Fig. 4D has large variations (e.g., a linear drive of 25 and 30 have a change in delta R of ~0.1 - greater than the overall change of the dashed line at both ends, ~0.08) but the SEMs are very tight. This suggests that either this last fluctuation is real and a major effect of the data (although not present in Fig. 4C) or the SEM is not conservative enough. No null distribution or statistics were computed on the R^2 distribution (Fig. 4C, blue distribution) to confirm the R^2s are statistically significant and not due to random fluctuations.

      We agree that this was not sufficiently robust and in response to this comment we undertook a significant revision to figure 4 and the associated text:

      i) The revised figure is based on an entirely new dataset, allowing us to verify the results on independent data. We used 5 min ISI for all stimulus presentations, regardless of stimulus type (high or low elevation), thus ensuring that we are only examining differences in state brought about by previous ongoing activity, without risk of ‘contamination’ by evoked activity.

      ii) As per the reviewer’s suggestion, we compared model-estimated pre-stimulus state to a null estimate using randomly sampled time-points. We additionally compared the optimised model with the baseline model. Whereas the null (random times) estimates had no predictive power, both models using pre-stimulus activity were able to explain a fraction of the response residuals with the optimised model performing better.

      iii) We refined the binning process by first computing, for each response, the mean of response residuals across neurons for each bin of estimated linear drive, and then averaging across responses. This prevents the relationship being skewed by rare instances involving unusually large numbers of neurons for a particular linear drive bin, and thereby eliminates the fluctuations the reviewer was referring to.
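
      The two-stage averaging described in point iii) can be sketched as follows. The data layout (a list of responses, each a list of per-neuron pairs of estimated linear drive and response residual), the function name and the bin width are assumptions made for illustration and are not taken from the manuscript.

```python
from collections import defaultdict

def binned_residuals(responses, bin_width):
    """Average residuals per linear-drive bin, weighting each response equally.

    `responses` is a list of responses; each response is a list of
    (linear_drive, residual) pairs, one pair per neuron.
    """
    per_bin_means = defaultdict(list)
    for neurons in responses:
        sums = defaultdict(lambda: [0.0, 0])
        for drive, residual in neurons:
            b = int(drive // bin_width)
            sums[b][0] += residual
            sums[b][1] += 1
        # Stage 1: one mean per bin for this response.
        for b, (total, count) in sums.items():
            per_bin_means[b].append(total / count)
    # Stage 2: average those per-response means across responses.
    return {b: sum(m) / len(m) for b, m in sorted(per_bin_means.items())}

# Toy data: response 0 contributes nine neurons to bin 0, response 1 only one.
responses = [
    [(1.0, 0.2)] * 9 + [(6.0, 0.0)],
    [(2.0, -0.2), (7.0, 0.1)],
]
result = binned_residuals(responses, bin_width=5.0)
print(result)
```

      Pooling all neurons in bin 0 directly would give a mean of 0.16, dominated by the response with many neurons, whereas the two-stage average weights both responses equally and gives a value near zero.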

      The absence of any background activity in Fig. 6B (e.g., during the rest blocks) is confusing, given that in spontaneous activity many bursts and background activity are present (Fig. 2E).

      The raster only presents evoked responses and no background activity is shown. This has been clarified in the revised figure and legend.

      Finally, it appears that the anterior optic tectum contributes to convergent saccades (CS) (Fig. 7E) but no post-saccadic activity is shown to assess how activity changes after the saccade (e.g., plotting activity from 0 to 60).

      Activity before and after the saccade is shown in Fig 7A. Fig 7E shows the ‘linear drive’ (or ‘excitability’), and how it changes leading up to the saccade. Since we were interested in the association between pre-saccade state and saccade-associated activity, we did not plot post-saccadic linear drive. However, as can be seen in the below figure for the reviewer, linear drive is strongly suppressed by the saccade, as expected due to CS-associated activity.

      No explanation is given why activity drops ~30 seconds before a convergent saccade (Fig. 7E).

      This is no longer shown after we trimmed the history data in Fig 7E in accordance with a comment from reviewer 1. We speculate, however, that the mean linear drive of a compact population of neurons would be somewhat periodic, since a high linear drive leads to a burst, which results in prolonged inhibition (low linear drive) with a slow recovery, and so on.

      No statistical test is performed on the R^2 distribution (Fig. 7H) to confirm the R^2s (with a mean close to R^2=0.01) are meaningful and not due to random fluctuations.

      We revised the analysis in Fig 7 along the same lines as the revision of Fig 4. Model-estimated linear drive predicts CS-associated activity whereas a null estimate (random times) shows no such relationship.
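
      The logic of a random-times null estimate can be illustrated with a small simulation. This is a sketch under stated assumptions and not the authors' analysis: the drive trace, event times and residuals are synthetic, the effect size is arbitrary, and the function names are hypothetical. The observed R^2 uses the drive read out at the true event times, while the null distribution uses the drive read out at randomly sampled time points.

```python
import random

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

def null_r2(drive_trace, residuals, n_draws, rng):
    """R^2 values obtained when the drive is read out at random time points."""
    out = []
    for _ in range(n_draws):
        times = [rng.randrange(len(drive_trace)) for _ in residuals]
        out.append(r_squared([drive_trace[t] for t in times], residuals))
    return out

rng = random.Random(0)
# Synthetic drive trace and event times; residuals depend weakly on the drive.
drive_trace = [rng.gauss(0.0, 1.0) for _ in range(5000)]
event_times = rng.sample(range(5000), 200)
residuals = [0.5 * drive_trace[t] + rng.gauss(0.0, 1.0) for t in event_times]

observed = r_squared([drive_trace[t] for t in event_times], residuals)
null = null_r2(drive_trace, residuals, n_draws=500, rng=rng)
# One-sided empirical p-value with the usual +1 correction.
p_value = (1 + sum(v >= observed for v in null)) / (1 + len(null))
print(f"observed R^2 = {observed:.3f}, p = {p_value:.3f}")
```

      A small R^2 can still be meaningful if it clearly exceeds what random time points produce, which is the comparison the revised Figs 4 and 7 are described as making.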

      Presentation: A disjointed part of the paper is that for the first part (Figs. 1-3), the focus is on capturing burst activity, but for the second part (Figs. 4-7), the focus is on trial-to-trial variability with no mention of bursts. It is unclear how the reader should relate the two and if bursts serve a purpose for stimulus-evoked activity.

      In the first part of the paper (Figs. 1-3), we use ongoing activity to develop an understanding (formulated as a network model) of how activity modulates the network state. In the second part, we test this understanding in the context of evoked responses and show that model-estimated network state explains a fraction of visual response variability and experience-dependent changes in activity and behaviour. In the revised MS we further emphasize this idea and have edited the results text to strengthen the connections between these parts of the study. See also comments above.

      Citations: The manuscript may cite other relevant studies in electrophysiology that have investigated noise correlations, such as:

      • Luczak et al., Neuron 2009 (comparing spontaneous and evoked activity).

      • Cohen and Kohn, Nat Neuro 2011 (review on noise correlations).

      • Smith and Kohn, JNeurosci 2008 (looking at correlations over distance).

      • Lin et al., Neuron 2015 (modeling shared variability).

      • Goris et al., Nat Neuro 2014 (check out Fig. 4).

      • Umakantha et al., Neuron 2021 (links noise correlation and dim reduction; includes other recent references to noise correlations).

      We agree that the manuscript could benefit from citing some of these suggested studies and have added citations accordingly.

    1. Author Response

      Reviewer #1 (Public Review):

      It is well established that valuation and value-based decision-making is context-dependent. This manuscript presents the results of six behavioral experiments specifically designed to disentangle two prominent functional forms of value normalization during reward learning: divisive normalization and range normalization. The behavioral and modeling results are clear and convincing, showing that key features of choice behavior in the current setting are incompatible with divisive normalization but are well predicted by a non-linear transformation of range-normalized values.

      Overall, this is an excellent study with important implications for reinforcement learning and decision-making research. The manuscript could be strengthened by examining individual variability in value normalization, as outlined below.

      We thank the Reviewer for the positive appreciation of our work and for the very relevant suggestions. Please find our point-by-point answer below.

      There is a lot of individual variation in the choice data that may potentially be explained by individual differences in normalization strategies. It would be important to examine whether there are any subgroups of subjects whose behavior is better explained by a divisive vs. range normalization process. Alternatively, it may be possible to compute an index that captures how much a given subject displays behavior compatible with divisive vs. range normalization. Seeing the distribution of such an index could provide insights into individual differences in normalization strategies.

      Thank you for pointing this out; there is indeed some variability. To address this, and in line with the Reviewer’s suggestion, we extracted model attributions per participant based on the individual out-of-sample log-likelihood, using the VBA_toolbox in Matlab (Daunizeau et al., 2014). In experiment 1 (presented in the main text), we found that the RANGE model accounted for 79% of the participants, while the DIVISIVE model accounted for 12%. The relative difference was even higher when including the RANGEω model in the model space: the RANGE and RANGEω models accounted for a total of 85% of the participants, while the DIVISIVE model accounted for only 5%.

      In experiment 2 (presented in the supplementary materials), the results were comparable (see Figure 3-figure supplement 3: 73% vs 10%, 83% vs 2%).
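
      The idea of attributing participants to models can be sketched in a deliberately simplified form. The VBA toolbox performs Bayesian model selection, which is more involved than what follows; this sketch simply assigns each participant to the model with the highest out-of-sample log-likelihood and reports the share of participants per model. All numbers are hypothetical and are not data from the study.

```python
from collections import Counter

def attribute_models(oos_ll):
    """Assign each participant to the model with the highest out-of-sample
    log-likelihood and return the share of participants per model."""
    winners = [max(scores, key=scores.get) for scores in oos_ll]
    counts = Counter(winners)
    return {model: counts[model] / len(oos_ll) for model in oos_ll[0]}

# Hypothetical log-likelihoods for four participants (not data from the study).
oos_ll = [
    {"UNBIASED": -70.1, "DIVISIVE": -66.3, "RANGE": -60.2},
    {"UNBIASED": -58.9, "DIVISIVE": -61.0, "RANGE": -59.5},
    {"UNBIASED": -80.4, "DIVISIVE": -72.2, "RANGE": -73.0},
    {"UNBIASED": -65.0, "DIVISIVE": -64.1, "RANGE": -55.7},
]
shares = attribute_models(oos_ll)
print(shares)
```

      A hard assignment like this ignores how close the competing models are for a given participant, which is one reason a probabilistic attribution is preferable in practice.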

      To provide further insights into the behavioral signatures behind inter-individual differences, we plotted the transfer choice rates for each group of participants (best explained by the RANGE, DIVISIVE, or UNBIASED models), and the results are similar to our model predictions from Figure 1C:

      Author Response Image 1. Behavioral data in the transfer phase, split over participants best explained by the RANGE (left), DIVISIVE (middle) or UNBIASED (right) model in experiment 1 (A) and experiment 2 (B) (versions a, b and c were pooled together).

      To keep things concise, we did not include this last figure in the revised manuscript, but it will be available for the interested readers in the Rebuttal letter.

      One possibility currently not considered by the authors is that both forms of value normalization are at work at the same time. It would be interesting to see the results from a hybrid model.

      Thank you for the suggestion, we fitted and simulated a hybrid model as a weighted sum between both forms of normalization:

      First, the HYBRID model quantitatively wins over the DIVISIVE model (oosLLHYB vs oosLLDIV : t(149)=10.19, p<.0001, d=0.41) but not over the RANGE model, which produced a marginally higher log-likelihood (oosLLHYB vs oosLLRAN : t(149)=-1.82, p=.07, d=-0.008). Second, model simulations also suggest that the model would predict a very similar (if not worse) behavior compared to the RANGE model (see figure below). This is supported by the distribution of the weight parameter over our participants: consistent with the model attributions presented above, most participants are best explained by a range-normalization rule (weight > 0.5, 87% of the participants, see figure below). Together, these results favor the RANGE model over the DIVISIVE model in our task.
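
      The weighted-sum idea can be illustrated with a short sketch. The hybrid equation itself is not reproduced in this response, so the functional forms below are assumptions: range normalization is taken as the position of a reward within the range of its context, divisive normalization as the reward divided by the sum of the context rewards, and the weight is taken to multiply the range term, in keeping with the statement that weights above 0.5 favor range normalization. The option values are illustrative only.

```python
def range_norm(reward, context):
    """Range normalization: position of the reward within the context's range."""
    lo, hi = min(context), max(context)
    return (reward - lo) / (hi - lo)

def divisive_norm(reward, context):
    """Divisive normalization: reward divided by the sum of context rewards."""
    return reward / sum(context)

def hybrid_norm(reward, context, weight):
    """Weighted sum; weight = 1 is pure range, weight = 0 is pure divisive."""
    return weight * range_norm(reward, context) + (1 - weight) * divisive_norm(reward, context)

# Illustrative contexts: adding a middle option leaves the range unchanged
# but increases the divisive denominator.
binary, trinary = [14, 50], [14, 32, 50]
for w in (0.0, 0.5, 1.0):
    print(w, hybrid_norm(50, binary, w), hybrid_norm(50, trinary, w))
```

      Under these assumed forms, the best option keeps the same range-normalized value in both contexts but loses divisive-normalized value when a third option is added, so the weight controls how strongly the number of options affects the learned value.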

      Out of curiosity, we also implemented a hybrid model as a weighted sum between absolute (UNBIASED model) and relative (RANGE model) valuations.

      Model fitting, simulations and comparisons slightly favored this hybrid model over the UNBIASED model (oosLLHYB vs oosLLUNB: t(149)=2.63, p=.0094, d=0.15), but also drastically favored the range normalization account (oosLLHYB vs oosLLRAN : t(149)=-3.80, p=.00021, d=-0.40, see Author Response Image 2).

      Author Response Image 2. Model simulations in the transfer phase for the RANGE model (left) and the HYBRID model (middle) defined as a weighted sum between divisive and range forms of normalization (top) and between unbiased (no normalization) and range normalization (bottom). The HYBRID model features an additional weight parameter, whose distribution favors the range normalization rule (right).

      To keep things concise, we did not include this last figure in the revised manuscript, but it will be available for the interested readers in the Rebuttal letter.

      Reviewer #2 (Public Review):

      This paper studies how relative values are encoded in a learning task, and how they are subsequently used to make a decision. This is a topic that integrates multiple disciplines (psych, neuro, economics) and has generated significant interest. The experimental setting is based on previous work from this research team that has advanced the field's understanding of value coding in learning tasks. These experiments are well-designed to distinguish some predictions of different accounts for value encoding. However there is an additional treatment that would provide an additional (strong) test of these theories: RN would make an equivalent set of predictions if the range were equivalently adjusted downward instead (for example by adding a "68" option to "50" and "86", and then comparing to WB and WT). The predictions of DN would differ however because adding a low-value alternative to the normalization would not change it much. Would the behaviour of subjects be symmetric for equivalent ranges, as RN predicts? If so this would be a compelling result, because symmetry is a very strong theoretical assumption in this setting.

      We thank the Reviewer for the overall positive appraisal of our work, and for the stimulating and constructive remarks that we have addressed below. At this stage, we just want to mention that we agree with the Reviewer that a design in which we add a "68" option to "50" and "86" would also represent an important test of our hypotheses. This is why we had, in fact, run this experiment. Unfortunately, its results were somewhat buried in the Supplementary Materials of our original submission and not properly highlighted in the main text. We modified the manuscript to make them more visible:

      Behavioral results in three experiments (N=50 each) featuring a slightly different design, where we added a mid value option (NT68) between NT50 and NT87 converge to the same broad conclusion: the behavioral pattern in the transfer phase is largely incompatible with that predicted by outcome divisive normalization during the learning phase (Figure 2-figure supplement 2).

      Reviewer #3 (Public Review):

      Bavard & Palminteri extend their research program by devising a task that enables them to dissociate two types of normalisation: range normalisation (by which outcomes are normalised by the min and max of the options) and divisive normalisation (in which outcomes are normalised by the average of the options in one's context). By providing 4 different training contexts in which the range of outcomes and number of options vary, they successfully show using 'ex ante' simulations that different learning approaches during training (unbiased, divisive, range) should lead to different patterns of choice in a subsequent probe phase during which all options from the training are paired with one another, generating novel choice pairings. These patterns are somewhat subtle but are elegantly unpacked. They then fit participants' training choices to different learning models and test how well these models predict probe phase choices. They find evidence - both in terms of quantitative (i.e. comparing out-of-sample log-likelihood scores) and qualitative (comparing the pattern of choices observed to the pattern that would be observed under each model) fit - for the range model. This fit is further improved by adding a power parameter, which suggests that alongside being relativised via range normalisation, outcomes were also transformed non-linearly.

      I thought this approach to address their research question was really successful and the methods and results were strong, credible, and robust (owing to the number of experiments conducted, the design used and combination of approaches used). I do not think the paper has any major weaknesses. The paper is very clear and well-written which aids interpretability.

      This is an important topic for understanding, predicting, and improving behaviour in a range of domains potentially. The findings will be of interest to researchers in interdisciplinary fields such as neuroeconomics and behavioural economics as well as reinforcement learning and cognitive psychology.

      We thank Prof. Garrett for his positive evaluation and supportive attitude.

    1. Author Response

      Reviewer #1 (Public Review):

      While the mechanisms of arms races between plants and specialist herbivores, such as the detoxification of specific secondary metabolites, have been studied, the mechanisms underlying the wider diet breadth of so-called generalist herbivores have received less attention. Because of the heterogeneity of host plant species, experimental validation of the phylogenetic generalism of herbivores has seemed hard to conduct. The authors state the two major hypotheses about large diet breadth ("metabolic generalism" and "multi-host metabolic specialism") and carefully designed an experiment using Drosophila suzukii as a model herbivore species.

      By an untargeted metabolomics approach using UHPLC-MS, authors attempted to falsify the hypotheses both in qualitative- and quantitative metabolomic profiles. Intersections of four fruit (puree) samples and each diet-based fly individual samples from the qualitative data revealed that there were few ions that occur as the specific metabolite in each diet-based fly group, which could reject the "multi-host metabolic specialism" hypothesis. Quantitative data also showed results that could support the "metabolic generalism" hypothesis. Therefore, the wide diet breadth of D. suzukii seemed to be derived from the general metabolism rather than the adaptive traits of the diverse host plant species. On the other hand, the reduction of the metabolites (ions) set using GLM seemed logical and 2-D clustering from the reduced ions set showed that quantitative aspects of diet-associated ions could classify "what the flies ate". These interesting results could enhance the understanding of the diet breadth (niche) of herbivorous insects.

      The authors' approach seemed clear to falsify the hypotheses based on the appropriate data processing. The intersection of shared ions from the qualitative dataset could distinguish the diet-specific metabolites in flies and commonly occurring metabolites among flies and/or fruits. Also, filtering on the diet-specific ions seemed to be a logical and appropriate way. Meanwhile, the discussion about the results seemed to be focused on different points regarding the research hypotheses which were raised in the introduction part. Discussion about the results mainly focused on the metabolism of D. suzukii itself, rather than the research hypotheses and questions that were raised from the evolution of the wide diet breadth of generalist herbivores. In particular, the conclusion seems to be far from the main context of the authors' research; e.g. frugivory. It makes the implication of the study weaker.

      We wish to thank Reviewer #1 for their appreciation of our study. As recommended, we now focus our discussion more on the general aspect of our findings (relevant to insects, herbivores, or frugivores), and less on the peculiarities of the metabolism of D. suzukii itself. Specifically, we now only mention D. suzukii in one section (two sentences) of our Discussion, to serve as an example (l.387-396). Thanks to this comment, the Discussion may interest a broader readership, on the evolution of diet breadth in generalist herbivorous species and offers a better understanding of the general implications of our findings.

      Reviewer #2 (Public Review):

      The manuscript: "Metabolic consequences of various fruit-based diets in a generalist insect species" by Olazcuaga et al., addresses an interesting question. Using an untargeted metabolomics approach, the authors study how diet generalism may have evolved versus diet specialization which is generally more commonly observed, at least in Drosophila species. Using the phytophagous species Drosophila suzukii, and by directly comparing the metabolomes of fruit purees and the flies that fed on them, the authors found evidence for "metabolic generalism". Metabolic generalism means that individuals of a generalist species process all types of diet in a similar way, which is in contrast to "multi-host metabolic specialism" which entails the use of specific pathways to metabolize unique compounds of different diets. The authors find strong evidence for the first hypothesis, as they could easily detect the signature of each fruit diet in the flies. The authors then go on to speculate on the evolutionary ramifications of this for how potentially diet specializations may have evolved from diet generalism. Overall, the paper is well written, the experiments well documented, and the conclusions convincing.

      We thank Reviewer #2 for their comments and appreciation of our work.

      Reviewer #3 (Public Review):

      Laure Olazcuaga et al. investigated the metabolomes of four fruit-based diets and corresponding individuals of Drosophila suzukii that were reared on them using comparative metabolomics analysis. They observed that the four fruit-based diets are metabolically dissimilar. On the contrary, flies that fed on them are mostly similar in their metabolic response. From a quantitative point of view, they find that part of the fly metabolomes correlates well with that of the corresponding diet metabolomes, which is indicative of insect ingestive history. By further focusing on 71 metabolites derived from diet-specific fly ions and highly abundant fruit ions, the authors show that D. suzukii differentially accumulates diet metabolites in a compound-specific manner. The authors claim that the data support the metabolic generalism hypothesis while rejecting the multi-host metabolic specialism hypothesis. This study provides a valuable global chemical comparison of how diverse diet metabolites are processed by a generalist insect species.

      Strengths:

      The rapid advances in high-resolution mass spectrometry have recently accelerated the discovery of many novel post-ingestive compounds through comparative metabolomics analysis of insect/frass and plant samples. Untargeted metabolomics is thus a very powerful approach for the systematic comparison of global chemical shifts when diverse plant-derived specialized metabolites are further modified or quantitatively metabolized after ingestion by insects. The technique can be readily extended to a larger micro- or macro-evolutionary context for both generalist and specialist insects to systematically investigate how plant chemical diversity contributes to dietary generalism and specialism.

      We would like to thank Reviewer #3 for their insightful comments on the power of untargeted metabolomics to evaluate the fate of plant metabolites and their use by herbivores. We also agree that these techniques can be used to tackle eco-evolutionary issues, such as the origin and maintenance of dietary generalism and specialism here. We hope that our study will inspire other researchers to explore such techniques and experiments to gain a global overview of biochemistry fluxes and their evolution. We now mention it in the conclusion (L454-459).

      Weaknesses:

      The authors claim that their data support the hypothesis of metabolic generalism, however, a total analysis of insect metabolism may not generate a clean dataset for direct comparison of fruit-derived metabolites with those metabolized by D. suzukii, given that much of these metabolites would be "diluted" proportionally by insect-derived metabolites. If the insect-derived metabolites predominate, then, as the authors observed, a tight clustering of D. suzukii metabolomes in the PCA plot would be expected. It is therefore very difficult to interpret these patterns.

      We agree with Reviewer #3 that a careful examination of the different possible origins of metabolites should take place to distinguish between our two competing hypotheses.

      The only source of metabolites for insects in our experimental setup is a mixture of (i) a large proportion of fruit purees and (ii) a minor proportion of artificial medium consisting mainly of yeast. Our goal is thus to understand the fate of (i) “fruit-derived” metabolites (transformed and untransformed), while controlling for (ii) “artificial media-derived” metabolites, that constitute a nuisance signal but are necessary for a complete development in our system.

      By “fruit-derived” and “insect-derived” metabolites, it is our understanding that Reviewer #3 means “fruit” metabolites (when in insects, untransformed “fruit-derived” metabolites) and “artificial medium-derived” metabolites. It is true that we do wish to avoid a predominance of “artificial medium-derived” metabolites and focus on “fruit-derived” metabolites in insects. We also want to note that it is of primary importance in our study to distinguish between “fruit” metabolites that are carried as is (“fruit” metabolites present in insects, i.e., untransformed “fruit-derived” metabolites), and “fruit” metabolites that are used after transformation by the insect (i.e., transformed “fruit-derived” metabolites).

      We agree with Reviewer #3 that the presence of “artificial medium-derived” metabolites could be problematic in direct comparisons of fruits and insects (though not in comparisons among fruits or among insects).

      However, we took some steps to avoid such problems:

      1. We included control fly samples in our experiment: at each experimental generation, flies developed only on artificial medium (without fruit puree) were collected and processed simultaneously with flies that developed on fruit media. Results using these artificial medium-reared flies as controls (by subtracting their ion levels and removing ions that were similar, respective of their generation) were similar to results using raw data and conclusions were identical (see below).

      2. We lowered the proportion of artificial medium in our fruit media so that it was kept to a minimum, compatible with larval development and adult survival.

      Consistent with the low impact of this “artificial medium” component on our conclusions, we also wish to point out the presence pattern of metabolites found only in flies and never in fruits when using raw data (Figure 3, yellow stack). Even in the most conservative hypothesis of 100% of these metabolites originating from our artificial medium (which is probably not the case), we observe that it constitutes only a minor proportion of metabolites common to all flies (15.7%).

      For your consideration, we include below the main Figures, using both raw data and artificial medium-controlled:

      Figure 2, left = raw data; right = artificial-media controlled:

      Figure 3, left = raw data; right = artificial-media controlled:

      Figure 3S1, left = raw data; right = artificial-media controlled:

      Figure 4, above = raw data; below = artificial-media controlled:

      We hope that we convinced the Editor/Reviewers that raw data and artificial-medium controlled data provide one and the same answer in all our analyses. We chose to present only raw data, to simplify the Materials & Methods section.

      We have, however, modified the current version of the manuscript to inform the reader that proper controls were done and that their inclusion does not modify any of our conclusions (l.110-113 and l.583-589).

      We also wish to make three additional comments:

      • As Reviewer #1 also recommended, we modified the expectations drawn in Fig1G to better consider the general comment of “insect derived” metabolites being fundamentally different from plant metabolites (even if we do show in our study that only approx. 9% of metabolites are private to flies).

      • The main part of our care in the use of this global PCA analysis is that it follows two other analyses (global intersection and comparison of intersections among fruits and among flies) and precedes another one (fly-focused PCA). We hope that all these analyses help the readers get a comprehensive overview of the dataset and associated results, avoiding reliance on a single analysis.

      • We also help readers to explore and visualize all analyses presented in our manuscript by setting up a shiny application (in addition to our available dataset and R code), at https://fruitfliesmetabo.shinyapps.io/shiny/. This is now mentioned in the main text (l.588-589).

      We thank the Reviewer for their comment that greatly improved the manuscript.

      The authors generated a qualitative dataset using the peak list produced by XCMS, which contains quantitative peak areas. It is unclear how the threshold was selected to determine if a peak is present or absent in a given sample. The qualitative dataset would influence the output of their data analysis.

      The referee is right in pointing out that the threshold used to determine if a peak is present or absent in a given sample was not clearly specified. This has now been corrected in the “Host use” section of the Materials & Methods (l.513-516). Briefly, a given replicate of a compound was considered present if the corresponding peak area following XCMS quantification was > 1000. This threshold was selected to be close to the practical quantification threshold of the Thermo Exactive mass spectrometer used in this study. This threshold was selected in order to allow the quantification of low-abundance compounds, as many plant-derived diet compounds were expected to be present in trace amounts in flies. We additionally applied a stringent rule for presence of any given compound (presence in at least 3 biological replicates).
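
      The presence rule described above can be sketched as follows. The two cut-offs (peak area > 1000, at least 3 biological replicates) come from the text; the function name, the ion names, and the peak areas are hypothetical and serve only to illustrate how a quantitative peak list might be converted into a presence/absence call.

```python
AREA_THRESHOLD = 1000  # peak area above which a replicate counts as "present"
MIN_REPLICATES = 3     # minimum number of replicates that must exceed the threshold


def is_present(peak_areas, area_threshold=AREA_THRESHOLD, min_replicates=MIN_REPLICATES):
    """Return True if enough replicates have a peak area above the threshold."""
    n_above = sum(1 for area in peak_areas if area > area_threshold)
    return n_above >= min_replicates


# Hypothetical peak areas for two ions across five biological replicates.
ions = {
    "ion_A": [1500, 2300, 900, 4100, 1200],
    "ion_B": [1200, 800, 950, 0, 3000],
}
presence = {name: is_present(areas) for name, areas in ions.items()}
print(presence)
```

      Note that the comparison is strict, so a peak area of exactly 1000 does not count as present in this sketch.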

      The authors rely on in-source fragmentation for peak annotation when authentic standards are not available. The accuracy of the annotation thus requires further validation.

      Supplementary Table 1 was unfortunately omitted from the first submission of the manuscript. This oversight has now been corrected, and Supplementary Table 1 details all information used for metabolite annotation. In particular, MS/MS data comparisons with mass spectral databases as well as with published literature have been added to substantiate metabolite identifications. These MS/MS data were produced thanks to the comment of the Reviewer. We also provide four more annotations from standards, reaching 30 / 71 identifications validated through chemical standards.

    1. Author Response

      Reviewer #1 (Public Review):

      Overall, the science is sound and interesting, and the results are clearly presented. However, the paper falls in-between describing a novel method and studying biology. As a consequence, it is a bit difficult to grasp the general flow, central story and focus point. The study does uncover several interesting phenomena, but none are really studied in much detail and the novel biological insight is therefore a bit limited and lost in the abundance of observations. Several interesting novel interactions are uncovered, in particular for the SPS sensor and GAPDH paralogs, but these are not followed up on in much detail. The same can be said for the more general observations, eg the fact that different types of mutations (missense vs nonsense) in different types of genes (essential vs non-essential, housekeeping vs. stress-regulated...) cause different effects.

      This is not to say that the paper has no merit - far from it even. But, in its current form, it is a bit chaotic. Maybe there is simply too much in the paper? To me, it would already help if the authors would explicitly state that the paper is a "methods" paper that describes a novel technique for studying the effects of mutations on protein abundance, and then goes on to demonstrate the possibilities of the technology by giving a few examples of the phenomena that can be studied. The discussion section ends in this way, but it may be helpful if this was moved to the end of the introduction.

      We modified the manuscript as suggested.

      Reviewer #2 (Public Review):

      Schubert et al. describe a new pooled screening strategy that combines protein abundance measurements of 11 proteins determined via FACS with genome-wide mutagenesis of stop codons and missense mutations (achieved via a base editor) in yeast. The method allows the identification of genetic perturbations that affect steady state protein levels (vs transcript abundance), and in this way defines regulators of protein abundance. The authors find that perturbation of essential genes more often alters protein abundance than of nonessential genes and proteins with core cellular functions more often decrease in abundance in response to genetic perturbations than stress proteins. Genes whose knockouts affected the level of several of the 11 proteins were enriched in protein biosynthetic processes while genes whose knockouts affected specific proteins were enriched for functions in transcriptional regulation. The authors also leverage the dataset to confirm known and identify new regulatory relationships, such as a link between the SPS amino acid sensor and the stress response gene Yhb1 or between Ras/PKA signalling and GAPDH isoenzymes Tdh1, 2, and 3. In addition, the paper contains a section on benchmarking of the base editor in yeast, where it has not been used before.

      Strengths and weaknesses of the paper

      The authors establish the BE3 base editor as a screening tool in S. cerevisiae and very thoroughly benchmark its functionality for single edits and in different screening formats (fitness and FACS screening). This will be very beneficial for the yeast community.

      The strategy established here allows measuring the effect of genetic perturbations on protein abundances in highly complex libraries. This complements capabilities for measuring effects of genetic perturbations on transcript levels, which is important as for some proteins mRNA and protein levels do not correlate well. The ability to measure proteins directly therefore promises to close an important gap in determining all their regulatory inputs. The strategy is furthermore broadly applicable beyond the current study. All experimental procedures are very well described and plasmids and scripts are openly shared, maximizing utility for the community.

      There is a good balance between global analyses aimed at characterizing properties of the regulatory network and more detailed analyses of interesting new regulatory relationships. Some of the key conclusions are further supported by additional experimental evidence, which includes re-making specific mutations and confirming their effects on protein levels by mass spectrometry.

      The conclusions of the paper are mostly well supported, but I am missing some analyses on reproducibility and potential confounders and some of the data analysis steps should be clarified.

      The paper starts on the premise that measuring protein levels will identify regulators and regulatory principles that would not be found by measuring transcripts, but since the findings are not discussed in light of studies looking at mRNA levels it is unclear how the current study extends knowledge regarding the regulatory inputs of each protein.

      See response to Comment #10.

      Specific comments regarding data analysis, reproducibility, confounders

      1) The authors use the number of unique barcodes per guide RNA rather than barcode counts to determine fold-changes. For reliable fold changes the number of unique barcodes per gRNA should then ideally be in the 100s for each guide, is that the case? It would also be important to show the distribution of the number of barcodes per gRNA and their abundances determined from read counts. I could imagine that if the distribution of barcodes per gRNA or the abundance of these barcodes is highly skewed (particularly if there are many barcodes with only few reads) that could lead to spurious differences in unique barcode number between the high and low fluorescence pool. I imagine some skew is present as is normal in pooled library experiments. The fold-changes in the control pools could show whether spurious differences are a problem, but it is not clear to me if and how these controls are used in the protein screen.

      Because of the large number of screens performed in this study (11 proteins, with 8 replicates for each) we had to trade off sequencing depth and power against cell sorting time and sequencing cost, resulting in lower read and barcode numbers than what might be ideally aimed for. As described further in the response to Comment #5, we added a new figure to the manuscript that shows that the correlation of fold-changes between replicates is high (Figure 3–S1A). The second figure below shows that the correlation between the number of unique barcodes and the number of reads per gRNA is highly significant (p < 2.2e-16).
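
      To make the distinction between the two quantities concrete, the sketch below counts, for each gRNA, both the number of reads and the number of unique plasmid barcodes. It assumes that each sequencing read has already been parsed into a (gRNA, barcode) pair; the function name and the example reads are hypothetical.

```python
from collections import Counter, defaultdict


def summarize_reads(reads):
    """Return per-gRNA read counts and unique-barcode counts from (gRNA, barcode) pairs."""
    read_counts = Counter(grna for grna, _ in reads)
    barcodes = defaultdict(set)
    for grna, barcode in reads:
        barcodes[grna].add(barcode)  # a barcode seen many times is stored once
    unique_counts = {grna: len(codes) for grna, codes in barcodes.items()}
    return dict(read_counts), unique_counts


# Hypothetical parsed reads: gRNA_1 is seen three times, but with only two barcodes.
reads = [
    ("gRNA_1", "AAAC"),
    ("gRNA_1", "AAAC"),
    ("gRNA_1", "GGTT"),
    ("gRNA_2", "CCAT"),
]
read_counts, unique_counts = summarize_reads(reads)
print(read_counts)
print(unique_counts)
```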

      2) I like the idea of using an additional barcode (plasmid barcode) to distinguish between different cells with the same gRNA - this would directly allow to assess variability and serve as a sort of replicate within replicate. However, this information is not leveraged in the analysis. It would be nice to see an analysis of how well the different plasmid barcodes tagging the same gRNA agree (for fitness and protein abundance), to show how reproducible and reliable the findings are.

      We agree with the reviewer that this would be nice to do in principle, but our sequencing depth for the sorted cell populations was not high enough to compare the same barcode across the low/unsorted/high samples. See also our response to Comment #5 for the replicate analyses.

      3) From Fig 1 and previous research on base editors it is clear that mutation outcomes are often heterogeneous for the same gRNA and comprise a substantial fraction of wild-type alleles, alleles where only part of the Cs in the target window or where Cs outside the target window are edited, and non C-to-T edits. How does this reflect on the variability of phenotypic measurements, given that any barcode represents a genetically heterogeneous population of cells rather than a specific genotype? This would be important information for anyone planning to use the base editor in future.

      We agree with the reviewer that the heterogeneity of editing outcomes is an important point to keep in mind when working with base editors. In genetic screens, like the ones described here, often the individual edit is less important, and the overall effects of the base editor are specific/localized enough to obtain insights into the effects of mutations in the area where the gRNA targets the genome. For example, in our test screens for Canavanine resistance and fitness effects, in which we used gRNAs predicted to introduce stop codons into the CAN1 gene and into essential genes, respectively, we see the expected loss-of-function effect for a majority of the gRNAs (canavanine screen: expected effect for 67% of all gRNAs introducing stop codons into CAN1; fitness screen: expected effect for 59% of all gRNAs introducing stop codons into essential genes) (Figure 2). In the canavanine screen, we also see that gRNAs predicted to introduce missense mutations at highly conserved residues are more likely to lead to a loss-of-function effect than gRNAs predicted to introduce missense mutations at less conserved residues, further highlighting the differentiated results that can be obtained with the base editor despite the heterogeneity in editing outcomes overall. We would certainly advise anyone to confirm by sequencing the base edits in individual mutants whenever a precise mutation is desired, as we did in this study when following up on selected findings with individual mutants.

      4) How common are additional mutations in the genome of these cells and could they confound the measured effects? I can think of several sources of additional mutations, such as off-target editing, edits outside the target window, or when 2 gRNA plasmids are present in the same cell (both target windows obtain edits). Could some of these events explain the discrepancy in phenotype for two gRNAs that should make the same mutation (Fig S4)? Even though BE3 has been described in mammalian cells, an off-target analysis would be desirable as there can be substantial differences in off-target behavior between cell types and organisms.

      Generally, we are not very concerned about random off-target activity of the base editor because we would not expect this to cause a consistent signal that would be picked up in our screen as a significant effect of a particular gRNA. Reproducible off-target editing with a specific gRNA at a site other than the intended target site would be problematic, though. We limited the chance of this happening by not using gRNAs that may target similar sequences to the intended target site in the genome. Specifically, we excluded gRNAs that have more than one target in the genome when the 12 nucleotides in the seed region (directly upstream of the PAM site) are considered (DiCarlo et al., Nucleic Acids Research, 2013).
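
      The seed-based exclusion rule can be illustrated with a simplified sketch. It assumes an NGG PAM, exact matching of the 12-nucleotide seed (no mismatches), and a genome given as a single string; the function names and the toy sequences are hypothetical, and this is not the filtering code used in the study.

```python
import re

SEED_LENGTH = 12  # nucleotides directly upstream of the PAM


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_seed_targets(guide, genome):
    """Count sites on either strand where the guide's seed is followed by an NGG PAM."""
    seed = guide[-SEED_LENGTH:]
    # A lookahead is used so that overlapping sites are counted as well.
    pattern = re.compile("(?=" + seed + "[ACGT]GG)")
    strands = (genome, reverse_complement(genome))
    return sum(len(pattern.findall(strand)) for strand in strands)


def has_unique_seed(guide, genome):
    """Keep a guide only if its seed plus PAM occurs exactly once in the genome."""
    return count_seed_targets(guide, genome) == 1


# Toy example: the same guide against a genome with one site and one with two sites.
guide = "GATTACAGATTACAGATCCA"
genome_single = "TTTT" + guide + "AGG" + "TTTT"
genome_double = genome_single + "CACA" + guide[-SEED_LENGTH:] + "CGG" + "TT"
print(has_unique_seed(guide, genome_single))
print(has_unique_seed(guide, genome_double))
```

      A guide whose seed matches a second PAM-adjacent site, as in the second toy genome, would be excluded under this rule even though the rest of its sequence differs at that site.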

      We do observe some off-target editing right outside the target window, but generally at much lower frequency than the on-target editing in the target window (Figure 1B and Figure 1–S2). Since for most of our analyses we grouped perturbations per gene, such off-target edits should not affect our findings. In addition, we validated key findings with independent experiments. For our study, we used the Base Editor v3 (Komor et al., Nature, 2016); more recently, additional base editors have been developed that show improved accuracy and efficiency, and we would recommend these base editors when starting a new study (see, e.g., Anzalone et al., Nature Biotechnology, 2020).

      We are not concerned about cases in which one cell gets two gRNAs, since the chance that the same two gRNAs end up in one cell repeatedly is low, and such events would therefore not result in a significant signal in our screens.

      We don’t think that off-target mutations can explain the discrepancy between pairs of gRNAs that should introduce the same mutation (Figure 3–S1). The effect of the two gRNAs is actually well-correlated, but, often, one of the two gRNAs doesn’t pass our significance cut-off or simply doesn’t edit efficiently (i.e., most discrepancies arise from false negatives rather than false positives). We may therefore miss the effects of some mutations, but we are unlikely to draw erroneous conclusions from significant signals.

      5) In the protein screen normalization uses the total unique barcode counts. Does this efficiently correct for differences from sequencing (rather than total read counts or other methods)? It would be nice to see some replicate plots for the analysis of the fitness as well as the protein screen to be able to judge that.

      We made a new figure that shows a replicate comparison for the protein screen (see below; in the manuscript it is Figure 3–S1A) and commented on it in the manuscript. For this analysis, the eight replicates for each protein were split into two groups of four replicates each and analyzed the same way as the eight replicates. The correlation between the two groups of replicates is highly significant (p < 2.2e-16). The second figure shows that the total number of reads and the total number of unique barcodes are well correlated.

      For the fitness screen, we used read counts rather than barcode counts for the analysis since read counts better reflect the dropout of cells due to reduced fitness. The figure below shows a replicate comparison for the fitness screen. For this analysis, the four replicates were split into two groups of two replicates each and analyzed the same way as the four replicates. The correlation between the two groups of replicates is highly significant (p < 2.2e-16).

      6) In the main text the authors mention very high agreement between gRNAs introducing the same mutation but this is only based on 20 or so gRNA pairs; for many more pairs that introduce the same mutation only one reaches significance, and the correlation in their effects is lower (Fig S4). It would be better to reflect this in the text directly rather than exclusively in the supplementary information.

      We clarified this in the manuscript main text: “For 78 of these gRNA pairs, at least one gRNA had a significant effect (FDR < 0.05) on at least one of the eleven proteins; their effects were highly correlated (Pearson’s R2 = 0.43, p < 2.2E-16) (Figure 3–S1B). For the 20 gRNA pairs for which both gRNAs had a significant effect, the correlation was even higher (Pearson’s R2 = 0.819, p = 8.8e-13) (Figure 3–S1C). These findings show that the significant gRNA effects that we identify have a low false positive rate, but they also suggest that many real gRNA effects are not detected in the screen due to limitations in statistical power.”

      7) When the different gRNAs for a targeted gene are combined, instead of using an averaged measure of their effects the authors use the largest fold-change. This seems not ideal to me as it is sensitive to outliers (experimental error or background mutations present in that strain).

      We agree that the method we used is more sensitive to outliers than averaging per gene. However, because many gRNAs have no effect either because they are not editing efficiently or because the edit doesn’t have a phenotypic consequence, an averaging method across all gRNAs targeting the same gene would be too conservative and not properly capture the effect of a perturbation of that gene.
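The trade-off described above can be illustrated with a minimal sketch. It assumes each gRNA effect is expressed as a log2 fold-change and that the "largest fold-change" means the value with the largest magnitude; the function name `summarize_gene` and the numbers are hypothetical and are not taken from the study.

```python
from statistics import mean


def summarize_gene(log2_fold_changes):
    """Return (strongest, average) effect across the gRNAs of one gene.

    'strongest' is the fold-change with the largest absolute value,
    'average' is the plain arithmetic mean over all gRNAs.
    """
    strongest = max(log2_fold_changes, key=abs)
    average = mean(log2_fold_changes)
    return strongest, average


# Hypothetical gene targeted by five gRNAs, only one of which
# edits efficiently and changes the phenotype.
example = [0.02, -0.05, 1.40, 0.01, -0.03]
strongest, average = summarize_gene(example)
print(f"strongest = {strongest:.2f}, mean = {average:.2f}")
```

In this toy case the single effective gRNA is diluted by the four inactive ones when averaging, which is the conservatism the response refers to; the same property makes the maximum sensitive to a single outlier.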

      8) Phenotyping is performed directly after editing, when the base editor is still present in the cells and could still interact with target sites. I could imagine this could lead to reduced levels of the proteins targeted for mutagenesis as it could act like a CRISPRi transcriptional roadblock. Could this enhance some of the effects or alter them in case of some missense mutations?

      To reduce potential “CRISPRi-like” effects of the base editor on gene expression, we placed the base editor under a galactose-inducible promoter. For both the fitness and protein screens we grew the cultures in media without galactose for another 24 hours (fitness screen) or 8-9 hours (protein screens) before sampling. In the latter case, this recovery time corresponded to more than three cell divisions, after which we assume base editor levels to have strongly decreased, and therefore to no longer interfere with transcription. This is also supported by our ability to detect discordant effects of gRNAs targeting the same gene (e.g., the two mutations leading to loss-of-function and gain-of-function of RAS2), which would otherwise be overshadowed by a CRISPRi effect.

      9) I feel that the main text does not reflect the actual editing efficiency very well (the main numbers I noticed were 95% C to T conversion and 89% of these occurring in a specific window). More informative for interpreting the results would be to know what fraction of the alleles show an edit (vs wild-type) and how many show the 'complete' edit (as the authors assume 100% of the genotypes generated by a gRNA to be conversion of all Cs to Ts in the target window). It would be important to state in the main text how variable this is for different gRNAs and what the typical purity of editing outcomes is.

      We now show the editing efficiency and purity in a new figure (Figure 1B), and discuss it in the main text as follows: “We found that the target window and mutagenesis pattern are very similar to those described in human cells: 95% of edits are C-to-T transitions, and 89% of these occurred in a five-nucleotide window 13 to 17 base pairs upstream of the PAM sequence (Figure 1A; Figure 1–S2) (Komor et al., 2016). Editing efficiency was variable across the eight gRNAs and ranged from 4% to 64% if considering only cases where all Cs in the window are edited; percentages are higher if incomplete edits are considered, too (Figure 1B).”

      Comments regarding findings

      10) It would be nice to see a comparison of the results to the effects of ~1500 yeast gene knockouts on cellular transcriptomes (https://doi.org/10.1016/j.cell.2014.02.054). This would show where the current study extends established knowledge regarding the regulatory inputs of each protein and highlight the importance of directly measuring protein levels. This would be particularly interesting for proteins whose abundance cannot be predicted well from mRNA abundance.

      We agree with the reviewer that it would be very interesting to compare the effect of perturbations on mRNA vs protein levels. We have compared our protein-level data to mRNA-level data from Kemmeren and colleagues (Kemmeren et al., Cell 2014), and we find very good agreement between the effects of gene perturbations on mRNA and protein levels when considering only genes with q < 0.05 and Log2FC > 0.5 in both studies (Pearson’s R = 0.79, p < 5.3e-15).

      Gene perturbations with effects detected only on mRNA but not protein levels are enriched in genes with a role in “chromatin organization” (FDR = 0.01; as a background for the analysis, only the 1098 genes covered in both studies were considered). This suggests that perturbations of genes involved in chromatin organization tend to affect mRNA levels but are then buffered and do not lead to altered protein levels. There was no enrichment of functional annotations among gene perturbations with effects on protein levels but not mRNA levels.

      We did not include these results in the manuscript because there are some limitations to the conclusions that can be drawn from these comparisons, including that our study has a relatively high number of false negatives, and that the genes perturbed in the Kemmeren et al. study were selected to play a role in gene regulation, meaning that differences in mRNA-vs-protein effects of perturbations are limited to this function, and other gene functions cannot be assessed.

      11) The finding that genes that affect only one or two proteins are enriched for roles in transcriptional regulation could be a consequence of 'only' looking at 10 proteins rather than a globally valid conclusion. Particularly as the 10 proteins were selected for diverse functions that are subject to distinct regulatory cascades. ('only' because I appreciate this was a lot of work.)

      We agree with this, and we think it is clear in the abstract and the main text of the manuscript that here we studied 11 proteins. We made this point also more explicit in the discussion, so that it is clear for readers that the findings are based on the 11 proteins and may not extrapolate to the entire yeast proteome.

      Reviewer #3 (Public Review):

      This manuscript presents two main contributions. First, the authors modified a CRISPR base editing system for use in an important model organism: budding yeast. Second, they demonstrate the utility of this system by using it to conduct an extremely high-throughput study of the effects of mutation on protein abundance. This study confirms known protein regulatory relationships and detects several important new ones. It also reveals trends in the type of mutations that influence protein abundances. Overall, the findings are of high significance and the method appears to be extremely useful. I found the conclusions to be justified by the data.

      One potential weakness is that some of the methods are not described in the main body of the paper, so the reader has to really dive into the methods section to understand particular aspects of the study, for example, how the fitness competition was conducted.

      We expanded the first section for better readability.

      Another potential weakness is the comparison of this study (of protein abundances) to previous studies (of transcript abundances) was a little cursory, and left some open questions. For example, is it remarkable that the mutations affecting protein abundance are predominantly in genes involved in translation rather than transcription, or is this an expected result of a study focusing on protein levels?

      We thank the reviewer for pointing out that this paragraph requires more explanation. We expanded it as follows: “Of these 29 genes, 21 (72%) have roles in protein translation—more specifically, in ribosome biogenesis and tRNA metabolism (FDR < 8.0e-4, Figure 5C). In contrast, perturbations that affect the abundance of only one or two of the eleven proteins mostly occur in genes with roles in transcription (e.g., GO:0006351, FDR < 1.3e-5). Protein biosynthesis entails both transcription and translation, and these results suggest that perturbations of translational machinery alter protein abundance broadly, while perturbations of transcriptional machinery can tune the abundance of individual proteins. Thus, genes with post-transcriptional functions are more likely to appear as hubs in protein regulatory networks, whereas genes with transcriptional functions are likely to show fewer connections.”

      Overall, the strengths of this study far outweigh these weaknesses. This manuscript represents a very large amount of work and demonstrates important new insights into protein regulatory networks.

    1. Author Response

      Reviewer #2 (Public Review):

      The authors seek to determine how various species combine their effects on the growth of a species of interest when part of the same community.

      To this end, the authors carry out an impressive experiment containing what I believe must be one of the largest pairwise + third-order co-culture experiments done to date, using a high-throughput co-culture system they had co-developed in previous work. The unprecedented nature of this data is a major strength of the paper. The authors also discover that species combine their effect through "dominance", i.e. the strongest effect masks the others. This is important as it calls into question the common assumption of additivity that is implicit in the choice of using Lotka-Volterra models.

      A stronger claim (i.e. in the abstract) is that the joint effect of multiple species on the growth of another can be derived from the effect of individual species. Unless I am misunderstanding something, this statement may have to be qualified a little, as the authors show that a model based on pairwise dominance (i.e. the strongest pairwise) does a somewhat better job (lower RMSD, though granted, not by much, 0.57 vs 0.63) than a model based on single species dominance. That is, the effect of the strongest pair predicts the effect of a trio better than the effect of the strongest single species does.

      This issue makes one wonder whether, had the authors included higher-order combinations of species (i.e. five-member consortia or higher), the strongest-effect trio would have predicted better than the strongest-effect pair, which in turn is a better predictor than the strongest-effect species. This is important, as it would help one determine to what extent the strongest-effect model would work in more diverse communities, such as those one typically finds in nature. Indeed, the authors find that the predictive ability of the strongest effect species is much stronger for pairs than it is for trios (RMSD of 0.28 vs 0.63). Does the predictive ability of the single species model decline faster and faster as diversity grows beyond 4-member consortia?

      Thank you for raising this important point. It is true that in our study we see that single species predict pairs better than trios, and that pairs predict trios better than single species. As we did not perform experiments on more diverse communities (n>4), we are not sure if or how these rules will scale up. We explicitly address these caveats in our revised discussion.

      Reviewer #3 (Public Review):

      A problem in synthetic ecology is that one can't brute-force complex community design because combinatorics make it basically impossible to screen all possible communities from a bank of possible species. Therefore, we need a way to predict phenomena in complex communities from phenomena in simple communities. This paper aims to improve this predictive ability by comparing a few different simple models applied to a large dataset obtained with the use of the author's "kchip" microfluidics device. The main question they ask is whether the effect of two species on a focal species is predicted from the mean, the sum, or the max of the effect of each single "affecting" species on the focal species. They find that the max effect is often the best predictor, in the sense of minimizing the difference between predicted effect and measured effect. They also measure single-species trait data for their library of strains, including resource niche and antibiotic resistance, and then find that Pearson correlations between distance calculations generated from these metrics and the effect of added species are weak and unpredictive. This work is largely well-done, timely and likely to be of high interest to the field, as predicting ecosystem traits from species traits is a major research aim.
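The three candidate models named in this summary (mean, sum, and max of the single-species effects) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: effects are treated as signed numbers, "strongest" is taken to mean the single effect with the largest magnitude, and the function names and all measurements are hypothetical.

```python
from math import sqrt

MODELS = ("mean", "additive", "strongest")


def predict_pair_effect(effect_a, effect_b, model):
    """Predict the joint effect of two affecting species on a focal species."""
    if model == "mean":
        return (effect_a + effect_b) / 2
    if model == "additive":
        return effect_a + effect_b
    if model == "strongest":
        # Assumption: the effect with the larger absolute value dominates.
        return max(effect_a, effect_b, key=abs)
    raise ValueError(f"unknown model: {model}")


def rmse(predicted, observed):
    """Root-mean-square error between predictions and measurements."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical (single A, single B, measured pair) effects; not study data.
measurements = [(-1.0, -0.2, -1.1), (-0.5, -0.4, -0.6), (0.3, -0.8, -0.7)]
observed = [pair for _, _, pair in measurements]

scores = {}
for model in MODELS:
    predicted = [predict_pair_effect(a, b, model) for a, b, _ in measurements]
    scores[model] = rmse(predicted, observed)
    print(f"{model:>9}: RMSE = {scores[model]:.3f}")
```

The toy numbers were chosen so that the strongest-effect model has the lowest error; they only show how the comparison is set up and say nothing about which model performs best on real data.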

      My main criticism is that the main take-home from the paper (fig 3B)-that the strongest effect is the best predictor-is oversold. While it is true that, averaged over their six focal species, the "strongest effect" was the best overall predictor, when one looks at the species-specific data (S9), we see that it is not the best predictor for 1/3 of their focal species, and this fraction grows to 1/2 if one considers a difference in nRMSE of 0.01 to be negligible.

      As suggested, we have softened our language regarding the take-home message. This matter is addressed in detail above in response to 'Essential Revisions'. Briefly, we see that the strongest model works best when both single species have qualitatively similar effects, but is slightly less accurate when effects are mixed. We also see overall less accurate predictions for positive effects. In light of these findings, we propose that cases in which the strongest model is not the most accurate for a focal species are due to the interaction types and are not specific to that focal species.

      We made substantial changes to the manuscript, including the first paragraph of the discussion which more accurately describes these findings and emphasizes the relevant caveats:

      "By measuring thousands of simplified microbial communities, we quantified the effects of single species, pairs, and trios on multiple focal species. The most accurate model, overall and specifically when both single species effects were negative, was the strongest effect model. This is in stark contrast to models often used in antibiotic compound combinations, despite most effects being negative, where additivity is often the default model (Bollenbach 2015). The additive model performed well for mixed effects (i.e. one negative and one positive), but only slightly better than the strongest model, and poorly when both species had effects of the same sign. When both single species’ effects were positive, the strongest model was also the best, though the difference was less pronounced and all models performed worse for these interactions. This may be due to the small effect size seen with positive effects, as when we limited negative and mixed effects to a similar range of effects strength, their accuracy dropped to similar values (Figure 3–Figure supplement 5). We posit that the difference in accuracy across species is affected mainly by the effect type dominating different focal species' interactions, rather than by inherent species traits (Figure 3–Figure supplement 6)." (Lines 288-304)

      The same criticism applies to the result from figure 2-that pairs of affecting species have more negative effects than single species. Considered across all focal species this is true (though minor in effect size, Fig 2A). But there is only a significant effect within two individual species. Again, this points to the effects being focal-species-specific, and perhaps not as generalizable as is currently being claimed.

      Upon more rigorous analysis, and with regard to changes in the dataset after filtering, we see that the more accurate statement is that effects become stronger, not necessarily more negative (in line with the accuracy of the strongest model). The overall trend is towards more negative interactions, due to the majority of interactions being negative, but as stated this is not true for each individual focal. As such the following sentence in the manuscript has been changed:

      "The median effect on each focal was more negative by 0.28 on average, though the difference was not significant in all cases; additionally, focals with mostly positive single species interactions showed a small increase in median effect (Fig. 2D)" (Lines 151-154)

      As well as the title of this section: "Joint effects of species pairs tend to be stronger than those of individual affecting species" (Lines 127-128)

      Another thing that points to a focal-species-specific response is Fig 2D, which shows the distributions of responses of each focal species to pairs. Two of these distributions are unimodal, one appears bimodal, and three appear tri-modal. This suggests to me that the focal species respond in categorically different ways to species addition.

      We believe this distribution of pair effects is related to the distribution of single species effects, and not to the way in which different focal species respond to the addition of second species. Though this may be difficult to see from the swarm plots shown in the paper, below is a split violin plot that emphasizes this point.

      Fig R1: Distribution of single species and pair effects. Distribution of the effect of single and pairs of affecting species for each focal species individually. Dashed lines represent the median, while dotted lines the interquartile range.

      These differences occur even though the focal bacteria are all from the same family. This suggests to me that the generalizability may be even less when a more phylogenetically dispersed set of focal species are used.

      We have added the following sentence to the discussion explicitly emphasizing the phylogenetic limitations of our study:

      "Lastly, it is important to note that our focal species are all from the same order (Enterobacterales), which may also limit the purview of our findings." (Lines 364-366)

      Considering these points together, I argue that the conclusion should be shifted from "strongest effect is the best" to "in 3 of our focal species, strongest effect was the best, but this was not universal, and with only 6 focal species, we can't know if it will always be the best across a set of focal species".

      As mentioned above, we have softened our language regarding the take-home message in response to these evaluations.

      My second main criticism is that it is hard to understand exactly how the trait data were used to predict effects. It seems like it was just Pearson correlation coefficients between interspecies niche distances (or antibiotic distances) and the effect. I'm not very surprised these correlations were unpredictive, because the underlying measurements don't seem to be relevant to the environment tested. What if, rather than using niche data across 20 nutrients, only the growth data on glucose (the carbon source in the experiments) was used? I understand that in a field experiment, for example, one might not know what resources are available, and so measuring niche across 20 resources may be the best thing to do. Here though it seems imperative to test using the most relevant data.

      It is true that much of the profiling data is not directly related to the experimental conditions (different carbon sources and antibiotics), but in addition to these we do use measurements from experiments carried out in the same environment as the interactions assays (i.e. growth rate and carrying capacity when growing on glucose), which also showed poor correlation with the effects on focals. Additionally, we believe that these profiles contain relevant information regarding metabolic similarity between species (similar to metabolic models often constructed computationally). To improve clarity, we added the following sentence to the figure legend of Figure 3–Figure supplement 1:

      "The growth rate, and maximum OD shown in panel A were measured only in M9 glucose, similar to conditions used in the interaction assays." (Lines 591-592)

      Additionally and relatedly, it would be valuable to show the scatterplots leading to the conclusion that trait data were uninformative. Pearson's r only works on an assumption of linearity. But there could be strong relationships between the trait data and effect that are monotonic but not linear, or even that are non-monotonic yet still strong (e.g. U-shaped). For the first case, I recommend switching to Spearman's rho over Pearson's r, because it only assumes monotonicity, not linearity. If there are observable relationships that are not monotonic, a different test should be used.

      Per your suggestion, we have changed the measurement of correlation in this analysis from Pearson's r, to Spearman's rho. As we observed similar, and still mostly weak correlations, we did not investigate these relationships further. See Figure 3–Figure supplement 1.
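The difference between the two coefficients can be made concrete with a small, dependency-free sketch. It is illustrative only: the helper names and the data are hypothetical, and Spearman's rho is computed here as Pearson's r on average ranks, which is its standard definition.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's r: strength of the linear relationship between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical trait distances and effects: monotonic but strongly nonlinear.
trait_distance = [1, 2, 3, 4, 5, 6]
effect = [-(2.0 ** d) for d in trait_distance]

print(f"Pearson r    = {pearson(trait_distance, effect):.3f}")
print(f"Spearman rho = {spearman(trait_distance, effect):.3f}")
```

For a perfectly monotonic relationship the rank-based coefficient reaches its extreme value while Pearson's r does not, which is why the reviewer's suggestion matters when linearity cannot be assumed. Neither coefficient would detect a strong non-monotonic (for example U-shaped) dependency.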

      Additionally, we generated heat maps including scatterplots mapping the data leading to these correlations. We found no notable dependency in these plots, and visually they were quite crowded and difficult to interpret. As this is not the central point of our study, we ultimately decided against adding this information to the plots.

      In general, I think the analyses using the trait data were too simplistic to conclude that the trait data are not predictive.

      We agree that more sophisticated analyses may help connect species traits to their effects on focal species. In fact, other members of our research group have recently used machine learning to accomplish similar predictions (https://doi.org/10.1101/2022.08.02.502471). As such, we have changed the wording to reflect that this correlation is difficult to find using simple analyses:

      "These results indicate that it may be challenging to connect the effects of single and pairs of species on a focal strain to a specific trait of the involved strains, using simple analysis." (Lines 157-159)

    1. Author Response

      Reviewer #1 (Public Review):

      Slusarczyk et al. present a very well-written manuscript focused on understanding the mechanisms underlying aging of erythrophagocytic macrophages in the spleen (RPM) and its relationship to iron loading with age. The manuscript is diffuse with a broad swath of data elements. Importantly, the manuscript demonstrates that RPM erythrophagocytic capacity is diminished with age and restored in aged mice fed an iron-restricted diet. In addition, the mechanism for declining RPM erythrophagocytic capacity appears to be ferroptosis-mediated, insensitive to heme as it is to iron, and to occur independently of ROS generation. These are compelling findings. However, some of the conclusions rely on conjecture, and a clear causal association is not established. The main conclusion of the manuscript points to the accumulation of unavailable insoluble forms of iron as both causing and resulting from decreased RPM erythrophagocytic capacity.

      We are proposing that intracellular iron accumulation progresses first and leads to global proteotoxic damage and increased lipid peroxidation. This eventually triggers the death of a fraction of aging RPMs, thus promoting the formation of extracellular iron-rich protein aggregates. More explanation can be found below. Besides, iron loading suppresses the erythrophagocytic activity of RPMs, hence further contributing to their functional impairment during aging.

      In addition, the finding that IR diet leads to increased TF saturation in aged mice is surprising.

      We believe that this observation implies better mobilization of splenic iron stores, and corroborates our conclusion that mice that age on an iron-reduced diet benefit from higher iron bioavailability, although these differences are relatively mild. More explanation can be found in our replies to Reviewer #2.

      Furthermore, whether the finding in RPMs is intrinsic or related to RBC-related changes with aging is not addressed.

      We have now addressed this issue and characterized both iron and ROS levels in RBCs in more detail.

      Finally, these findings in a single strain and only in female mice are intriguing but warrant tempered conclusions.

      We tempered the conclusions and provided a basic characterization of the RPM aging phenotype in Balb/c female mice.

      Major points:

      1) The main concern is that there is no clear explanation of why iron increases during aging although the authors appear to be saying that iron accumulation is both the cause of and a consequence of decreased RPM erythrophagocytic capacity. This requires more clarification of the main hypothesis on Page 4, line 17-18.

      We thank the reviewer for this comment. It was previously reported that iron accumulates substantially in the spleen during aging, especially in female mice (Altamura et al., 2014). Since RPMs are the cells that process most of the iron in the spleen, we aimed to explore the relationship between iron accumulation and RPM functions during aging. This investigation led us to uncover that iron accumulation is indeed both the cause and the consequence of RPM dysfunction. Specifically, we propose that intracellular iron loading of RPMs precedes extracellular deposition of iron in the form of protein-rich aggregates, driven by RPM damage. To support this, we now show that the proteome of RPMs overlaps with those proteins that are present in the age-triggered aggregates (Fig. 3F). Furthermore, corroborating our model, we now demonstrate that transient iron loading of RPMs via iron-dextran injection (new Fig. 3G) leads to the formation of protein-rich aggregates, closely resembling those present in aged spleens (new Fig. 3H). This implies that high iron content in RPMs is indeed a major driving factor that leads to aggregation of their proteome and cell damage. Importantly, we have now supported this model with studies using iRPMs. We demonstrated that iron loading and blockage of ferroportin by synthetic mini-hepcidin (PR73) (Stefanova et al., 2018) cause protein aggregation in iRPMs and lead to their decreased viability only in cells that were exposed to heat shock, a well-established trigger of proteotoxicity (new Fig. 5K and L). We propose that these two factors, namely age-triggered decrease in protein homeostasis and exposure to excessive iron levels, act in concert and render RPMs particularly sensitive to damage during aging (see also Discussion, p. 16).

      In parallel, our data imply that the increased iron content in aged RPMs drives their decreased erythrophagocytic activity, as we have now better documented by more extensive in vitro experiments in iRPMs (new Fig 6E-H). We cannot exclude that some of the senescent splenic RBCs that are retained in the red pulp and evade erythrophagocytosis due to RPM defects in aging may also contribute to the formation of the aggregates. This is supported by the fact that mice that lack RPMs also exhibit iron loading in the spleen (Kohyama et al., 2009; Okreglicka et al., 2021), and that the proteome of aggregates overlaps to some extent with the proteome of erythrocytes (new Fig. 3F).

      We believe that during aging intracellular iron accumulation is chiefly driven by ferroportin downregulation, as also suggested by Reviewer #3. We now show that ferroportin drops significantly already in mice aged 4 and 5 months (new Fig. 4H), preceding most of the other impairments. This drop coincides with the increase in hepcidin expression, but whether this is the sole reason for ferroportin suppression during early aging would require further investigation outside the scope of the present manuscript.

      In sum, to address this comment, we have now modified the fragment of the introduction that refers to our hypothesis and major findings to make it clearer (p. 4), improved our manuscript by providing the new data mentioned above, and added more explanation in the corresponding sections of the Results and Discussion.

      2) It is unclear if RPMs are in limited supply. Based on the introduction (page 4, line 13-15), they have limited self-renewal capacity and blood monocytes only partially replenished. Fig 4D suggests that there is a decrease in RPMs from aged mice. The %RPM from CD45+ compartment suggests that there may just be relatively more neutrophils or fewer monocytes recruited. There is not enough clarity on the meaning of this data point.

      Thank you for this comment. We fully agree that %RPMs of CD45+ splenocytes, although well-accepted in literature (Kohyama et al., 2009; Okreglicka et al., 2021), is only a relative number. Hence, we now included additional data and explanations regarding the loss of RPMs during aging.

      It was reported that the proportion of RPMs derived from bone marrow monocytes increases mildly but progressively during aging (Liu et al., 2019). This implies that due to the loss of the total RPM population, as illustrated by our data, the cells of embryonic origin are likely even more affected. We could confirm this assumption by re-analysis of the data from Liu et al. that we now included in the manuscript as Fig. 5E. These data clearly show that the representation of embryonically-derived RPMs drops more drastically than the percent of total RPMs, whereas the replenishment rate from monocytes is not affected significantly during aging. Consistent with this, we have not observed any robust change in the population of monocytes (F4/80-low, CD11b-high) or pre-RPMs (F4/80-high, CD11b-high) in the spleen at the age of 10 months (Figure 5-figure supplement 2A and B). We also have detected a mild decrease, not an increase, in the number of granulocytes (new Figure 5-figure supplement 2C). Furthermore, we measured in situ apoptosis marker and found a clear sign of apoptosis in the aged spleen (especially in the red pulp area), a phenotype that is less pronounced in mice on an IR diet (new Fig. 5O). This is consistent with the observation that apoptosis markers can be elevated in tissues upon ferroptosis induction (Friedmann Angeli et al., 2014) and that the proteotoxic stress in aged RPMs, which we now emphasized better in our manuscript, may also lead to apoptosis (Brancolini & Iuliano, 2020). Taken together, we strongly believe that the functional defect of embryonically-derived RPMs chiefly contributes to their shortage during aging.

      3) Anemia of aging is complex and poorly understood mechanistically. In general, it is considered similar to anemia of chronic inflammation with increased Epo, mild drop in Hb, and erythroid expansion, similar to ineffective erythropoiesis / low Epo responsiveness. It is not surprising that IR diet did not impact this mild anemia. However, was the MCV or MCH altered in aged and IR aged mice?

      We have now included the data for hematocrit, RBC counts, MCV, and MCH in Figure 1-figure supplement 5. Hematocrit shows a similar tendency to hemoglobin levels, but the values for RBC counts, MCV, and MCH do not appear to be altered. We also now show that the erythropoietic activity in the bone marrow is not affected in aged versus young mice. Taken together, the anemic phenotype in female C57BL/6J mice at this age is very mild, which we emphasized in the main text, and is likely affected by factors other than serum iron levels (p. 6).

      4) Page 6, line 23 onward: the conclusion is that KC compensate for the decreased function of RPM in the spleen, based on the expansion of KC fraction in the liver. Is there evidence that KCs are engaged in more erythrophagocytosis in aged mice? Furthermore, iron accumulation in the liver with age does not demonstrate specifically enhanced erythrophagocytosis of KC. Please clarify why liver iron accumulation would not be simply a consequence of increased parenchymal iron similar to increased splenic iron with age, independent of erythrophagocytic activity in resident macrophages in either organ.

      Thanks for these questions. To quantify the erythrophagocytosis rate in KCs, we show, as for the RPMs (Fig. 1K), the % of PKH67-positive macrophages following transfusion of PKH67-stained stressed RBCs (Fig. 1M). The data imply a mild (not statistically significant) drop (of approx. 30%) in EP activity. We believe that it is overridden by a more pronounced (on average, 2-fold) increase in the representation of KCs (Fig. 1N). The mechanisms of iron accumulation in the spleen and the liver are very different. In the liver, we observed iron deposition in the parenchymal cells (not non-parenchymal, new Fig. 1P) that we are currently characterizing in more detail in a parallel manuscript. Our data demonstrate a drop in transferrin saturation in aged mice. Hence, it is highly unlikely that aging would be hallmarked by the presence of circulating non-transferrin-bound iron that would be sequestered by hepatocytes, as shown previously (Jenkitkasemwong et al., 2015). Thus, the iron released locally by KCs is the most likely contributor to progressive hepatocytic iron loading during aging. The mechanism of iron delivery to hepatocytes from erythrophagocytosing KCs was demonstrated by Theurl et al. (2016), and we propose that it may be operational, although on a much more prolonged time scale, during aging. We now discuss this part more thoroughly in our Results section (p. 7).

      5) Unclear whether the effect on RPMs is intrinsic or extrinsic. Would be helpful to evaluate aged iRPMs using young RBC vs. young iRPMs using old RBCs.

      We are skeptical that generating iRPMs from aged mice would be helpful. These cells are a specific type of primary macrophage culture, derived from bone marrow monocytes with MCSF1 and additionally exposed to heme and IL-33 for 4 days. We do not expect bone marrow monocytes to be heavily affected by aging, nor, therefore, to recapitulate aspects of aged RPMs from the spleen, especially after 8-day in vitro culture. However, to address the concerns of the reviewer, we now provide additional data regarding RBC fitness. Consistent with the time life-span experiment (Fig. 2A), we show that oxidative stress is increased only in splenic, but not circulating, RBCs (new Fig. 2C, replacing the old Fig. 2B and C). In addition, we show no signs of age-triggered iron loading in RBCs, either in the spleen (new Fig. 2F) or in the circulation (new Fig. 2B). Hence, we do not envision a possibility that RPMs become iron-loaded during aging as a result of erythrophagocytosis of iron-loaded RBCs. In support of this, we have also observed that during aging the FPN levels of RPMs drop first, the erythrophagocytosis rate decreases afterward, and lastly, RBCs start to exhibit significantly increased oxidative stress (presented now in new Fig. 4H, J and K).

      6) Discussion of aggregates in the spleen of aged mice (Fig 2G-2K and Fig 3) is very descriptive and non-specific. For example, if the iron-rich aggregates are hemosiderin, a hemosiderin-specific stain would be helpful. This data specifically is correlatory and difficult to extract value from.

      Thanks for these comments. To the best of our knowledge, Perls' Prussian blue staining (Fig. 2J) is considered a hemosiderin staining. Our investigations aimed to better understand the nature and the origin of splenic iron deposits that to some extent are referred to as hemosiderin. Most importantly, as mentioned in our reply R1 Ad. 1, to assign causality to our data, we now demonstrate that iron accumulation in RPMs in response to iron-dextran (Fig. 3G) increases lipid peroxidation (Fig. 5F), tends to provoke RPM depletion (Fig. 5G), and triggers the formation of protein-rich aggregates (new Fig. 3H). Of note, we assume that the loss of embryonically-derived RPMs in this model may be masked by simultaneous replenishment of the niche from monocytes, a phenomenon that may be addressed by future studies using Ms4a3-driven reporter mice (as shown for aged mice in our new Fig. 5E).

      7) The aging phenotype in RPMs appears to be initiated sometime after 2 months of age. However, there is some reversal of the phenotype with increasing age, e.g. Fig 4B with decreased lipid peroxidation in 9 month old relative to 6 month old RPMs. What does this mean? Why is there a partial spontaneous normalization?

      Thanks for this comment and questions. Indeed, the degree of lipid peroxidation exhibits some kinetics, suggestive of partial normalization. Of note, such a tendency is not evident for other aging phenotypes of RPMs; hence, we did not emphasize this in the original manuscript. However, in the revised version of the manuscript, we now present the re-analysis of the published data, which implies that the number of embryonically-derived RPMs drops substantially between mice at 20 weeks and 36 weeks (new Fig. 5E). We think that the higher proportion of monocyte-derived RPMs in the total RPM population later in aging (9 months) might be responsible for the partial alleviation of lipid peroxidation. We now discuss this possibility in the Results section (p. 12).

      8) Does the aging phenotype in RPMs respond to ferrostatin? It appears that NAC, which is a glutathione generator and can reverse ferroptosis, does not reverse the decreased RPM erythrophagocytic capacity observed with age, yet the authors still propose that ferroptosis is involved. A response to ferrostatin is a standard and acceptable approach to evaluating ferroptosis.

      We fully agree with the Reviewer that using ferrostatin or Liproxstatin-1 would be very helpful to fully characterize the mechanism of RPM depletion in mice. However, previous in vivo studies involving Liproxstatin-1 administration required daily injections of this ferroptosis inhibitor (Friedmann Angeli et al., 2014). This would be hardly feasible during aging. Regarding the experiments involving iron-dextran injection, using Liproxstatin-1 would require additional permission from the ethical committee, which takes time to be processed and received. However, to address this question we now provide data from iRPM cell cultures (new Fig. 5K-L). In essence, our results imply that proteotoxic stress and iron overload act in concert to trigger cytotoxicity in the in vitro RPM model. Interestingly, this phenomenon does not depend solely on the increased lipid peroxidation, but when we neutralize the latter with Liproxstatin-1, the cytotoxic effect is diminished (please see also Results on p. 13 and Discussion on p. 15/16).

      9) The possible central role for HO-1 in the pathophysiology of decreased RPM erythrophagocytic capacity with age is interesting. However, it is not clear how the authors arrived at this hypothesis and would be useful to evaluate in the least whether RBCs in young vs. aged mice have more hemoglobin as these changes may be primary drivers of how much HO-1 is needed during erythrophagocytosis.

      Thanks for this comment. We became interested in HO-1 levels based on the RNA sequencing data, which detected lower Hmox-1 expression in aged RPMs (Figure 3-figure supplement 1). We now show that the content of hemoglobin is not significantly altered in aged RBCs (MCH parameter, Figure 1-figure supplement 5E); hence, we do not think that this is the major driver of Hmox-1 downregulation. Likewise, the levels of the Bach1 message, a gene encoding the Hmox-1 transcriptional repressor, are not significantly altered according to the RNAseq data. Hence, the reason for the transcriptional downregulation of Hmox-1 is not clear. Of note, HO-1 protein levels in the total spleen are higher in aged versus young mice, and we also detected a clear appearance of its nuclear, truncated, and enzymatically inactive form (see the figure below; we opt not to include this in the manuscript for better clarity). The appearance of truncated HO-1 seems to be partially rescued by the IR diet. It is well established that the nuclear form of HO-1 emerges via proteolytic cleavage and migrates to the nucleus under conditions of oxidative stress (Mascaro et al., 2021). This additionally confirms that the aging spleen is hallmarked by an increased burden of ROS. Moreover, we also detected HO-1 as one of the components of the iron-rich protein aggregates. Thus, we propose that the low levels of the cytoplasmic, enzymatically active form of HO-1 in RPMs (which we preferentially detect with our intracellular staining and flow cytometry) may be explained by its nuclear translocation and its sequestration in protein aggregates that evade antibody binding. This is also supported by our observation that the protein aggregates, despite their high content of ferritin (as indicated by MS analysis), are negative for L-ferritin staining. Of note, we also cannot exclude that other cell types in the aging spleen (e.g. lymphocytes) express higher levels of HO-1 in response to splenic oxidative stress.

      Fig. Total splenic levels of HO-1 in young, aged IR and aged mice.

      Reviewer #2 (Public Review):

      Slusarczyk et al. investigate the functional impairment of red pulp macrophages (RPMs) during aging. When red blood cells (RBCs) become senescent, they are recycled by RPMs via erythrophagocytosis (EP). This leads to an increase in intracellular heme and iron both of which are cytotoxic. The authors hypothesize that the continuous processing of iron by RPMs could alter their functions in an age-dependent manner. The authors used a wide variety of models: in vivo model using female mice with standard (200ppm) and restricted (25ppm) iron diet, ex vivo model using EP with splenocytes, and in vitro model with EP using iRPMs. The authors found iron accumulation in organs but markers for serum iron deficiency. They show that during aging, RPMs have a higher labile iron pool (LIP), decreased lysosomal activity with a concomitant reduction in EP. Furthermore, aging RPMs undergo ferroptosis resulting in a non-bioavailable iron deposition as intra and extracellular aggregates. Aged mice fed with an iron restricted diet restore most of the iron-recycling capacity of RPMs even though the mild-anemia remains unchanged.

      Overall, I find the manuscript to be of significant potential interest. But there are important discrepancies that need to be first resolved. The proposed model is that during aging both EP and HO-1 expression decreases in RPMs but iron and ferroportin levels are elevated. In their model, the authors show intracellular iron-rich proteinaceous aggregates. But if HO-1 levels decrease, intracellular heme levels should increase. If Fpn levels increase, intracellular iron levels should decrease. How does LIP stay high in RPMs under these conditions? I find these to be major conflicting questions in the model.

      We thank the Reviewer for her/his valuable feedback. As we mentioned in our replies, we can only assume that a small misunderstanding in the interpretation of the presented data underlies this comment. We show that ferroportin levels in RPMs (Fig. 1F) are modulated in a manner that fully reflects the iron status of these cells (both labile and total iron levels, Figs. 1H and I). FPN levels drop in aged RPMs and are rescued when mice are maintained on a reduced iron diet. As pointed out by Reviewer #3 and explained in our replies, we believe that ferroportin levels are critical for the observed phenotypes in aging. We now describe our data more clearly to avoid any potential misinterpretation (p. 6).

      Reviewer #3 (Public Review):

      This is a comprehensive study of the effects of aging on the function of red pulp macrophages (RPM) involved in iron recycling from erythrocytes. The authors document that insoluble iron accumulates in the spleen, that RPM become functionally impaired, and that these effects can be ameliorated by an iron-restricted diet. The study is well written, carefully done, extensively documented, and its conclusions are well supported. It is a useful and important addition for at least three distinct fields: aging, iron and macrophage biology.

      The authors do not explain why an iron-restricted diet has such a strong beneficial effect on RPM aging. This is not at all obvious. I assume that the number of erythrocytes that are recycled in the spleen, and are by far the largest source of splenic iron, is not changed much by iron restriction. Is the iron retention time in macrophages changed by the diet, i.e. the recycled iron is retained for a short time when diet is iron-restricted (making hepcidin low and ferroportin high), and long time when iron is sufficient (making hepcidin high and ferroportin low)? Longer iron retention could increase damage and account for the effect. Possibly, macrophages may not empty completely of iron before having to ingest another senescent erythrocyte, and so gradually accumulate iron.

      We are very grateful to this Reviewer for emphasizing the importance of the iron export capacity of RPMs as a possible driver of the observed phenotypes. Indeed, as mentioned above, we now show in the revised version of the manuscript that ferroportin drops early during aging (revised Fig. 4). Importantly, we now also observe that iron loading and limitation of iron export from iRPMs via ferroportin aggravate the impact of heat shock (a well-accepted trigger of proteotoxicity) on both protein aggregation and cell viability (new Fig. 5K and L). Physiologically, recent findings show that aging promotes a global decrease in protein solubility [BioRxiv manuscript (Sui X. et al., 2022)], and it is very likely that the constant exposure of RPMs to high iron fluxes renders these specialized cells particularly sensitive to proteome instability. This could be further aggravated by a build-up of iron due to the drop in ferroportin early during aging, ultimately leading to the appearance of the protein aggregates as early as 5 months of age in C57BL/6J females. Based on the new data, we emphasized this model in the revised version of the manuscript (please see Discussion on p. 16).

    1. Author Response

      Reviewer #1 (Public Review):

      1) It would be helpful to include some sort of comparison in Fig. 4, e.g. the regressions shown in Fig 3, to indicate to what extent the ICCl data corresponds to the "control range" of frequency tuning.

      Figure 4 was modified to show the frequency range typically found in the ICCls. This range is based on results from Wagner et al., 2007, which extensively surveyed ICCls responses. This modification shows that our ICCls recordings in the ruff-removed owls cover the normal frequency hearing range of the owl.

      2) A central hypothesis of the study is that the frequency preference of the high-frequency neurons is lower in ruff-removed owls because of the lowered reliability caused by a lack of the ruff. Yet, while lower, the frequency range of many neurons in juvenile and ruff-removed owls seems sufficiently high to be still responsive at 7-8 kHz. I think it would be important to know to what extent neurons are still ITD sensitive at the "unreliable high frequencies" even if the CFs are lower since the "optimization" according to reliability depends not on the best frequency of each neuron per se, but whether neurons are less ITD sensitive at the higher, less reliable frequencies.

      The concern regarding the frequency range that elicits responsivity was largely addressed above. Specifically, Figure L1, which shows the frequency tuning of frontally tuned ICx neurons in ruff-removed owls, indicates that while there is some variability of tuning across neurons, there is little responsivity above 6 kHz. In contrast, the equivalent analysis in juvenile owls (Figure L3) shows much more responsiveness and variability across neurons to high and low frequencies. This evidence supports our hypothesis that the juvenile owl brain is still highly plastic, which facilitates learning during development. Although the underlying data were already reported in Figure 7 of our previously submitted manuscript, we can include Figures L1 and L2, potentially as supplemental figures, if considered useful by editors and reviewers. Nevertheless, this argument was further expanded in the revised text (Line 229).

      Figure L1. Frequency tuning of frontally-tuned ICx neurons in ruff-removed owls. Tuning curves are normalized by the max response. Thick black line indicates the average tuning curve. Dashed black line indicates basal response.

      Figure L2. ITD sensitivity across frequencies in ruff-removed owl. Two example neurons shown in a and b. ITD tuning for tones (colored) and broadband (black) plotted by firing rate (non-normalized). Solid colored lines indicate responses to frequencies that are within the neuron’s preferred frequency range (i.e. above the half-height, see Methods), dashed lines indicate frequencies outside of the neuron’s frequency range.

      Figure L3. Frequency tuning of frontally-tuned ICx neurons in juvenile owls. Tuning curves are normalized by the max response. Thick black line indicates the average tuning curve. Dashed black line indicates basal response.

      3) It would be interesting to have an estimate of the time scale of experience dependency that induces tuning changes. Do the authors have any data on this question? I appreciate the authors' notion that the quantifications in Fig 7 might indicate that juvenile owls are already "beginning to be shaped by ITD reliability" (line 323 in Discussion). How many days after hearing onset would this correspond to? Does this mean that a few days will already induce changes?

      While tracking changes induced by ruff-removal over development was outside the scope of this study, many other studies have assessed experience-dependent plasticity in the barn owl. The recordings in this study were performed approximately 20 days after hearing onset, suggesting that the juveniles had ample time to begin learning. These points were expanded upon in the discussion (Lines 254, 280-283).

      Reviewer #2 (Public Review):

      1) Why is IPD variability plotted instead of ITD variability (or indeed spatial reliability)? The relationship between these measures is likely to vary across frequency, which makes it difficult to compare ITD variability across frequency when IPDs are plotted. Normalizing data across frequencies also makes it difficult to compare different locations and acoustical conditions. For example, in Fig.1a and Fig.1b, the data shown for 3 kHz at ~160 degrees seems quantitatively and visually quite different, but the difference (in Fig.1c) appears to be negligible.

      Justification for using IPD variability as an estimate of ITD variability was added to the introduction (Lines 55-60), results (Line 100), and methods (Lines 371-374) sections of the manuscript. Because ITD detection relies on phase locking by auditory nerve and ITD detector neurons tuned to narrow frequency bands, the responses that ITD detector neurons forward to downstream midbrain regions are determined by IPD variability. Additionally, ITD is calculated by dividing IPD by frequency, which makes comparisons of ITD reliability across frequency mathematically uninformative.
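
      The relationship between IPD and ITD invoked above can be illustrated with a minimal sketch. It assumes IPD is expressed in radians and frequency in Hz, so that ITD = IPD / (2π · frequency); the function name `ipd_to_itd_us` and the example values are hypothetical and are not taken from the manuscript. The sketch only shows that an identical phase spread corresponds to different time spreads at different frequencies.

```python
import math


def ipd_to_itd_us(ipd_rad: float, freq_hz: float) -> float:
    """Convert an interaural phase difference (radians) at a given
    frequency (Hz) into an interaural time difference (microseconds).

    Assumes ITD = IPD / (2 * pi * f); the name is illustrative only.
    """
    if freq_hz <= 0:
        raise ValueError("freq_hz must be positive")
    return ipd_rad / (2.0 * math.pi * freq_hz) * 1e6


if __name__ == "__main__":
    # The same phase spread maps onto a smaller time spread at a
    # higher frequency, purely because of the division by frequency.
    phase_spread = math.pi / 4  # hypothetical IPD spread in radians
    for freq in (2000.0, 8000.0):
        print(freq, ipd_to_itd_us(phase_spread, freq))
```

      Under this assumption, variability expressed in ITD units shrinks with frequency even when phase variability is unchanged, which is one way to read the statement that cross-frequency comparisons of ITD reliability are uninformative.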

      2) How well do the measures of ITD reliability used reflect real-world listening? For example, the model used to calculate ITD reliability appears to assume the same (flat) spectral profile for targets and distractors, which are presented simultaneously with the same temporal envelope, and a uniform spatial distribution of sounds across space. It is therefore unclear how robust the study's results are to violations of these assumptions.

      While we agree that our analysis cannot completely capture real-world listening for the barn owl, a general analysis using similar flat spectral profiles for targets and concurrent sounds provides a broad assessment of reliability of ITD cues. While a full recapitulation of real-world listening is beyond the scope of this study (i.e. recording natural scenes from the ear canals of wild barn owls), we included additional analyses of ITD reliability in Figure 1-figure supplement 1, described above.

      3) Does facial ruff removal produce an isolated effect on ITD variability or does it also produce changes in directional gain, and the relationship between spatial cues and sound location? Although the study considers this issue in some places (e.g. Fig.2, Fig.5), a clearer presentation of the acoustical effects of facial ruff removal and their implications (for all locations, not just those to the front), as well as an attempt to understand how these acoustical changes lead to the observed changes in ITD reliability, would greatly strengthen the study. In addition, Fig.1 shows average ITD reliability across owls, but it would be helpful to know how consistent these measures are across owls, given individual variability in Head-Related Transfer Functions (HRTFs). This potentially has implications for the electrophysiological experiments, if the HRTFs of those animals were not measured. One specific question that is potentially very relevant is whether the facial ruff attenuates sounds presented behind the animal and whether it does so in a frequency-dependent way. In addition, if facial ruff removal enables ILDs to be used for azimuth, then ITDs may also become less necessary at higher frequencies, even if their reliability remains unchanged.

      Additional analysis was conducted to represent the changes in directional gain induced by ruff removal, and was added as a new figure (Fig 5). This analysis shows that changes in gain following ruff-removal are largely frequency-independent: there is a de-attenuation of peripherally and rearwardly located sounds, but the highest gain remains for high frequencies in frontal space. There is an additional increase in gain for high frequencies from rearward space, but these changes would not explain the changes in frequency tuning we report. As mentioned in new additions to the manuscript, the changes at the most rearward-located auditory spatial locations are unlikely to have an effect on the auditory midbrain. No studies in the barn owl have found neurons in the ICx or optic tectum tuned to >120° (Knudsen, 1982; Knudsen, 1984; Cazettes et al., 2014). In addition, variability of IPD reliability across owls was analyzed and reported in the amended Figure 1, which shows very little variation across owls. During this analysis, we realized that the file of one of the HRTFs obtained from von Campenhausen et al. 2006 was mislabeled, which explains slight differences in the revised Fig 1b. Nevertheless, the added analysis of IPD reliability across owls indicates that the pattern in ITD reliability is stable across owls (Fig. 1d,e), which supports our decision not to record HRTFs from the owls used in this study. Finally, we added text to the discussion clarifying that the use of ILD for azimuth would not provide the same resolution as ITD would (Lines 295-303). We also do not believe that the use of ILD for azimuth would make “ITDs… less necessary at higher frequencies”, given that the ICCls is still computing ITD at these high frequencies (Fig 4), and that ILDs also have higher resolution at higher frequencies, with and without the facial ruff (Olsen et al., 1989; Keller et al., 1998; von Campenhausen et al., 2006).

      1) It is unclear why some analyses (Fig.5, Fig.7) are focused on frontal locations and frontally-tuned neurons. It is also unclear why neurons with a best ITD of 0 are described as frontally tuned, since locations behind the animal also produce an ITD of 0. Related to this, in Fig.1, facial ruff removal appears to reduce IPD variability at low frequencies for locations to the rear (~160 degrees), where the ITD is likely to be close to 0. Neurons with a best ITD of 0 might therefore be expected to adjust their frequency tuning in opposite directions depending on whether they are tuned to frontal or rearward locations.

      An extensive explanation was added to the methods detailing why we do not believe the neurons recorded in this study are tuned to the rear. Namely, studies mapping the barn owl’s ICx and optic tectum have not reported neurons tuned to locations >120°, with the number of neurons representing a given spatial location decreasing with eccentricity (Knudsen, 1982; Knudsen, 1984; Cazettes et al., 2014). While we agree that there does seem to be a change in ITD reliability at ~160° following ruff-removal, the result is largely similar to the change that occurs in frontal space (Fig 1b), which is consistent with the ruff-removed head functioning as a sphere. Thus, we wouldn’t expect rearwardly-tuned neurons, if they could be readily found, to adjust their frequency tuning to higher frequencies. Finally, we want to clarify that we focused our analyses on frontally-tuned neurons because frontal space is where we observed the largest change in ITD reliability. Text was added to the Discussion section to clarify this point (Lines 313-321).

      2) The study suggests that information about high-frequency ITDs is not passed on to the ICX if the ICX does not contain neurons that have a high best frequency. However, neurons might be sensitive to ITDs at frequencies other than the best frequency, particularly if their frequency tuning is broader. It is also unclear whether the best frequency of a neuron always corresponds to the frequency that provides the most reliable ITD information, which the study implicitly assumes.

      The concern about ITD sensitivity at non-preferred frequencies was addressed under the essential revision #3, as well as under Reviewer 1’s concerns.

    1. Author Response

      Reviewer #1 (Public Review):

      This manuscript reports a systematic study of the cortical propagation patterns of human beta bursts (~13-35Hz) generated around simple finger movements (index and middle finger button presses).

      The authors deployed a sophisticated and original methodology to measure the anatomical and dynamical characteristics of the cortical propagation of these transient events. MEG data from another study (visual discrimination task) was repurposed for the present investigation. The data sample is small (8 participants). However, beta bursts were extracted over a +/- 2s time window about each button press, from single trials, yielding the detection and analysis of hundreds of such events of interest. The main finding consists of the demonstration that the cortical activity at the source of movement-related beta bursts follows two main propagation patterns: one along an anteroposterior direction (predominantly originating from precentral motor regions), and the other along a medio-lateral (i.e., dorso-lateral) direction (predominantly originating from postcentral sensory regions). Some differences are reported, post-hoc, in terms of amplitude/cortical spread/propagation velocity between pre- and post-movement beta bursts. Several control tests are conducted to ascertain the veracity of those findings, accounting for expected variations of signal-to-noise ratio across participants and sessions, cortical mesh characteristics and signal leakage expected from MEG source imaging.

      One major perceived weakness is the purely descriptive nature of the reported findings: no meaningful difference was found between bursts traveling along the two different principal modes of propagation, and importantly, no relation with behavior (response time) was found. The same stands for pre vs. post motor bursts, except for the expected finding that post-motor bursts are more frequent and tend to be of greater amplitude (yielding the observation of a so-called beta rebound, on average across trials).

      Overall, and despite substantial methodological explorations and the description of two modes of propagation, the study falls short of advancing our understanding of the functional role of movement related beta bursts.

      For these reasons, the expected impact of the study on the field may be limited. The data is also relatively limited (simple button presses), in terms of behavioral features that could be related to the neurophysiological observations. One missed opportunity to explain the functional role of the distinct propagation patterns reported would have been, for instance, to measure the cortical "destination" of their respective trajectories.

      In response to this comment, we would like to highlight two important points.

      First, our work constitutes the first non-invasive human confirmation of invasive work in animals (Balasubramanian et al., 2020; Best et al., 2016; Roberts et al., 2019; Rubino et al., 2006; Rule et al., 2018; Takahashi et al., 2011, 2015) and patients (Takahashi et al., 2011). Thus, these results bridge between recordings limited to the size of multielectrode arrays (roughly 0.16 cm²; Balasubramanian et al., 2020; Best et al., 2016; Rubino et al., 2006; Takahashi et al., 2011, 2015) and human EEG recordings spanning large areas of the cortex and several functionally distinct regions (Alexander et al., 2016; Stolk et al., 2019). The ability to access these neural signatures non-invasively is important for cross-species comparison. It further enables us to provide an in-depth analysis of the spatiotemporal diversity of human MEG signals and a detailed characterisation of the two propagation directions, which significantly extends previous reports. We note that their functional role also remains undetermined in these animal studies, but being able to identify these signals now in humans can provide a stepping stone for identifying their role.

      Second, and related, the reviewers are correct that we did not observe distinct propagation directions between pre- and post-movement bursts, nor a relationship with reaction time. However, such a null result would be relevant, in our view, towards understanding what the functional relevance of these signals, if any, might be. Recent work in macaques indicates that the spatiotemporal patterns of high-gamma activity carry kinematic information about the upcoming movement (Liang et al., 2023). The functional role of beta may therefore be more complex and not relate to reaction times or kinematics in a straightforward manner. We believe this is a relevant observation, in keeping with the continued efforts to identify how sensorimotor beta relates to behaviour. It is increasingly clear that spatiotemporal diversity in animal recordings and in human E/MEG and intracranial recordings can constitute a substantial proportion of the measured dynamics. As such, our report is relevant in narrowing down what these signals may reflect.

      Together, we think that our work provides new insights into the multidimensional and propagating features of burst activity. This is important for the entire electrophysiology community, as it transforms how we commonly analyse and interpret these important brain signals. We anticipate that our work will guide and inspire future work on the mechanistic underpinnings of these dominant neural signals. We are confident that our article has the scope to reach out to the diverse readership of eLife.

      Reviewer #2 (Public Review):

      The authors devised novel and interesting experiments using high precision human MEG to demonstrate the propagation of beta oscillation events along two axes in the brain. Using careful analysis, they show different properties of beta events pre- and post-movement, including changes in amplitude. Due to beta's prominent role in motor system dynamics, these changes are therefore linked to behavior and offer insights into the mechanisms leading to movement. The linking of wave-like phenomena and transient dynamics in the brain offers new insight into two paradigms about neural dynamics, offering new ways to think about each phenomenon on its own.

      Although there is a substantial, and recent, body of literature supporting the conclusions that beta and other neural oscillations are transient, care must be taken when analyzing the data and the resulting conclusions about beta properties in both time and space. For example, modifying the threshold at which beta events are detected could alter their reported properties and expression in space and time. The authors should therefore perform parameter sweeps on e.g. the thresholds for detection of oscillation bursts to determine whether their conclusions on beta properties and propagation hold. If this additional analysis does not change their story, it would lend confidence in the results/conclusions.

      We thank the reviewing team for this comment. As suggested, we evaluated the effect of different burst thresholds on the burst parameters.

      The threshold in the main analysis was determined empirically from the data, as in previous work (Little et al., 2019). Specifically, trial-wise power was correlated with the burst probability across a range of threshold values (from the median to the median plus seven standard deviations (std), in steps of 0.25 std; see Figure 6-figure supplement 1). The threshold value that yielded the highest correlation between trial-wise power and burst probability was used to binarize the data.
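
      As an illustration only, the threshold selection described above can be sketched in plain Python. This is not the study's MATLAB code: the function names (`pearson`, `select_burst_threshold`), the list-of-trials input layout, the pooling of all samples to obtain the median and std, and the use of the mean envelope as a stand-in for trial-wise power are all assumptions made for the sketch.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation; returns None when either input has zero variance."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def select_burst_threshold(trials, max_k=7.0, step=0.25):
    """Pick the threshold (median + k * std) that maximises the correlation
    between trial-wise power and trial-wise burst probability.

    trials: list of trials, each a list of amplitude-envelope samples
            (hypothetical layout). Returns (k, threshold, r) or None.
    """
    pooled = [v for trial in trials for v in trial]
    med = statistics.median(pooled)
    sd = statistics.pstdev(pooled)
    # Assumption: mean envelope per trial stands in for trial-wise power.
    trial_power = [statistics.fmean(t) for t in trials]

    best = None
    n_steps = int(round(max_k / step))
    for i in range(n_steps + 1):
        k = i * step
        thr = med + k * sd
        # Burst probability: fraction of samples above threshold per trial.
        burst_prob = [sum(v > thr for v in t) / len(t) for t in trials]
        r = pearson(trial_power, burst_prob)
        # Skip thresholds where burst probability is constant across trials.
        if r is not None and (best is None or r > best[2]):
            best = (k, thr, r)
    return best
```

      Calling `select_burst_threshold(trials)` returns the multiplier `k`, the corresponding threshold, and the correlation at that threshold; ties are resolved in favour of the lowest threshold, which is a design choice of this sketch rather than something stated in the text.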

      We repeated our original analysis using four additional thresholds, i.e., original threshold - 0.5 std, -0.25 std, +0.25 std, +0.5 std. As one would expect, burst threshold is negatively related to the number of bursts (i.e., higher thresholds yield fewer bursts, Figure R4a [top]), and positively related to burst amplitude (i.e., higher thresholds yield higher burst amplitudes, Figure R4a [bottom]).
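
      The kind of threshold sweep described above can be illustrated with a minimal, self-contained Python sketch. It is not the analysis code used in the study: it operates on a single one-dimensional envelope rather than the 3D burst data, and the names `detect_bursts` and `sweep_thresholds`, as well as the `base_k` parameter (the original threshold expressed in std units above the median), are hypothetical.

```python
import statistics


def detect_bursts(envelope, threshold):
    """Return (start, stop, peak) for each contiguous run above threshold."""
    bursts, start = [], None
    for i, v in enumerate(envelope):
        if v > threshold and start is None:
            start = i  # a burst begins
        elif v <= threshold and start is not None:
            bursts.append((start, i, max(envelope[start:i])))
            start = None  # the burst ended at sample i - 1
    if start is not None:  # burst still open at the end of the trace
        bursts.append((start, len(envelope), max(envelope[start:])))
    return bursts


def sweep_thresholds(envelope, base_k, offsets=(-0.5, -0.25, 0.0, 0.25, 0.5)):
    """Summarise burst count and mean peak amplitude for thresholds at
    median + (base_k + offset) * std, one entry per offset."""
    med = statistics.median(envelope)
    sd = statistics.pstdev(envelope)
    summary = {}
    for off in offsets:
        thr = med + (base_k + off) * sd
        bursts = detect_bursts(envelope, thr)
        peaks = [b[2] for b in bursts]
        summary[off] = {
            "threshold": thr,
            "n_bursts": len(bursts),
            "mean_peak": statistics.fmean(peaks) if peaks else None,
        }
    return summary
```

      Comparing `n_bursts` and `mean_peak` across the offsets gives a quick check of the expected pattern (fewer bursts and higher amplitudes at higher thresholds). Note that in a single trace the burst count is not guaranteed to decrease monotonically, because raising the threshold can split one long burst into two.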

      Similarly, the temporal duration of bursts and apparent spatial width are modulated by the burst threshold: lowering the threshold leads to longer temporal duration and larger apparent spatial width, while increasing the threshold leads to shorter temporal duration and smaller apparent spatial width (Figure R4b). Note that for the temporal and spectral burst characteristics, the difference from the original threshold can be numerically zero, i.e., changing the burst threshold did not lead to changes exceeding the temporal and spectral resolution of the applied time-frequency transformation (i.e., 200 ms and 1 Hz, respectively).

      Importantly, across these threshold values, the propagation direction and propagation speed remain comparable.

      We now include this result as Figure 6-figure supplement 2 and refer to this analysis in the manuscript (page 28 line 717).

      “To explore the robustness of the results analyses were repeated using a range of thresholds (Figure 6-figure supplement 2).”

      Determining the generators of beta events at different locations is a tricky issue. The authors mentioned a single generator that is responsible for propagating beta along the two axes described. However, it is not clear through what mechanism the beta events could travel along the neural substrate without additional local generators along the way. Previous work on beta events examined how a sequence of synaptic inputs to supra and infragranular layers would contribute to a typical beta event waveform. Although it is possible other mechanisms exist, how might this work as the beta events propagate through space? Some further explanation/investigation on these issues is therefore warranted.

      Based on this and other comments (i.e., comments 7 and 8) we re-evaluated the use of the term ‘generator’ in this manuscript.

      While the term generator can be used across scales, from micro- to macroscale, for the purpose of the present paper we believe one should differentiate at least two concepts: a) generator of beta bursts, and b) generator of travelling waves.

      We realised that in the previous version of the manuscript the term ‘generator’ was at times used without context. We removed the term where no longer necessary.

      Further, the previous version of the manuscript discussed putative generators of travelling waves (page 19f.) but not generators of beta bursts. We now address this as follows:

      “Studies using biophysical modelling have proposed that beta bursts are generated by a broad infragranular excitatory synaptic drive temporally aligned with a strong supragranular synaptic drive (Law et al., 2022; Neymotin et al., 2020; Sherman et al., 2016; Shin et al., 2017) whereby layer specific inhibition acts to stabilise beta bursts in the temporal domain (West et al., 2023). The supragranular drive is thought to originate in the thalamus (E. G. Jones, 1998, 2001; Mo & Sherman, 2019; Seedat et al., 2020), indicating thalamocortical mechanisms.” (page 22f)

      Once the mechanisms have been better understood, a question arises of how much the results generalize to other oscillation frequencies and other brain areas. On the first question of other oscillation frequencies, the authors could easily test whether nearby frequency bands (alpha and low gamma) have similar properties. This would help to determine whether the observations/conclusions are unique to beta, or more generally applicable to transient bursts/waves in the brain. On the second issue of applicability to other brain areas, the authors could relate their work to transient bursts and waves recorded using ECoG and/or iEEG. Some recent work on traveling waves at the brain-wide level would be relevant for such comparisons.

      We appreciate the enthusiasm and the suggestions. To comment on the frequency specificity of the observed effects, we conducted the same analysis focusing on the gamma frequency range (60-90 Hz). For computational reasons, we limited this analysis to one subject. Figure R1 shows the polar probability histogram for the beta frequency range (left) and the gamma frequency range (right). In contrast to the beta frequency range, no dominant directions were observed for the gamma range, and the von Mises function fits did not converge. These preliminary results suggest some frequency specificity of the spatiotemporal pattern in sensorimotor beta activity. We believe this paves the way for future analyses mapping propagation direction across frequency and space.

      Here we did not investigate the spatial specificity of the effects, as the beta frequency range is dominant in sensorimotor areas. Investigating beta bursts in other cortical areas would likely have resulted in very few bursts. We discuss our results across spatial scales in the section: Distinct anatomical propagation axes of sensorimotor beta activity. However, please note that most of the previous literature operates on a different spatial scale (roughly 4 mm; Balasubramanian et al., 2020; Best et al., 2016; Rubino et al., 2006; Rule et al., 2018; Takahashi et al., 2011, 2015) and in different species (e.g., non-human primates). Non-invasive recordings in humans capture temporospatial patterns of a very different scale, i.e., often across the whole cortex (Alexander et al., 2016; Roberts et al., 2019). Comparing spatiotemporal patterns across different spatial scales is inherently difficult. Work investigating different spatial scales simultaneously, such as Sreekumar et al., 2020, is required to fully unpack the relationship between mesoscopic and macroscopic spatiotemporal patterns.

      Figure R1: Spatiotemporal organisation for the beta (β, 13-30 Hz) and gamma (γ, 60-90 Hz) frequency ranges for one exemplar subject. Same as Figure 4a.

      If the source code could be provided on GitHub along with documentation and a standard "notebook" on its use, other researchers would benefit greatly.

      All analyses are performed using freely available tools in MATLAB. The code carrying out the analysis in this paper can be found here: [link provided upon acceptance]. The 3D burst analyses can be very computationally intensive even on a modern computer system. The analyses in this paper were computed on a MacBook Pro with a 2.6 GHz 6-Core Intel Core i7 and 32 GB of RAM. Details on the installation and setup of the dependencies can be found in the README.md file in the main study repository.

      This information has been added to the paper in the methods section on page 35.